Early vision processes, based on human visual system (HVS) performance, provide insufficient information for modeling how we assimilate image sequences (e.g. video). We advance the use of a visual attention paradigm for modeling viewer response over time. An 'importance map' of the scene can be constructed from both spatial and temporal information. Using the importance map to predict typical foci of attention, the image quality of an individual frame can be degraded significantly outside those foci. Knowledge of the whole scene can be built up over many frames by accumulating details represented at low quality in areas that the importance map identifies as warranting less visual attention. We conjecture some limits on the achievable image quality and provide synthesized examples of scenes coded using this model.
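The selective-degradation idea above can be illustrated with a minimal sketch. The paper does not specify a coding mechanism here; the function below, its name, and the `max_step` parameter are all illustrative assumptions. It simply coarsens pixel quantization in regions the importance map marks as low-attention, while leaving high-importance regions at full fidelity.

```python
import numpy as np

def degrade_by_importance(frame, importance, max_step=32):
    """Coarsen quantization where the importance map is low (illustrative sketch).

    frame      -- 2-D uint8 grayscale frame.
    importance -- 2-D float map in [0, 1]; 1 means full visual attention.
    max_step   -- quantization step applied where importance == 0
                  (hypothetical parameter, not from the paper).
    """
    # Per-pixel quantization step: fine (step 1) at high importance,
    # coarse (up to max_step) at low importance.
    step = np.maximum(1, (1.0 - importance) * max_step).astype(np.int32)
    # Quantize to the center of each step-sized bin.
    out = (frame.astype(np.int32) // step) * step + step // 2
    return np.clip(out, 0, 255).astype(np.uint8)
```

With importance fixed at 1 everywhere the frame passes through unchanged; at 0 everywhere, pixel values snap to 32-level bins, mimicking the heavy quality reduction permitted in unattended regions.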