1. Field of the Invention
This invention relates to a method and system of generating a perceptually lossless image stream using perception-based adaptive, progressive, spatio-temporal importance sampling.
2. Description of Background
The design of image display and encoding systems has been significantly influenced by known psychophysical parameters of human visual performance. Several spatial, temporal, and intensity limits of the human visual system have been exploited to increase the efficiency of image encoding and display.
The RGB encoding of color images is a fundamental example of image engineering that reflects human visual performance, in this case the tristimulus nature of visual color coding in the brain. The NTSC color coding scheme and derivatives including CCIR 601 4:2:2 encoding reflect an important performance limit of human vision, namely the decreased contrast sensitivity to isoluminant chrominance information.
In addition to limits of static contrast sensitivity, the limited temporal bandwidth of the human visual system has been exploited in the design of image display and encoding systems. The frame rates of motion picture and television formats have generally been designed to match the lower limits of visual temporal integration.
More advanced coding schemes exploit the dynamic contrast sensitivity characteristics of the human visual system. In human vision the contrast sensitivity to high spatial frequencies decreases with temporal frequency above about 2 Hz. In addition, the contrast sensitivity to low spatial frequencies increases at low temporal frequencies and then decreases above 10 Hz. Contrast sensitivity to all isoluminant color decreases above 2 Hz. These are fundamental spatio-temporal limits of human vision, which are described by the spatio-temporal contrast sensitivity function, or Kelly surface, in Kelly (1985). This paper and other referenced papers are more fully cited at the end of this specification, and are incorporated herein by reference.
Several coding schemes for moving images exploit these dynamic performance limits by coding high spatial frequency luminance and isoluminant color information at a lower temporal frequency than low spatial frequency luminance information. Three-dimensional subband coding directly incorporates this type of compression. Interlace is a common encoding method that effectively transmits a crude form of three-dimensional subband encoding. Other compression schemes such as MUSE, as well as some systems that exploit interframe coherence, indirectly result in an image stream in which low spatial frequencies are displayed more rapidly than high spatial frequencies.
While image compression, encoding, and display technologies have been greatly influenced by visual performance considerations, comparatively little work has been done to exploit the spatio-temporal limits of human vision during the synthesis of computer generated imagery. This is true despite the fact that much more information is available about the visible spatio-temporal structure of the image stream during synthetic image generation than during the encoding of images using conventional film or video cameras.
One of the first methods to directly employ known limits of visual perception during image synthesis was the use of a contrast metric as an adaptive sampling criterion, as described by Mitchell (1987). In the method of adaptive sampling, first suggested by Whitted (1980), an image region is initially sampled with a sparse sampling and the samples are analyzed to estimate the local image information content. Further samples are obtained in subregions wherein the analysis suggests a high image information content. The initially sampled region is thereby subdivided by this adaptive sampling process so that the local sample density is in some proportion to the local image information content. FIGS. 1A through 1G are a progression of images showing successive refinements of a portion of an image bounded by rectangle ABCD using adaptive sampling. In FIG. 1A, initial rays have been traced through the corners and the center. It is determined that the colors at B and E and at C and E are too different (by comparison to some predetermined criterion), so the image is refined by adding a new point G at the center of the upper half of the rectangle, bounded by B, C, and E, as shown in FIG. 1B. Assuming that G and E still are not close, the process is repeated by adding a new point K. When K is found to stabilize the image, the process continues with the lower half of BCE. By performing this successive refinement, the rectangle ABCD can be anti-aliased using the relative weights shown in FIG. 1G.
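The adaptive subdivision process described above can be sketched as follows. The `scene` function, the stopping threshold, and the recursion limit are illustrative assumptions standing in for a real renderer's point-sampling (e.g., a traced ray) and its predetermined criteria; the sketch is not any cited system's implementation.

```python
# Sketch of adaptive sampling: recursively subdivide an image region
# while the sample values at its corners differ by more than a threshold,
# so that sample density tracks local image information content.

def scene(x, y):
    # Hypothetical stand-in for tracing a ray at (x, y): a vertical edge.
    return 1.0 if x > 0.5 else 0.0

def adaptive_sample(x0, y0, x1, y1, threshold=0.1, depth=0, max_depth=4):
    """Return a list of (x, y, value) samples for the region [x0,x1] x [y0,y1]."""
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    values = [scene(x, y) for x, y in corners]
    samples = [(x, y, v) for (x, y), v in zip(corners, values)]
    # Stop when the region is flat enough or the recursion limit is reached.
    if max(values) - min(values) <= threshold or depth >= max_depth:
        return samples
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    for qx0, qy0, qx1, qy1 in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                               (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        samples += adaptive_sample(qx0, qy0, qx1, qy1, threshold,
                                   depth + 1, max_depth)
    return samples

flat = adaptive_sample(0.6, 0.0, 1.0, 1.0)   # uniform region: no refinement
edge = adaptive_sample(0.0, 0.0, 1.0, 1.0)   # region with an edge: refined
```

A uniform region terminates after the initial four corner samples, while a region containing an edge is recursively refined near the edge.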
In early implementations intensity differences and other statistical measures of intensity variation (e.g., variance) were determined in a sampling region and compared to predetermined intensity-based stopping criteria. Mitchell (1987) suggested the use of stopping criteria based on contrast. A contrast criterion is more perception-based than absolute intensity metrics because, for the human visual system, the amount of visible image information is determined more by contrast structure than by absolute intensity variations. Moreover, Mitchell employed different contrast stopping criteria for red contrast, green contrast, and blue contrast, reflecting the differing sensitivity of the human visual system to each of these color contrasts.
This technique of employing a wavelength-based contrast refinement criterion for adaptive sampling was later extended by Bolin and Meyer (ACM SIGGRAPH Proceedings 1995, pp. 409-418) in a method of adaptive sampling in which image information is synthesized directly in the frequency domain. Bolin and Meyer employed the AC1C2 color space, in which A is an achromatic or luminance channel and C1 and C2 are chrominance channels. In this method the refinement criterion differs for the three channels and reflects the visual system's increased sensitivity to luminance contrast as shown in FIG. 2. FIG. 2 illustrates the human spatial contrast sensitivity functions (CSF) for achromatic and isoluminant chrominance contrast. Contrast sensitivity is the inverse of the minimum visible contrast and is plotted in FIG. 2 for various spatial frequencies. As is typical of human spatial contrast sensitivity functions, higher spatial frequencies require greater contrast to be visible. The spatial contrast sensitivity function is a measure of the spatial acuity of the visual system.
The method of Bolin and Meyer further develops the use of a contrast refinement criterion by effectively weighting the contrast by spatial frequency in a way that reflects the actual spatial CSF. In this method a discrete cosine transform (DCT) representing local samples is recomputed for each new sample generated. The DCT expresses the local image structure in terms of discrete spatial frequency components. The discrete cosine transform is convolved with the spatial CSF to compute a running overall contrast metric that is effectively weighted by spatial frequency. Changes in the contrast metric with new samples are used as criteria for further sampling. The motivation of this method is to produce image information that can be used directly by image compression schemes. Unfortunately the method is extremely expensive, since it requires recomputation of the DCT for each sample. Its computational cost makes it unsuitable for application to real-time image generation.
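A minimal sketch of such a CSF-weighted frequency-domain contrast metric follows. The Mannos-Sakrison-style CSF model, the mapping of DCT index to cycles/degree, and the one-dimensional simplification are all illustrative assumptions; they show the spirit of the approach, not the Bolin and Meyer implementation.

```python
import math

def dct_1d(samples):
    """Unnormalized 1-D DCT-II of a list of sample intensities."""
    n = len(samples)
    return [sum(s * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, s in enumerate(samples)) for k in range(n)]

def csf(f):
    """Approximate achromatic contrast sensitivity at f cycles/degree
    (Mannos-Sakrison form; an assumed model)."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

def weighted_contrast(samples, degrees_spanned=0.5):
    """Contrast metric: each frequency component weighted by the CSF."""
    coeffs = dct_1d(samples)
    metric = 0.0
    for k in range(1, len(coeffs)):        # skip the DC term
        f = k / (2.0 * degrees_spanned)    # assumed index -> cycles/degree
        metric += abs(coeffs[k]) * csf(f)
    return metric
```

A flat sample set yields an essentially zero metric, while high-frequency structure yields a large one, triggering further refinement.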
In fact, to date the general technique of adaptive sampling has not been applied to real-time image generation in any significant way. The adaptive sampling approach is not easily applied to conventional real-time graphics systems. These graphics systems are generally implemented as object-order transformation-rasterization pipelines in which the rasterization units have write-only access to a fixed, hardwired subset of samples in the image buffer. In a typical implementation the set of samples accessed by a rasterization unit may be widely dispersed throughout the image to ensure load balancing of the rasterizers. This mapping of rasterizer to image does not allow adaptive sampling because the rasterizer is generally unable to read surrounding samples. In addition, typical graphics systems implement rasterizers as ASICs which perform relatively simple rasterization algorithms based on linear incremental interpolation such as Pineda arithmetic. These simple rasterization units generally cannot be programmed to perform the more complex tasks required by an adaptive sampling scheme. For these reasons adaptive sampling has heretofore been employed exclusively in non-real-time image generation methods that use a strictly image-order point sampling technique such as ray tracing or ray casting.
In addition to the spatial contrast limits of human vision, the temporal responsiveness of retinal and higher processing elements imposes significant limits on the performance of the human visual system. Image elements that are presented to the visual system are not fully resolved by the system for a time period that approaches 1000 ms. FIG. 3 (from Harwerth (1980)) shows the contrast sensitivity plotted against exposure time for various spatial frequencies for two experimental subjects. From the figure it can be seen that low spatial frequencies (e.g., 0.50 cycles/degree) are fully resolved after 100 ms of exposure whereas higher spatial frequencies (e.g., 12 cycles/degree) are not fully resolved for exposures less than one second. A more recent study by Luntinen et al. also shows the pronounced effect of exposure time on contrast sensitivity, particularly for high spatial frequencies, for exposures up to 1000 ms. FIG. 13 shows four graphs (A through D) in which contrast sensitivity is plotted as a function of exposure time from 0 to 10 seconds. In each graph the relationship is plotted for different total exposure areas of the viewed surface (a grating pattern). Each graph represents the relationship for one of four spatial frequencies as indicated. This study shows that contrast sensitivity is decreased by low exposure time and low exposure area. Note that the increase in contrast sensitivity with increasing spatial frequency in this case occurs because the measured low frequencies are below the "peak" spatial frequency of the spatial contrast sensitivity curve under the employed experimental conditions.
During real-time image synthesis it is possible to determine the exposure duration of image elements (e.g., graphics primitives) in the image stream. Using this information, primitives with a low exposure time could be rendered at a decreased resolution without producing visible artifacts. Existing real-time image generation systems do not exploit this fundamental spatio-temporal limit of human vision to reduce the cost of producing a perceptually lossless image stream.
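Such an exposure-time policy might be sketched as follows. The linear ramp, the 100 ms and 1000 ms breakpoints (taken from the qualitative shape of the Harwerth data cited above), and the minimum scale of 0.25 are all assumed parameters, not values from any cited system.

```python
# Illustrative sketch: choose a rendering resolution scale for a
# primitive from its on-screen exposure time. Low spatial frequencies
# are resolved within ~100 ms, the highest only after ~1000 ms, so a
# newly exposed primitive can be rendered coarsely and refined over time.

def resolution_scale(exposure_ms, t_low=100.0, t_full=1000.0):
    """Fraction of full resolution warranted by the exposure time."""
    if exposure_ms <= t_low:
        return 0.25          # newly exposed: coarse sampling suffices
    if exposure_ms >= t_full:
        return 1.0           # fully resolved: full resolution needed
    return 0.25 + 0.75 * (exposure_ms - t_low) / (t_full - t_low)
```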
A related spatio-temporal limit of the human visual system is the dynamic visual acuity (see Brown (1972b)), which expresses the acuity of the visual system for moving objects. Image elements that undergo optical flow on the retinal receptor surface produce transient stimulation of retinal and higher processing elements that may not meet the integration time required to completely resolve the moving image elements. As a result, objects with a high retinal image velocity are poorly resolved by the visual system. FIG. 4 (from Kelly (1985)) shows the spatial contrast sensitivity function for targets (e.g., sinusoidal gratings) moving at various retinal velocities. This figure illustrates that as retinal velocity increases the sensitivity to high spatial frequencies is lost first while the sensitivity to low spatial frequencies is relatively preserved. (At very low retinal velocities (e.g., 0.012 degrees/sec) the sensitivity to low spatial frequencies is actually increased while the sensitivity to high spatial frequencies is still decreased by retinal motion.) Eye movements tend to reduce the retinal image velocity of tracked image elements through the oculomotor strategies of pursuit and fixation. Nevertheless, for objects with high image-space velocity, high image-space acceleration, or unpredictable image-space motion, oculomotor tracking is not completely accurate and results in retinal image motion that decreases the resolvability of the moving elements. Relationships describing the efficacy of oculomotor pursuit, such as that of FIG. 5 (from Lisberger (1987)), which relates retinal image velocity to image-space acceleration and predictability, are known and can be used to estimate retinal velocity based on the image-space velocity and/or acceleration of the tracked elements.
Because the image-space velocity of individual graphics elements (e.g., primitives or objects) in an image stream can be determined during image generation, the resolution of such elements could, in principle, be reduced without incurring a loss of visible information. Unfortunately, no existing image generation system employs this method of selecting the rendering resolution of a graphics element (e.g., a primitive) to reflect the image-space velocity of the element in a manner that reflects known limits of dynamic visual acuity. Even a real-time image generation system such as that of CAE Electronics Ltd. of Saint-Laurent, Quebec, Canada, which employs an eye tracker that would allow the direct determination of the retinal image velocity of elements, does not exploit this temporal limit of human vision to reduce the cost of image generation.
A third spatio-temporal limit of human vision, closely related to the first two, is expressed by the critical temporal sampling frequency (CTSF) for the perception of smooth motion. Successive stimulation of non-contiguous retinal elements results in a perceptual temporal aliasing phenomenon that produces an appearance of staggered, unsmooth motion. This type of discontinuous stimulation of the retina can be produced by image elements with a high image-plane or retinal velocity that are displayed at a relatively low frame rate. The temporal sampling frequency at which image elements must be displayed to be perceived as having smooth motion is called the CTSF for smooth motion and is a function of the retinal velocity of the image element. The CTSF for any image element is given by the equation:

CTSF = f_min + k_max * r
where CTSF is the critical temporal sampling frequency for the perception of continuous motion, f_min is the lower bound on temporal sensitivity, k_max is the upper bound on spatial sensitivity, and r is the (retinal) velocity of the continuous motion being sampled (see Watson et al. (1986) and Watson et al., NASA Tech Paper 2211). This relationship expresses the fact that the CTSF for an image element is a linear function of retinal velocity, with an intercept determined by the lower bound on temporal sensitivity and a slope determined by the upper bound on spatial sensitivity k_max. Using forced-choice experiments, Watson found that the CTSF was a linear function of velocity with intercepts around 30 Hz and slopes between 6 and 13 cycles/degree for two subjects. At higher stimulus contrasts the maximum CTSF exceeded 80 Hz.
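The CTSF relation reduces to a one-line helper. The default intercept of 30 Hz and slope of 10 cycles/degree are drawn from the ranges Watson reported; the particular defaults are illustrative choices within those ranges.

```python
# CTSF = f_min + k_max * r: the display rate (Hz) needed for an image
# element moving at retinal velocity r (degrees/second) to appear smooth.

def ctsf(retinal_velocity_deg_per_s, f_min_hz=30.0, k_max_cyc_per_deg=10.0):
    """Critical temporal sampling frequency (Hz) for smooth motion."""
    return f_min_hz + k_max_cyc_per_deg * retinal_velocity_deg_per_s
```

With these defaults a stationary element requires about 30 Hz, while an element moving at 5 degrees/second requires 80 Hz, consistent with the maximum CTSF values exceeding 80 Hz noted above.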
Watson's work also confirmed the earlier work of Kelly in showing a decrease in spatial acuity as a function of retinal velocity. The limit of dynamic visual acuity together with the CTSF relationship describes a simplified "window of visibility" function that describes the spatio-temporal performance window of human vision. This window of visibility is based on an approximation to the relationship between spatial acuity and retinal velocity given by the equation:

k = k_max / v_r
where v_r is the retinal velocity, k is the reduced spatial acuity, and k_max is the maximum spatial frequency limit of the visual system as determined by the spatial contrast sensitivity function. This function is shown in FIG. 6. This figure is the "window of visibility," which describes the region in spatio-temporal frequency space that represents the stimulus energies detectable by the human visual system. In this figure the spatio-temporal spectrum of an image element translating at velocity v has energy located on a line with slope v in the two-dimensional spatio-temporal frequency space. For velocities less than a "corner" velocity, v_c, high spatial acuity is maintained. For velocities greater than v_c, spatial acuity decreases with the inverse of the velocity. The corner velocity v_c is approximately 2 degrees/second based on Kelly's data (Journal of the Optical Society of America 69(10), 1979: 1340-1349). This simplified function conceptually illustrates the spatio-temporal sensitivity of human vision that is more completely described by the Kelly surface of FIG. 4.
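The two-branch window-of-visibility acuity limit can be sketched as follows. The normalization by the corner velocity in the second branch is an added assumption that makes the branches meet continuously at v_c; the k_max and v_c defaults are illustrative values consistent with the figures quoted above.

```python
# Sketch of the simplified "window of visibility": spatial acuity is
# constant (k_max) below the corner velocity v_c (~2 deg/s in Kelly's
# data) and falls off inversely with retinal velocity above it.

def acuity_limit(v_retinal, k_max=50.0, v_corner=2.0):
    """Maximum resolvable spatial frequency (cycles/degree) for an
    element moving at v_retinal degrees/second."""
    if v_retinal <= v_corner:
        return k_max
    return k_max * v_corner / v_retinal
```

A renderer could use such a limit to cap the spatial sampling rate of a fast-moving element while the CTSF relation sets its temporal sampling rate.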
Based on the dynamic visual acuity and the CTSF relationships, it is clear that the perceptually optimal approach to image generation in the case of rapidly moving image elements (e.g., high speed viewpoint motion) would be to maintain (or even increase) frame rate while decreasing resolution. Unfortunately, no existing image generation systems monitor the velocity of image elements to adjust the spatial and temporal sampling frequencies of these elements in a manner that reflects the spatio-temporal limits of the human visual system. In fact, in the case of high image element velocity, essentially every existing image generation system degrades performance in the worst possible way, by preserving resolution while allowing frame rate to decrease with system load. While some graphics systems (e.g., the SGI RealityEngine) are able to decrease overall image resolution during high system loads, these systems do not allow individual elements to be rendered with a resolution and temporal sampling frequency that reflects their dynamic resolvability. Even graphics systems such as Microsoft's proposed Talisman architecture (see Torborg et al. (1996)), in which individual objects may be rendered at different resolutions and update rates, do not at present employ these fundamental limits of human vision to select a resolution and frame rate that is perceptually optimal for the object.
Because of their object-order organization and limited support for variable resolution rendering, existing graphics architectures typically degrade performance under high system load primarily with a decrease in frame rate. When this frame rate falls below the critical temporal sampling frequency for the perception of smooth motion, a perceptually catastrophic temporal fracture is created in the image stream. Recent studies emphasize that frame rate is more important than resolution for maintaining the illusion of immersion and enhancing performance within virtual environment systems (see Smets et al. (1995)).
One sampling approach that could, in principle, be applied to allow frame rate to be maintained at the expense of resolution is the method of refinement sampling. Fuchs first described the refinement sampling approach and applied it to a system of rendering in which the resolution of the image is increased over time by resampling at progressively higher sample densities. The refinement is achieved by a progressive stratification used in conjunction with the previously described adaptive sampling to achieve an adaptive progressive refinement of the image in which resolution is increased over time and local spatial sampling density reflects local image information content. Fuchs noted that the refinement method was best suited for implementation with visible surface determination methods that compute image samples independently, such as ray tracing. Like the closely related adaptive sampling approach, progressive refinement sampling requires a more complex approach to sampling than is achieved by typical rasterization schemes used by existing graphics pipelines. A multipass approach (rasterizing every primitive more than once) would be required to increase the resolution of an image region over the course of a single frame interval. This would require retaining the image space representations of all primitives in a sampled region for the frame such that the rasterizing processor could repeatedly access each primitive. Existing image generation systems are typically based on a feed forward transformation-rasterization pipeline that does not, in general, allow the rasterizers to directly access primitive data in this way.
A few image transmission systems have employed the progressive refinement approach for non-real-time transmission of images. To date, however, progressive image refinement has not been applied to real-time image generation.
Bishop (1994) suggested an extreme form of progressive refinement for image synthesis in which the concept of a temporally coherent frame is entirely abandoned. In this method a randomized subset of samples in the image is rendered and then immediately updated. The newly computed samples are combined with earlier samples during image reconstruction. This approach to image generation maintains the frame rate of the image stream while sacrificing the spatio-temporal coherence of single images. A simulation of the method showed that for the same computational expense the frameless rendering method produces a more fluid, less jerky image stream than conventional double buffered rendering. However, the loss of temporal coherence resulted in a crude type of motion blur and disruption of edge integrity. The authors acknowledged that the randomized sampling required by the method makes it unsuitable for z-buffer-based renderers that rely upon object-order rasterization and incremental image-space rasterization algorithms. The renderer used in this study by Bishop, as well as in virtually all progressive refinement image synthesis, is based upon ray tracing. Largely because of the cost associated with the ray tracing implementation, this rendering method did not work in real time. Although not stated by the designers of this system, the approach is well suited to rendering image streams in which the image (retinal) velocity of image elements is high. In such cases the requisite CTSF could be maintained while the blurring and disruption of edge integrity would not be apparent because of the reduced dynamic visual acuity. On the other hand, for situations of low image-space velocity, visual acuity is increased and the edge disruption and blurring are very visible.
Although existing image generation systems are generally unable to dynamically control image resolution, particularly on an object or primitive basis, these systems do employ other measures to maintain a desired frame rate by decreasing the amount of displayed image information. Several systems dynamically monitor frame rate and decrease the geometric detail of objects fed to the rendering pipeline as frame rate decreases below a minimum value. This approach requires that database objects have multiple representations, each with a different degree of geometric or surface detail. Low level-of-detail representations are fed to the rendering pipeline when an increased frame rate is required. The method can substantially accelerate the object-order geometry engine phase of rendering but has less effect on the rasterization phase. This is because rasterization speed is more dependent on the number of pixels covered than the number of primitives rasterized. Low level-of-detail representations generally cover the same image area as the corresponding high level-of-detail object.
Sudden changes in the level of detail at which an object is represented can result in perceptually objectionable discontinuities in the image stream. This can be mitigated somewhat by rendering both high and low level-of-detail representations and blending the resulting images (1995 ACM SIGGRAPH Course on Interactive Display of Large Databases, Lecture B, Graphics Techniques for Walkthrough Applications). This requires additional computation precisely when the system is attempting to reduce the computational cost of rendering an image frame. A more efficient approach would be to directly create a composite rendering of the transitional object by a weighted importance sampling of two or more LOD representations of the object. Once again, conventional object-order rasterization pipelines do not support this type of distributed sampling.
Level of detail management is used in another commonly employed method of accelerating image generation. In this method a reduced level-of-detail version of an object is rendered when the object occupies a small area of the display viewport. This acceleration has a perceptual basis in the fact that objects at a distance will project to relatively small areas of the image plane and therefore produce less image information. Existing image generation systems generally select a single level of detail for an entire object. This approach requires a preprocessing step for each object (e.g., determination of depth of the object's bounding box) and generally neglects the fact that an object may span a large range of depths relative to the eyepoint. The walls of a building or hull of a large ship are individual objects that may be simultaneously near and far from a single viewpoint. One solution to this problem is to subdivide a large object into smaller objects. Unfortunately this increases the number of objects on which the depth test or area projection test must be made.
A more effective approach would allow the selection of a specific level-of-detail representation to be made for each image sample rather than globally for each object. This approach would allow blending of multiple LOD representations of an object at the sample level. This approach would require that minimum depth information be available for each image sample before the sample is actually rendered. Current image generation architectures are based on an object-order, depth-comparison rasterization pipeline that does not, in general, allow the minimum depth of a sample to be determined before generation of the sample by rasterization.
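The per-sample selection discussed above can be sketched as a simple depth-to-LOD mapping: each doubling of distance roughly halves the projected size of a feature, so the LOD index can grow with the base-2 logarithm of depth. The base distance and level count are hypothetical parameters; the point of the sketch is that the decision is made per sample, so a single object spanning near and far depths naturally blends levels.

```python
import math

# Illustrative per-sample level-of-detail selection from sample depth.
# LOD 0 is the most detailed representation.

def lod_for_depth(depth, base_distance=10.0, num_levels=4):
    """Return an LOD index for an image sample at the given depth."""
    if depth <= base_distance:
        return 0
    level = int(math.log2(depth / base_distance))
    return min(level, num_levels - 1)
```

A wall receding from the viewpoint would thus be sampled from LOD 0 near the eyepoint and progressively coarser representations at depth, without subdividing the object.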
Another factor affecting the temporal responsiveness of the human visual system is adaptation. Individual retinal photoreceptors have a response range that is less than two orders of magnitude. Nevertheless the visual system can operate over an absolute dynamic luminance range exceeding 13 orders of magnitude. This dynamic range is achieved, despite a relatively limited unit response range, through a powerful automatic gain function called adaptation. Through adaptation the retinal photoreceptors adjust to the average surrounding luminance. Adaptation does not occur instantly. Adaptation to higher luminances occurs more rapidly than adaptation to lower luminances. Dark adaptation has a relatively rapid phase in which adaptation to a luminance two orders of magnitude lower occurs with a half-life of approximately 100 ms. A slower phase, corresponding to photoreceptor pigment bleaching, requires several minutes to completely adapt to the lowest luminance levels. In a recent paper Ferwerda et al. (1996) teach a model of visual adaptation which was applied to image synthesis to simulate the effect of adaptation on acuity, color appearance, and threshold detection. In this work an adaptation model was applied to non-real-time image synthesis. This model assumes that the level of adaptation is uniform throughout the visual field and tied to some metric of global scene illumination (e.g., one half the highest visible luminance). In fact, the level of adaptation can vary considerably through the visual field because of regional variation in luminance of the retinal image. Changing patterns of luminance on the retinal image can result in transient mismatches between the luminance of the retinal image and the state of adaptation of the corresponding retinal elements. These transient mismatches can significantly decrease the visible information content of an image stream with high luminance range and relatively high retinal image velocity components.
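The rapid phase of adaptation can be sketched as a first-order exponential approach to the local luminance, using the roughly 100 ms dark-adaptation half-life quoted above. The faster time constant for adaptation toward brighter light is an assumed ratio chosen only to reflect the stated asymmetry, and the slow pigment-bleaching phase is ignored.

```python
import math

# Sketch of the rapid phase of luminance adaptation: the adaptation
# level decays exponentially toward the local luminance, with dark
# adaptation given a ~100 ms half-life and light adaptation an assumed
# faster one.

def adapt(level, luminance, dt_ms, half_life_ms=100.0):
    """Advance the adaptation level by dt_ms toward `luminance`."""
    if luminance > level:
        half_life_ms /= 4.0   # assumed: adaptation to brighter light is faster
    k = math.log(2.0) / half_life_ms
    return luminance + (level - luminance) * math.exp(-k * dt_ms)
```

A per-region tracker of this kind could flag retinal areas whose adaptation level is transiently mismatched to the image luminance, where rendered detail would be invisible.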
Methods of real-time image generation that employ eye tracking could, in principle, track the luminance exposure history for various regions of the retina and modulate the information content (resolution and level-of-detail) of the synthesized image to reflect the local balance between luminance and adaptation. Existing eye-tracked image generation systems do not identify unadapted retinal elements to reduce the computational cost of image generation in this way.
In addition to the spatial, temporal, and spatio-temporal limits of human vision already discussed, the performance of the human visual system is severely restricted by the highly non-uniform distribution of photoreceptors in the retina. This distribution of photoreceptors, and the corresponding retinotopic distribution of later processing elements in the visual cortex, results in very limited spatial acuity outside the very center of the visual field. The maximum spatial acuity is confined to a relatively small region in the center of the visual field that corresponds to a retinal region, called the fovea, that has a high density of photoreceptors.
The spatial acuity in the visual field falls to less than 10% of central acuity at 5 degrees retinal eccentricity (angular distance from the fovea) and continues to decrease with increasing eccentricity. The relationship between visual acuity and retinal eccentricity is called the acuity falloff function and is shown in FIG. 7. Because the amount of visible image information is related to the square of the acuity, the image resolution required to match the acuity falloff function falls very steeply with retinal eccentricity. The average resolution of an image matched to the acuity falloff function would be 1/300th of the resolution at the visual center of interest.
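The acuity falloff function is often modeled as a hyperbolic decline with eccentricity, which can be sketched as follows. The hyperbolic form and the half-acuity eccentricity `e2` are illustrative assumptions; `e2` below is chosen simply so that acuity at 5 degrees is about 10% of central acuity, matching the figure quoted above.

```python
# Sketch of the acuity falloff function: relative acuity as a function
# of retinal eccentricity, and the correspondingly required relative
# resolution (which scales with the square of acuity).

def relative_acuity(eccentricity_deg, e2=0.55):
    """Acuity relative to the fovea at a given eccentricity (degrees)."""
    return 1.0 / (1.0 + eccentricity_deg / e2)

def required_resolution(eccentricity_deg):
    """Relative image resolution needed to match the acuity falloff."""
    return relative_acuity(eccentricity_deg) ** 2
```

Because the required resolution falls with the square of acuity, matching this function over a wide field yields the large average savings (on the order of 1/300th of foveal resolution) noted above.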
A variety of eye tracking technologies are available to determine, in real time, the view direction or point-of-regard of a single viewer. The point-of-regard is the location on an image plane that is in the center of the observer's visual field. This part of the visual field is processed by the observer's foveal receptor elements and can be seen with high acuity. Technologies for tracking the point-of-regard include video analysis of the corneal reflection (e.g., ISCAN Inc., Cambridge, Mass.), optical analysis of the corneal reflection (e.g., U.S. Pat. No. 5,270,748), or the reflection of the iris-scleral boundary (Reulen et al. (1988)).
Systems which generate images that are viewed instantaneously by a single viewer can, in principle, use the viewer's point-of-regard to significantly reduce the amount of computation required to generate a perceptually correct instantaneous image. Despite the availability of noninvasive real-time eye tracking equipment, very few real-time generation systems have been developed which track the viewer's visual point-of-regard and use it to reduce the computational cost of instantaneous image generation.
Early work on display systems designed to match acuity falloff was motivated by a need to generate very wide field-of-view images using limited image generation hardware. These early designs incorporate non-linear optical elements (McDonnell Aircraft; Fisher, R. W., Society for Information Display International Symposium Digest of Technical Papers, pp. 144-145, 1982) or holographic elements (Hughes Aircraft, U.S. Pat. No. 5,071,209) to project a nonlinear image from a conventional display device (CRT or light valve) onto a wide field-of-view screen. In these projective methods the non-linear projection of a nonlinear image results in a perspective image in which the resolution decreases as a function of the angular distance from the viewer's point of regard.
Designs based on non-linear image generation and projection present very serious technological challenges. To date no working systems based on this design have been produced. In these systems the optical or projective elements must be moved with velocities and accelerations that match those of human eye movements, which achieve angular velocities of several hundred degrees per second and angular accelerations in excess of 1000 degrees/second². Perhaps overshadowing the optical-mechanical problems of such projective systems is the computational complexity of generating the requisite nonlinear images in real time. These systems require the generation of images in which more peripheral pixels represent increasingly large areas of the image. An example of such an image of a rectangular grid is shown in FIG. 8, from Fisher. The synthesis of such images involves non-affine projections which fail to preserve the linearity of the edges of graphics primitives when projected onto the image plane. Conventional hardware rasterization units are incapable of directly rendering such non-linear primitives because they employ linear incremental algorithms (e.g. Pineda arithmetic) that depend on the linearity of primitive edges as well as the linearity of image-space gradients on the interior of the primitive.
An alternative design for area-of-interest systems avoids nonlinear projective elements and employs conventional image generation methods. In this type of design a conventionally generated low resolution image is displayed together with a higher resolution image insert that is positioned at the viewer's point-of-regard. At least one system based on this design is currently commercially available (CAE Electronics Ltd.). This system employs two independent image generation systems: one for the low resolution background image and one for the high resolution insert at the area-of-interest. The advantage of this design is that conventional image generation hardware can be employed. A principal disadvantage is that the design attempts to approximate the acuity falloff function with a simple step function. The exact shape of the step function will determine the degree of computational savings gained and will affect the perceptual accuracy of the image. Because the acuity falloff function is poorly approximated by a single step function, the computational savings that results from this method will be significantly less than the theoretical maximum savings.
In fact, for a circular display subtending 100 degrees of visual angle, the maximum computational savings that can be obtained using a single step function occurs when the high resolution inset is a circular window subtending approximately 20 degrees. In this case the area-of-interest is a circle with radius 20 degrees that is computed at full resolution and the outer circular region of the image is computed with a resolution of approximately 3% of the area-of-interest circle. This stepwise approximation to the acuity falloff function is shown in FIG. 10. The average computed resolution over the entire image in this case is only 6% of maximum resolution, representing a speedup factor of approximately 16.3. By comparison, the average image resolution obtained by employing a continuous approximation accurate to within 0.5 degrees of the acuity falloff function is only 0.5%, representing a potential speedup factor of approximately 200. Note that the estimate for the step function case assumes that the low resolution area covered by the inset is rendered but not displayed. This is the approach used by the CAE system.
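One plausible reading of the savings arithmetic above can be sketched as follows. The helper below is a hypothetical calculation, not a description of the CAE system; in particular it treats the 20-degree inset as a diameter (a 10-degree-radius window on a 50-degree-radius display), which yields an average resolution of the same order as the 6% figure above. The exact values depend on the assumed geometry.

```python
def mean_resolution_step(inset_radius_deg, display_radius_deg, surround_res,
                         render_under_inset=True):
    """Average relative resolution for a single-step (inset + surround)
    approximation to the acuity falloff function.  If render_under_inset
    is True, the low resolution surround is rendered across the whole
    display (including under the inset) but not displayed there, as in
    the design described above."""
    inset_frac = (inset_radius_deg / display_radius_deg) ** 2
    surround_frac = 1.0 if render_under_inset else 1.0 - inset_frac
    return inset_frac * 1.0 + surround_frac * surround_res

avg = mean_resolution_step(10.0, 50.0, 0.03)
print(f"average resolution {avg:.3f}, speedup {1.0 / avg:.1f}x")
```

With these assumed parameters the average resolution is 0.07 (a speedup of roughly 14x), the same order of magnitude as the 6%/16.3x figures cited above.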
Additional savings can be obtained using a step function that falls below the acuity falloff curve, but this would result in a visible discontinuity at the edge of the high resolution insert. Several rasterization pipelines with different depth buffer resolutions could potentially be employed to create a multistep approximation to the acuity falloff function. In the simplest implementation each z-buffer rasterizer would generate a rectangular image at a specific resolution and corresponding location. The resulting images could be composited so that progressively higher resolution images are exposed as concentric inserts. This approach would provide a better approximation to the acuity falloff function than a single step method. However, the improved approximation does not produce a corresponding reduction in the computational cost of rasterization, because this architecture incurs considerable redundant rasterization as a result of the overlap of the various regions.
An image generation method capable of producing a more continuous resolution gradient could more closely approximate the acuity falloff function, resulting in increased computational efficiency without the perceptual discontinuities that result from a step function approximation.
Virtually all general-purpose, real-time image generation systems employ a transformation stage in which the control points, or vertices, of graphics primitives undergo planar perspective projection onto an image plane. In typical implementations this is followed by a second process, usually called rasterization, in which the samples or pixels corresponding to the projection of the primitive are rendered. The result of such a system is to effect a planar perspective projection of the graphics database onto a viewport. Unfortunately, because the retina is not flat, the neural processing elements of the brain that subserve vision have evolved to analyze a more nearly spherical, not planar, perspective projection of the actual world. Consequently, existing systems that employ planar projection as part of the image generation process produce a perceptual distortion that is caused by differences between the image generation system's planar perspective projection and the approximately spherical perspective projection of the visual system's image analysis manifold, the retina.
For view angles that are not too wide (e.g. less than 50 degrees field-of-view) the differences between planar perspective projection and spherical perspective projection are small for single images. However, even at relatively small view angles, planar perspective projection can produce a pattern of optical flow that is noticeably different from the optical flow field produced by the corresponding spherical perspective projection. This is most easily seen for the case of rotation of the view direction vector about one axis centered at the viewpoint. A continuous rotation, at constant angular velocity, of the view direction vector about the y (up) axis, centered at the viewpoint, produces a look-around or pan pattern of optical flow on the projection surface. In the case of spherical perspective projection, this panning motion produces an optical flow field in which all elements have a constant velocity of optical flow throughout the projection manifold. In contrast, under planar perspective projection the same panning motion produces an optical flow field in which elements at a greater angular eccentricity from the view direction vector have a higher image-space velocity than samples near the center of the projection. This produces a familiar "stretching" near the edge of the viewport and "squashing" near the center of the viewport that becomes extreme for larger fields-of-view. Because the visual system is exquisitely sensitive to optical flow patterns, this distortion is quite apparent even at moderate view angles and tends to disrupt the sense of immersion or virtual presence that is the goal of high definition real-time graphics systems.
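The difference between the two flow fields under a constant-velocity pan can be made concrete. In the sketch below, a planar image plane at focal distance f maps eccentricity θ to x = f·tan(θ), so the image-space flow speed grows as 1/cos²(θ), while an angular (spherical) mapping x = f·θ gives a constant flow speed; the particular values of f and the pan rate ω are arbitrary choices for illustration.

```python
import math

def planar_flow_speed(theta_rad, omega=1.0, f=1.0):
    """Image-space speed on a planar projection surface during a pan at
    angular velocity omega: x = f*tan(theta), so dx/dt = f*omega/cos^2(theta),
    which grows with eccentricity theta."""
    return f * omega / math.cos(theta_rad) ** 2

def spherical_flow_speed(theta_rad, omega=1.0, f=1.0):
    """On a spherical projection manifold x = f*theta, so the flow speed
    is f*omega everywhere, independent of eccentricity."""
    return f * omega

# Flow speed at 0, 20, and 40 degrees from the view direction vector:
for deg in (0, 20, 40):
    t = math.radians(deg)
    print(deg, round(planar_flow_speed(t), 3), spherical_flow_speed(t))
```

The planar flow speed rises from 1.0 at the center to about 1.7 at 40 degrees eccentricity, while the spherical flow speed stays at 1.0 throughout, matching the "stretching" at the viewport edge described above.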
Despite the rather non-physiologic character of planar perspective projection, its use greatly simplifies the process of rasterization, which is central to existing graphics pipelines. As a result, it is widely employed in image generation systems based on rasterization. Planar perspective projection of vertices preserves the image-space linearity of the edges of polygonal graphics primitives. Because the object-space to image-space mapping effected by planar perspective projection preserves the linearity of these edges, the image-space boundaries of the primitive can be established by linear incremental methods. Once the boundaries of a primitive are established, for example its extent in a single scan line of the image, the interior of the primitive is rendered by interpolating object-space values, such as depth, surface normal, and texture parameters, across the image-space extent of the primitive. For primitives projected onto the image plane by planar perspective projection, these linear image-space segments correspond to linear segments on the surface of the primitive in object-space and so can be interpolated with linear incremental methods.
For spherical projection, on the other hand, the image-space manifestation of a polygonal primitive's edges is not linear. As a result the image-space extent of a primitive under spherical perspective projection cannot be determined from the projection of the primitive's vertices alone. Moreover, a linear span of pixels or samples within a primitive on a spherical projection manifold corresponds to a curved set of samples on the surface of the primitive in object-space. This non-linear image-space to object-space mapping cannot be computed by a linear incremental interpolation and so significantly complicates the process of rasterization.
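The loss of edge linearity under spherical projection can be demonstrated directly. The sketch below projects a straight object-space segment under a simple planar perspective projection and under a hypothetical azimuth/elevation (spherical) mapping; the projected midpoint stays on the chord between the projected endpoints in the planar case but deviates from it in the spherical case. The specific segment coordinates are arbitrary.

```python
import math

def planar_project(p):
    """Planar perspective projection onto the z = 1 image plane."""
    x, y, z = p
    return (x / z, y / z)

def spherical_project(p):
    """Spherical (angular) projection: azimuth and elevation of the ray
    from the viewpoint through p."""
    x, y, z = p
    return (math.atan2(x, z), math.atan2(y, math.hypot(x, z)))

def lerp(a, b, t):
    """Linear interpolation between two 3-D points."""
    return tuple(a[i] + t * (b[i] - a[i]) for i in range(3))

# A straight edge in object-space, from (-1, 1, 2) to (1, 1, 2).
a, b = (-1.0, 1.0, 2.0), (1.0, 1.0, 2.0)
mid = lerp(a, b, 0.5)

# Planar projection keeps the edge straight: the projected midpoint lies
# exactly on the chord joining the projected endpoints.
pa, pb, pm = planar_project(a), planar_project(b), planar_project(mid)
print(abs(pm[1] - 0.5 * (pa[1] + pb[1])))   # 0.0: collinear

# Spherical projection bends the edge: the projected midpoint deviates
# measurably from the chord between the projected endpoints.
sa, sb, sm = spherical_project(a), spherical_project(b), spherical_project(mid)
print(abs(sm[1] - 0.5 * (sa[1] + sb[1])))   # nonzero: curved
```

This is why the image-space extent of a primitive cannot be recovered from its projected vertices alone under spherical projection, as noted above.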
Optical or holographic elements can be employed to convert an image generated by planar perspective projection into a spherical projection. This approach is not applicable to the typical case of images generated for display on a CRT or flat panel display and would require a light-valve or other projective technology. Image processing can be applied after rendering to convert the planar projection into a spherical projection, but this approach introduces an additional step into the image-generation process and does not produce the same result as directly computing the spherical projection, because the density of image information in the two mappings is different. Consequently, it would be desirable to compute the spherical perspective projection directly. The direct rendering of a spherical projection can be accomplished using rendering methods that, unlike conventional rasterization, do not depend on preserving the linearity of mappings between object-space and image-space. Rendering methods such as ray casting, which employ an image-order, point-sampled approach to image synthesis, do not depend on these linear mappings and can directly produce a spherical perspective rendering. To date there are no general-purpose real-time image generation systems based on ray casting. Several special-purpose image generation methods employing voxel or height-field databases (e.g., U.S. Pat. No. 5,317,689 assigned to Hughes Aircraft, and U.S. Pat. No. 5,550,959 assigned to Nova Logic Inc.) or highly simplified geometries (e.g. the computer game Doom™ distributed by Id Software) exist that employ a simplified form of spatial subdivision ray casting based on two-dimensional grids. For these systems the restriction of database form limits the methods to special applications. Moreover many of these methods (e.g. 2-D grid ray casting) also employ image-space interpolations that require linear mappings between image-space and object-space, thereby making them unsuitable for direct computation of spherical, or other non-linear, projections.
The optical subsystem of the eye also imposes limits on the performance of the visual system. At any instant the eye has a focal depth that is determined by the dynamic state of accommodation of the lens, the pupillary diameter, and the fixed focal characteristics of the cornea. Because the optical system has a limited depth-of-field, objects substantially removed from the focal plane will not be focused properly on the retina, thereby reducing the resolvable detail of the element. In conventional image generation systems, images are generated by planar perspective projection that does not account for the optical refraction and dispersion of rays that normally occurs in the lens of the eye. As a result all elements in a typical computer generated image are in focus regardless of their depth from the viewpoint. The depth-of-field effect, which produces blurring for objects removed from the focal plane, has been simulated in some non-real-time image synthesis methods. One of the first methods used to simulate this effect was the method of distributed ray tracing (Cook R. L., Porter T., Carpenter L., "Distributed Ray Tracing," ACM SIGGRAPH Proceedings, 1984, pg. 137-145). This method employs ray tracing in which the distribution of rays is selected to travel through different parts of a lens with known focal length and aperture. The method uses an assumed focal distance for computing images in non-real-time. Distributing the rays in this fashion simulates the dispersion that produces the depth-of-field effect. In this approach, additional computation is required to compute a reduction in visible image information. An alternate approach would be to determine those parts of the image that are out of the viewer's focal plane and render them with reduced resolution or level-of-detail, thereby reducing the computational cost. Real-time image generation systems that employ eye-tracking could, in principle, determine a viewer's instantaneous focal depth (e.g. by employing an oculometry system capable of real-time determination of ocular focus) and render graphic elements substantially removed from the focal plane at a lower level-of-detail or resolution. To date no existing real-time image generation system employs this method.
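The proposed approach can be sketched as follows. The blur-circle formula below is the standard thin-lens approximation; the pupil diameter, focal length, and level-of-detail thresholds are illustrative assumptions, and no particular oculometry interface is implied.

```python
def blur_circle_diameter(obj_dist, focus_dist, aperture=0.004, focal_len=0.017):
    """Thin-lens circle-of-confusion diameter (meters on the image plane)
    for an object at obj_dist when focus is at focus_dist.  Defaults are
    rough human-eye values (4 mm pupil, 17 mm focal length), assumed here
    for illustration only."""
    return (aperture * focal_len * abs(obj_dist - focus_dist)
            / (obj_dist * (focus_dist - focal_len)))

def level_of_detail(obj_dist, focus_dist, thresholds=(2e-6, 10e-6, 50e-6)):
    """Map blur-circle size to a discrete level-of-detail: 0 means full
    detail for in-focus objects; larger values select coarser rendering
    for objects removed from the focal plane.  Thresholds are arbitrary."""
    c = blur_circle_diameter(obj_dist, focus_dist)
    lod = 0
    for t in thresholds:
        if c > t:
            lod += 1
    return lod

# Focal depth would be supplied in real time by an oculometry system;
# here it is simply assumed to be 1 meter.
print(level_of_detail(1.0, 1.0))   # in focus: full detail (0)
print(level_of_detail(0.3, 1.0))   # well off the focal plane: coarser
```

An object on the focal plane receives full detail, while an object substantially removed from it is assigned a coarser level-of-detail, reducing its rendering cost rather than increasing it as in distributed ray tracing.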
From the foregoing analysis it is clear that existing real-time image generation systems do not effectively exploit several significant spatial, temporal, and spatio-temporal limits of human vision that could reduce the cost of image generation. It is also clear that, in order to exploit these limits of human vision, image generation systems should employ progressively stratified, adaptive refinement sampling and other variable resolution techniques that are not readily implemented in conventional rasterization pipelines.