1. Field of the Invention
This invention relates to a method and system for generating and encoding images using primitive reprojection.
2. Description of Background
Computer-based graphics rendering typically is performed based on one of two techniques--either: 1) rasterization, or 2) ray tracing. The uses and advantages of each system are discussed below.
Most commercially available general-purpose real-time image generation systems employ a common approach to visible surface determination. As shown in FIG. 1, this approach is based on a graphics pipeline in which primitives undergo geometric processing 200 in the first stage and some type of depth-comparison rasterization 204 in a later stage. Parallel implementations of this method generally employ a two stage pipeline in which the first stage is object parallel and the second stage is intrinsically image parallel. This depth-comparison rasterization approach to visible surface determination has been implemented in a variety of different hardware architectures with the goal of real-time performance. Although these implementations demonstrate significant architectural differences, they share similar communication costs and scalability limits imposed by the underlying object-order organization of the pipelines.
Molnar et al. (hereinafter "Molnar") disclose in "A Sorting Classification of Parallel Rendering", published in IEEE Computer Graphics and Applications, June 1994, a useful classification of the three main parallel graphics architectures, based on how the object parallel phase and the image parallel phase of the pipelines are combined. This classification can also be considered to be based on how the geometry processing stage and the rasterization stage of the pipeline access primitive data. The three parallel graphics architectures are 1) sort-middle, 2) sort-last, and 3) sort-first. Most real-time image generation systems are based on a sort-middle architecture. As shown in FIG. 2, in a parallel sort-middle architecture, primitives from the graphics database are arbitrarily assigned to one of the object-parallel pipes where they are subjected to geometry processing 200. The resulting screen-space primitive information is distributed by broadcast or crossbar networks to a set of image parallel rasterization processors which perform z-buffer rasterization 204. As shown in FIG. 3, other real-time image generation systems employ a sort-last or image composition architecture in which primitives are initially arbitrarily assigned to processors that perform both primitive transformation (in geometry processing 200) and rasterization 204. Based on the primitives which it has received, each of these processors rasterizes an incomplete image of a partial set of the primitives. In the compositing phase 208, the complete image is constructed by a depth buffered composition of the plurality of incomplete images from all processors. In this case the primitive distribution crossbar is replaced by a high bandwidth image composition interconnection network. 
The interconnection network is either: a) a sort-last sparse (SL-sparse) interconnection network that receives only pixels generated during rasterization, or b) a sort-last full (SL-full) interconnection network which receives a full image from each processor. To achieve their desired performance, these primitive crossbars or image composition networks are currently implemented as dedicated special purpose hardware.
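The depth buffered composition performed in a sort-last architecture can be illustrated with a minimal sketch. The sketch assumes an SL-full arrangement in which every processor contributes a complete scanline; the names `Pixel`, `EMPTY`, and `composite` are hypothetical and are not taken from Molnar or from any particular system.

```python
from typing import List, Tuple

# A pixel is (depth, color). An infinite depth marks a pixel that the
# processor did not rasterize (assumption made for this illustration).
Pixel = Tuple[float, str]
EMPTY: Pixel = (float("inf"), "background")


def composite(partials: List[List[Pixel]]) -> List[Pixel]:
    """Merge partial images by keeping the nearest sample at each pixel."""
    width = len(partials[0])
    result = [EMPTY] * width
    for image in partials:
        for x, pixel in enumerate(image):
            if pixel[0] < result[x][0]:  # depth comparison
                result[x] = pixel
    return result


# Two processors each rasterized a different subset of the primitives
# into a four-pixel scanline.
proc_a = [(5.0, "red"), EMPTY, (2.0, "red"), EMPTY]
proc_b = [(3.0, "blue"), (7.0, "blue"), (4.0, "blue"), EMPTY]

final = composite([proc_a, proc_b])
print([color for _, color in final])
```

Every processor sends every pixel of its partial image in this sketch, which is why the composition network of an SL-full system must carry a full image per processor per frame.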
These depth-comparison rasterization architectures generally require special purpose high bandwidth geometry buses or image composition networks to effect the convergence of the object parallel and image parallel stages of the graphics pipeline. The communication costs associated with this task generally are met with dedicated special purpose interconnects only found in special purpose hardware.
In addition to the special purpose communication hardware required to recombine the object parallel and image parallel stages of the pipeline, most graphics systems also employ hardwired communication channels between rasterizers and frame buffer. A hardwired connection between rasterizers and z-buffer/frame buffer generally does not allow for dynamic, task-adaptive rasterization processor assignment. As a result, for most sort-middle or z-buffer architectures, the fixed relationship between rasterization processors and the z-buffer creates an intrinsic tradeoff between primitive overlap and load balancing. Rasterizers that are mapped to relatively large contiguous regions of the z-buffer are susceptible to load imbalance when too many primitives project to a small area of the viewport. Alternatively, a rasterizer-to-z-buffer mapping in which each rasterizer sub-serves several small interspersed regions of the z-buffer reduces the clumping that tends to occur in larger contiguous regions. However larger primitives and smaller sub-image regions increase the primitive overlap factor which increases the communication requirement and produces additional per-primitive setup calculations. The Silicon Graphics Reality Engine (Akeley 1993), a high performance system, uses small interspersed subregions and a triangle bus that supports essentially all-to-all broadcast to the rasterizers. As with many high performance graphics systems, the Reality Engine employs a hardwired communication channel between rasterizers and frame buffer.
The Pixel-Planes 5 (Fuchs et al. 1989) graphics system is a sort-middle architecture in which a single dedicated communication system (based on a ring) sub-serves both primitive distribution to the rasterizers and transmission of pixel information from the rasterizers to a frame buffer. Each of the rasterizers is hardwired to a small 128 pixel square z-buffer. The pixel block corresponding to this small z-buffer can be dynamically mapped to the frame buffer. As described above, the rendered pixel data is transferred from the rasterizers to the frame buffer over the ring network. Since this architecture does not depend upon a fixed connection between a rasterizer and a frame buffer, it accommodates task adaptive load balancing to reduce load imbalances caused by primitive clumping. However, like other architectures based on depth buffered rasterization, it requires a high-speed dedicated communication network to sub-serve primitive distribution and pixel transmission.
As described in Molnar, the high communication cost of primitive distribution, image composition, and pixel transmission to frame buffer is prohibitively expensive to implement on typical multiprocessor-memory communication channels and generally limits implementation of high performance depth-comparison rasterization to special-purpose dedicated hardware. In contrast some early graphics architectures like the Stellar Graphics Supercomputer (described in Proceedings of ACM SIGGRAPH Vol 22, Number 4, 1988. pp 255-262, incorporated herein by reference) were based on a truly unified model of communication in which all of the communication requirements were met on the main processor-memory bus. Unfortunately the communication cost of this process makes the memory bus a significant bottleneck to performance. Even with the greater processor-memory bandwidth provided by modern shared-memory multiprocessor interconnects, these communication costs quickly become performance limiting.
Molnar describes that the sort-first architecture is the least explored architecture and would potentially have a lower communication cost than either a sort-last or a sort-middle architecture. However, Molnar further indicates that no one is known to have been able to build a sort-first architecture. As shown in FIG. 4, in this architecture each processor is assigned a portion of the image to render. Unlike sort-last processors, each processor in a sort-first architecture transforms and rasterizes all primitives for its corresponding image region. No compositing of the sub-images is required. During pre-transformation, primitives are initially assigned arbitrarily to processors to determine into which region(s) each primitive falls. Then each primitive is transmitted to the processor(s) in whose region(s) the primitive may appear. This partitioning of primitives also forms the initial distribution of primitives for the next frame. In this architecture primitives are only transferred when they cross from one processor's region to another. Since the image-space distribution of primitives generally changes very little from frame to frame, only a small fraction of the primitives need to be communicated for each frame. In this way a sort-first architecture can, in principle, substantially lower communication costs by exploiting frame-to-frame coherence. As explained by Molnar and Mueller (The Sort-First Rendering Architecture for High-Performance Graphics, published in the ACM 1995 Symposium on Interactive 3D Graphics, 1995), a typical image sequence contains a substantial coherence of on-screen primitive movement that can be exploited by the sort-first method to reduce the rate at which primitives need to be redistributed between subimage display lists. Neither of the other two depth-comparison rasterization architectures can exploit frame-to-frame coherence in this way.
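The frame-to-frame coherence exploited by a sort-first architecture can be illustrated with a minimal sketch. The sketch assumes that the screen is divided into square regions, one per processor, and that each primitive is summarized by a screen-space bounding box; the names `regions_for` and `transfers` and the region numbering are hypothetical.

```python
from typing import Dict, Set, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in pixels


def regions_for(bbox: BBox, tile: int, cols: int, rows: int) -> Set[int]:
    """Return the indices of the screen regions that a bounding box overlaps."""
    xmin, ymin, xmax, ymax = bbox
    c0, c1 = max(0, int(xmin // tile)), min(cols - 1, int(xmax // tile))
    r0, r1 = max(0, int(ymin // tile)), min(rows - 1, int(ymax // tile))
    return {r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def transfers(prev: Dict[int, BBox], curr: Dict[int, BBox],
              tile: int, cols: int, rows: int) -> Dict[int, Set[int]]:
    """Map each primitive id to the regions that must newly receive it."""
    moves: Dict[int, Set[int]] = {}
    for pid, bbox in curr.items():
        now = regions_for(bbox, tile, cols, rows)
        before = regions_for(prev[pid], tile, cols, rows) if pid in prev else set()
        new_regions = now - before
        if new_regions:
            moves[pid] = new_regions
    return moves


# A 200 x 200 pixel screen split into four 100-pixel regions (2 columns, 2 rows).
prev_frame = {1: (10, 10, 40, 40), 2: (80, 10, 95, 40)}
curr_frame = {1: (12, 10, 42, 40), 2: (90, 10, 105, 40)}

moves = transfers(prev_frame, curr_frame, tile=100, cols=2, rows=2)
print(moves)
```

Primitive 1 stays within its region and generates no traffic; only primitive 2, which crosses a region boundary, is redistributed. A primitive with no entry in the previous frame is sent to every region it overlaps, which corresponds to the more expensive newly visible case.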
Unfortunately while the sort-first architecture is efficient at resolving visibility relationships among primitives that were visible in an earlier frame, the classification and communication costs are high for primitives not visible in an earlier frame. In addition the classification and distribution of primitives that become invisible is also problematic. Primarily for these reasons, no system employing a sort-first architecture has been built.
Another disadvantage of known rasterization architectures is that performance and scalability are limited by the object-order organization of the pipeline. In principle each primitive in the database must be transformed, projected to the viewplane, and clipped to the viewport in order to determine potential visibility by inclusion in the viewing frustum. As a result the geometry phase of the pipeline is intrinsically O(n) in the number of primitives in the database. The use of bounding volumes or spatial subdivision of the database can reduce the computational cost of the geometry phase by a constant factor. Hierarchical organization of the bounding volumes can produce additional reductions in the number of primitives processed in the geometry phase. However, these methods of view frustum culling still produce O(log n) dependence on the number of primitives in the database. The scalability of known rasterization architectures is also limited because primitive rasterization is not amenable to a front-to-back depth prioritization. In principle the visibility problem is most efficiently solved in a front-to-back order from the viewpoint. However, the spatial extent of polygonal primitives prevents a simple front-to-back solution of the visibility problem on a per-primitive basis. As a result these methods depend upon rasterizing all forward facing primitives in the view volume and comparing depth at each sample point produced by the rasterization. The cost of such rasterization is intrinsically linear in the total number of forward facing primitives in the view volume. Thus, the cost of rasterization is O(n) in the depth complexity of the scene.
Rasterization performance can also be improved by employing pre-computed visibility sets which are lists of primitives that are potentially visible from specific regions of a 3-D database. As shown in FIG. 5A, a source cell 500 is chosen, and it is initially pre-computed which visible cells 504 can be seen from the source cell 500 versus which occluded cells 508 cannot be seen. Then, as shown in FIG. 5B, it is pre-computed which primitives correspond to individual visible objects 512 in the visible cells 504 and which primitives correspond to occluded objects 516 in either visible cells 504 or occluded cells 508. Likewise, as shown in FIG. 5C, the technique can be expanded to pre-compute visibility in three dimensions rather than two dimensions. In FIG. 5C the visible beams are determined based on the source cell 500. These pre-computation methods generally use a spatial subdivision of the database together with frustum clipping techniques that determine the potentially visible primitive set for each cell of the subdivision based on knowledge of the portals (e.g., doors and windows) within the regions.
Several different methods of polygon flow minimization based on pre-computed visibility sets have been proposed. One of the most notable techniques is the technique described by M. Abrash published in 1996 in Zen of Graphics Programming by the Coriolis Group, Inc., incorporated herein by reference. This technique is used in the multi-player, first-person action game called Quake which is distributed by ID Software. Such methods are generally applicable to models in which visibility is restricted by wall-like geometry and unrestricted by window-like openings and perform poorly for less restricted types of virtual environments. The National Research Council's Committee on Virtual Reality Research and Development 1995 Report concludes that there is at present no general solution available to the problem of polygon flow minimization or pre-computed visibility for known graphics architectures.
Known depth-comparison and list-priority approaches to visibility determination do not solve the visibility problem in the most efficient front-to-back order. An alternate method of visibility determination that proceeded in a prioritized front to back order, stopping when the nearest visible sample is encountered, would scale better with the overall depth complexity of the scene. Visibility tracing methods of image synthesis including ray tracing and ray casting solve the visible surface problem in an intrinsically depth-prioritized front-to-back order. In the method of ray casting and ray tracing, rays which originate at the viewpoint are intersected with objects in the environment as shown in FIG. 6. When this intersection is conducted by extending the ray incrementally from the viewpoint into the environment (e.g., stepping through grid structure spatially subdividing the database as shown in FIGS. 7A, 7B, 7C) then the visible surface determination is made in a depth-prioritized, front-to-back order.
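The front-to-back character of incremental ray extension through a grid can be illustrated with a minimal two-dimensional sketch. The sketch assumes a uniform grid of unit cells and uses a standard incremental grid traversal; the names `walk_grid` and `first_hit` are hypothetical and the code is not the traversal of FIGS. 7A, 7B, 7C.

```python
import math
from typing import Iterator, Optional, Set, Tuple

Cell = Tuple[int, int]


def walk_grid(ox: float, oy: float, dx: float, dy: float,
              size: int) -> Iterator[Cell]:
    """Yield the cells of a size x size unit grid in the order a ray enters them."""
    x, y = math.floor(ox), math.floor(oy)
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # Ray parameter t at which the next vertical / horizontal boundary is crossed.
    t_max_x = ((x + 1 if dx > 0 else x) - ox) / dx if dx != 0 else math.inf
    t_max_y = ((y + 1 if dy > 0 else y) - oy) / dy if dy != 0 else math.inf
    # Increase in t needed to cross one whole cell along each axis.
    t_delta_x = abs(1 / dx) if dx != 0 else math.inf
    t_delta_y = abs(1 / dy) if dy != 0 else math.inf
    while 0 <= x < size and 0 <= y < size:
        yield (x, y)
        if t_max_x < t_max_y:
            x += step_x
            t_max_x += t_delta_x
        else:
            y += step_y
            t_max_y += t_delta_y


def first_hit(occupied: Set[Cell], ox: float, oy: float,
              dx: float, dy: float, size: int) -> Optional[Cell]:
    """Return the nearest occupied cell along the ray, or None if there is none."""
    for cell in walk_grid(ox, oy, dx, dy, size):
        if cell in occupied:
            return cell  # traversal stops here; farther cells are never visited
    return None


path = list(walk_grid(0.5, 0.5, 2.0, 1.0, 5))
hit = first_hit({(3, 1), (4, 2)}, 0.5, 0.5, 2.0, 1.0, 5)
print(path, hit)
```

Because cells are visited in order of increasing distance from the ray origin, the search ends at the first occupied cell, and the work done is independent of how many occupied cells lie behind it.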
Unfortunately the high computational cost of ray tracing and ray casting has prevented efficient general purpose real-time implementations. Real-time ray casting methods have been developed for specially restricted databases such as the height-mapped databases and associated 2-D grids found in DOOM (distributed by ID Software), Comanche (U.S. Pat. No. 5,550,959; Assignee: Nova Logic, Inc.) and related computer games. In DOOM and similar games, as shown in FIG. 8A, a series of rays is cast from a player's position so that the rays cover the player's field of view. When a ray hits an object in the height-mapped database, the texture of the object is copied from a texture map into the final three-dimensional view, starting at the base of the object and continuing upwards for the height of the object. (If an object does not extend completely to the ceiling then the system searches for the next object behind the initial object.) When all rays have been cast and the textures mapped, a three-dimensional view, such as shown in FIG. 8B, is generated and displayed.
As shown in FIG. 9, real-time ray casting has also been employed for voxel databases organized by hierarchical three-dimensional grids, as described in U.S. Pat. No. 5,317,689, incorporated herein by reference. While these methods are quite scalable in database size and depth complexity the geometric restrictions imposed by the height field and voxel representations generally limit their usefulness to relatively few special applications.
A method called reprojection was later developed to reduce the high computational cost of ray tracing. The premise behind reprojection is to employ information from the current frame of an image sequence to reduce the amount of computation required to generate the subsequent frame. In the typical implementation image samples from one frame are transformed and reprojected to the viewport for the subsequent frame using known transformation matrices for perspective projection. This process is shown for two different viewpoints and corresponding viewports in FIG. 10 from a paper by Sig Badt entitled "Two algorithms for taking advantage of temporal coherence in ray tracing," and published in Visual Computer, Number 4, 1988, the contents of which are incorporated herein by reference. In FIG. 10, samples are effectively moved from their position in one image to their position in another image by applying a concatenated transformation matrix representing object motion, camera motion, and perspective projection. Ray tracing is required only to fill the deficiencies or "holes" left in the image by this reprojection of previously visible samples. Sig Badt first suggested reprojecting image samples, which have a dualistic image-space object-space character, to decrease the amount of ray tracing required to generate frames in an animated sequence. Subsequent development of the reprojective paradigm followed two different paths. Initially reprojection was further developed as an adjunct to ray tracing. An alternative developmental path has resulted in systems of image based rendering that employ approximations to reprojection.
Adelson et al. (1993), Adelson et al. (1995), and Fu (1996) refined the techniques of sample reprojection as applied to ray tracing by recognizing that depth buffering could be employed to resolve the visibility among previously visible samples. In addition these authors identified two different types of image deficiencies or "holes" caused by reprojection. The exposure hole is the result of a newly visible surface being exposed in the image sequence and requires ray tracing to fill. The expansion hole is the result of intersample expansion caused by perspective projection and can sometimes be filled by a more efficient interpolative method described by Fu.
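Sample reprojection with depth buffering can be illustrated with a minimal sketch. The sketch assumes a one-dimensional scanline, a static scene, and a pinhole camera that translates along the x axis while looking down the +z axis; the function `reproject` and its parameters are hypothetical, and the sketch does not distinguish exposure holes from expansion holes.

```python
import math
from typing import List, Optional, Tuple

Sample = Tuple[float, float, str]  # world-space x, depth z, shaded color


def reproject(samples: List[Sample], cam_x: float, focal: float,
              width: int) -> Tuple[List[Optional[str]], List[int]]:
    """Reproject previously visible samples into a new scanline.

    Returns the new scanline and the list of pixels left empty ("holes").
    """
    depth = [math.inf] * width
    color: List[Optional[str]] = [None] * width
    for x, z, c in samples:
        if z <= 0:
            continue  # sample is behind the new camera position
        u = math.floor(focal * (x - cam_x) / z + width / 2)
        # The depth buffer resolves visibility among reprojected samples.
        if 0 <= u < width and z < depth[u]:
            depth[u], color[u] = z, c
    holes = [u for u in range(width) if color[u] is None]
    return color, holes


# Samples visible from cam_x = 0: a far wall at z = 4 covering every pixel
# except pixel 4, where a nearer object at z = 2 occluded the wall.
samples = [(u - 3.5, 4.0, "far") for u in range(8) if u != 4]
samples.append((0.25, 2.0, "near"))

color, holes = reproject(samples, cam_x=0.5, focal=4.0, width=8)
print(color, holes)
```

After the camera moves, the near sample lands on pixel 3 and wins the depth comparison against the wall sample there, while pixel 4 receives no sample. Only that hole would need to be ray traced to complete the new frame.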
Existing methods of sample reprojection are generally restricted to rendering static environments under camera motion. This restriction results because exposure of moving objects is not limited to identifiable exposure areas that can be selectively ray traced. Similarly, existing reprojective methods of image synthesis can fail to render certain static objects that penetrate the view volume as a result of view volume motion. This failure occurs because the exposure of such newly penetrating objects is also not limited to specific exposure areas identified by the reprojection of previously visible samples. In addition existing reprojective methods of image generation are prone to certain visibility errors that result from finite temporal sampling. Moreover existing methods of reprojective image generation experience serious performance degradation for image sequences which contain a high rate of occlusion-exposure transitions. The aforementioned limitations have significantly restricted the application of existing methods of reprojective image synthesis. For example, there is at present no system which performs actual reprojective image synthesis in real-time. In a later section of this specification it is disclosed how the present invention corrects these specific deficiencies and otherwise improves on the accuracy, efficiency and versatility of existing reprojective methods to allow a real-time implementation.
Existing reprojective methods have limited commercial importance as applied to ray tracing since sample reprojection is practically limited to reducing only the number of first generation rays traced. In commercial renderers ray tracing is usually not employed as a method of primary visible surface determination but more as an illumination method employing only higher generation reflected and refracted rays. In addition with the exception of a few interesting systems, that are discussed later in this specification, general-purpose real-time image generation systems do not employ a ray tracing or ray casting method of visible surface determination that would be amenable to acceleration by sample reprojection.
Perhaps the most important commercial application of the reprojective paradigm has been in image based rendering or view interpolation techniques that are essentially interpolative approximations to sample reprojection. The general view interpolation method of Chen et al. (1993) in principle allows unrestricted camera motion in the display of a static database. The method does not actually synthesize images in the conventional fashion from a database representing three dimensional objects. Rather it employs a very large number of pre-rendered images to compute visibility using what is essentially an interpolative approximation to reprojection. The size of the image database must be sufficient to include images that encode the visibility of the entire surface of every object in the model from any possible view position and direction. The size of the database required to encode this total visibility within a 3-D model is prohibitively large for practical implementations. No commercial implementations of this method have been developed.
Plenoptic image based rendering methods (such as QuickTime® VR described by Chen (1995)) are essentially highly constrained reprojective methods in which camera motion is restricted to pan and zoom capability. The simplified type of optical flow that results from this restricted camera motion does not produce occlusion or exposure events. This allows image sequences representing view direction vector rotation from a single viewpoint to be computed by the reprojection of a single panoramic image in real-time.
A recently disclosed graphics architecture that makes limited use of approximate reprojective methods is Microsoft's Talisman architecture described in Torborg et al. (1996). In this method of real-time image generation, affine image-space approximations to actual reprojection are applied to entire images of individual objects that are stored in separate object image buffers. Visibility among previously visible objects is resolved by composition of these individual images. In this architecture object-space primitive transformation and rasterization operations are replaced by less expensive image-space image processing operations which approximate the object-space transformations. Unfortunately affine image-space transformations applied on an object basis cannot accurately approximate the complex optical flow patterns caused by object-space transformations. Such transformations typically result in a heterogenous pattern of optical flow for a single object that is poorly approximated by a single affine image-space transformation (or combination of such transformations) applied to the entire object. In addition the technique of applying image-space transformations to images of entire objects neglects the visibility changes that result from object self-occlusion or self-exposure. By approximating the image of a moving object with the moving image of an object the method produces visibility errors along the terminator silhouette of objects. The Talisman architecture attempts to exploit temporal image coherence in a very simplistic way by essentially modeling three dimensional scenes as a collection of 2-D multiplane layers. In addition because it provides no method to identify or search exposure regions and does not remove occluded objects from processing the method fails to realize the full power of the reprojective approach to visible surface determination.
Another limitation of existing real-time image generation architectures is that the intrinsic organization of known rasterization pipelines is not well suited to efficient client-server implementations. The goal of an efficient client-server system is to decrease the computational and storage demands on the client while minimizing communication between client and server. Existing methods of distributed three-dimensional graphics generally employ an image transmission approach in which the server generates images in real-time which are transmitted to the client for display. Alternatively the data representing three dimensional objects is stored on a server unit and is completely downloaded to a client unit. In this geometry replication approach the server unit functions in the capacity of a database server which makes the three dimensional database available to client units. Client units generally download the complete database and perform image generation on the downloaded database. Systems such as Distributed Interactive Simulation, using the Department of Defense Advanced Research Projects Agency SIMNET protocol, and game servers allow limited data representing avatar, vehicle or player information to be updated in real-time and transmitted to client units.
Methods of client-server graphics that employ image transmission require a relatively high bandwidth connection between the server unit and the client, even if real-time image compression and decompression are employed.
Geometric replication methods such as VRML generally have a lower bandwidth requirement but do not decrease the storage or processing requirements of the client. For these systems the client must be capable of rendering the complete replicated database in real time. The "server" does not provide image generation services. Rather it functions primarily as a database server. Levoy, in Polygon-Assisted JPEG and MPEG Compression of Synthetic Images published in the SIGGRAPH 95 Conference Proceedings, discloses a somewhat more efficient client-server method based on a hybrid image transmission/geometry transmission approach. In this method the server unit maintains a low level-of-detail database and a high level-of-detail database. For each frame of a real-time image stream images are generated from both the low level-of-detail and high level-of-detail database. The difference between the two images is computed using known methods of image processing. A client unit contains only a low level-of-detail geometric database and renders this database for each frame. The difference image computed on the server is compressed and transmitted in real-time to the client, which decompresses it and composites it together with the image generated by the client from the low level-of-detail database. The result is a high level-of-detail, high quality image which requires a relatively low transmission bandwidth (to transmit the relatively low information difference image) and which requires relatively limited client storage and processing capabilities. While this method can decrease the computational load of the client and reduce the communication costs, it requires both real-time image compression and decompression and further requires that special image compositing be added to the pipeline. In addition the entire low level-of-detail database must be stored and processed by the client. 
Alternatively Levoy discusses the optional approach of transmitting a transformed and compressed low level-of-detail representation of the geometric database in screen-space representation. This would require that the low level-of-detail geometric primitives be transmitted for every frame which increases the required connection bandwidth. While only the primitives that are actually visible for each frame would need to be transmitted in this approach, Levoy does not indicate how these visible primitives would be identified. Moreover since primitives visible in a current frame are likely to be visible in a subsequent frame, the repeated transmission of primitives that have been transformed to image-space representation is an inefficient use of available bandwidth.
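The difference-image scheme attributed to Levoy above can be illustrated with a minimal sketch. The sketch assumes 8-bit grayscale scanlines and omits the compression and decompression steps entirely, so the reconstruction shown is exact, which would not hold for a lossy codec; the function names are hypothetical.

```python
from typing import List


def difference_image(high: List[int], low: List[int]) -> List[int]:
    """Server side: per-pixel signed difference between the two renderings."""
    return [h - l for h, l in zip(high, low)]


def reconstruct(low: List[int], diff: List[int]) -> List[int]:
    """Client side: add the received difference to the locally rendered image."""
    return [max(0, min(255, l + d)) for l, d in zip(low, diff)]


# Server renders both databases; client renders only the low-detail one.
server_high = [120, 130, 200, 90]
server_low = [118, 130, 180, 95]
client_low = [118, 130, 180, 95]  # assumed identical to the server's rendering

diff = difference_image(server_high, server_low)  # transmitted to the client
client_image = reconstruct(client_low, diff)
print(diff, client_image)
```

The difference values are small wherever the two levels of detail agree, which is the property that makes the difference image cheaper to transmit than the full high level-of-detail image. The scheme depends on the client producing the same low level-of-detail image as the server.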
Another approach to distributed client-server image generation is based on demand-driven geometry transmission to the client, and is disclosed in Demand-Driven Geometry Transmission for Distributed Virtual Environments, by Schmalstieg et al., published as part of Eurographics 96, and incorporated herein by reference. In this method the server determines and periodically updates a list of potentially visible primitives for each client using a spherical clipping volume around the viewpoint. The result is compared to a list of primitives previously transmitted to the corresponding client and only those potentially visible primitives that have not been previously transmitted are sent to the client. This method reduces communication cost by limiting transmission to those primitives that have just become potentially visible. Since the client replaces primitives in the client display list when they are no longer included in the spherical clipping volume, the storage and compute requirements of the client are limited to only those primitives in the potentially visible set. However, since this method uses a limited inclusion volume, one disadvantage is that the method can somewhat arbitrarily exclude distant primitives from the potentially visible set. Moreover the use of a spherical inclusion volume results in the inclusion and transmission of a large number of geometric primitives that are not visible in the current frame and are unlikely to be visible in upcoming frames (for example, primitives in the inclusion sphere "behind" the viewpoint). As a result the method makes inefficient use of available transmission bandwidth and available client storage and compute resources. Another disadvantage of this method is that the client must compute removal of primitives by clipping to the inclusion volume and must also implement display list replacement and compaction protocols.
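The bookkeeping behind demand-driven geometry transmission can be illustrated with a minimal sketch. The sketch assumes that each primitive is represented by a single reference point and that the inclusion test is a distance comparison against the sphere radius; the function `update_client` and its return convention are hypothetical and are not taken from Schmalstieg et al.

```python
import math
from typing import Dict, Set, Tuple

Point = Tuple[float, float, float]


def update_client(primitives: Dict[int, Point], viewpoint: Point,
                  radius: float, sent: Set[int]) -> Tuple[Set[int], Set[int], Set[int]]:
    """Return (ids to transmit, ids the client may discard, new resident set)."""
    inside = {pid for pid, pos in primitives.items()
              if math.dist(pos, viewpoint) <= radius}
    to_send = inside - sent   # newly inside the sphere, not yet transmitted
    to_drop = sent - inside   # no longer inside the sphere
    return to_send, to_drop, inside


scene = {1: (0.0, 0.0, 1.0), 2: (0.0, 0.0, -1.0), 3: (0.0, 0.0, 50.0)}

# First update: the viewer is at the origin and nothing has been sent yet.
send1, drop1, resident = update_client(scene, (0.0, 0.0, 0.0), 10.0, set())

# Second update: the viewer has moved far along +z.
send2, drop2, resident = update_client(scene, (0.0, 0.0, 45.0), 10.0, resident)
print(send1, drop1, send2, drop2)
```

In the first update primitive 2 is transmitted even though it lies behind a viewer looking along +z, which illustrates how a spherical inclusion volume admits primitives regardless of view direction. Primitive 3 is excluded until the viewer approaches it, regardless of whether it was actually visible from the origin.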
In addition to the computation and display of computer generated imagery in real-time, images can be synthesized in non-real-time and stored on recordable streaming media for later real-time display. A variety of different approaches have been developed which allow users to exercise various degrees of interactivity during the replay of prerecorded computer generated image sequences. On one extreme, interactivity is limited to the ability to start and stop a pre-rendered image sequence that may be stored on a recordable medium. While this type of interactivity is extremely limited, a streaming data source has the advantage that, given sufficient decoding performance, its display is limited only by the transmission requirements of the data stream. This limited degree of interactivity is in contrast to the more complete level of interactivity that results when an entire 3-D graphic database is downloaded to a display unit capable of real time image generation. In this data replication case interactivity with the database is complete but the rendering performance of the receiving unit limits the size of the database that can be visualized and hence the viewable graphic content. Some systems attempt to combine the advantages of a streaming image sequence with the interactivity of a 3-D database. One example is MegaRace II and similar games of the "rail shooter" genre. In this system a rendered image sequence forms the background on which a limited 3-D database is rendered in real-time. In typical implementations vehicles and obstacles comprise the 3-D database which are under interactive control. The pre-rendered sequence provides the background on which the interactive image generation occurs. Typically the pre-rendered image sequence portrays a "flyover" type view of a roadway or other pathway to which the 3-D objects are confined. 
Movement of the user's vehicle to predetermined locations on the roadway initiates the streaming of the image sequence of the roadway and associated wayside. In this way a "chase" view of the user's vehicle is provided as the vehicle negotiates the pathway. The speed of the chasing camera can in principle be regulated by controlling the number of frames per unit distance that are displayed while keeping the number of frames per second constant. In practice, providing a choice of 10 different speeds would require an animation with 10 times the number of frames required for a constant speed implementation. Moreover the incremental encoding schemes employed by typical video codecs often do not allow display of every nth frame of an animation. As a result current implementations do not allow the speed of the camera to be varied. Typically as the target vehicle changes speed the camera sequence is forced to suddenly start and stop discontinuously. Because of a limited ability to control the speed and the inability to control view direction such pre-rendered sequences cannot be employed for first-person views, e.g. from within the vehicle.
Another approach to providing increased interactivity with a pre-rendered image sequence is to allow changes in the view direction vector during replay of the animation. This capability is provided by systems such as the QuickTime VR movie system. In this method a wide field-of-view image is encoded for each viewpoint in the animation. The image is stored as a cylindrical projection which can be converted to a planar perspective projection during decoding. The field-of-view used during the production of the image sequence is larger than the field-of-view of the decoded, displayed sequence. In this method the decoding unit selectively accesses sections of each cylindrical image that map to the decoding unit's current viewport. The decoding unit performs the cylindrical to planar projective remapping in real-time. The actual set of viewpoints defining the animation is constrained by this method to follow a fixed trajectory defined by a space-time curve. Thus while the datastream provides the user with a limited type of "look around" capability during replay of the image sequence, the actual vantage points are constrained by the initial encoding.
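The cylindrical to planar projective remapping can be illustrated with a minimal geometric sketch. The sketch assumes a full 360 degree cylindrical panorama, a pinhole display camera with its focal length given in pixels, and rotation about the vertical axis only; the function `planar_to_cylinder` and its parameters are hypothetical and do not describe the internals of any particular product.

```python
import math
from typing import Tuple


def planar_to_cylinder(u: float, v: float, view_w: int, view_h: int,
                       focal: float, yaw: float,
                       pano_w: int, pano_h: int) -> Tuple[float, float]:
    """Map a display pixel (u, v) to panorama coordinates (col, row)."""
    dx = u - view_w / 2          # horizontal offset from the view center
    dy = v - view_h / 2          # vertical offset from the view center
    # Horizontal angle of the ray through this pixel, including view rotation.
    angle = yaw + math.atan2(dx, focal)
    col = (angle / (2 * math.pi) * pano_w) % pano_w
    # Radius, in pixels, of a cylinder whose circumference is pano_w.
    radius = pano_w / (2 * math.pi)
    # Height at which the ray meets the cylinder wall.
    row = pano_h / 2 + radius * dy / math.hypot(dx, focal)
    return col, row


# View center with no rotation, then with the view turned 90 degrees.
center = planar_to_cylinder(320, 240, 640, 480, 320.0, 0.0, 4096, 1024)
turned = planar_to_cylinder(320, 240, 640, 480, 320.0, math.pi / 2, 4096, 1024)
print(center, turned)
```

A decoder following this approach would evaluate the mapping for each display pixel and sample the panorama at the resulting position, so changing `yaw` alters only which section of the stored cylindrical image is read, not the vantage point.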