As with many types of information processing implementations, there is an ongoing effort to improve the performance of computer graphics rendering. One attractive approach to improving rendering performance is to harness multiple graphics processing units (GPUs) together to render a single scene in parallel.
There are three predominant methods for rendering graphic data with multiple GPUs. These include Time Domain Composition, in which each GPU renders the next successive frame, Screen Space Composition, in which each GPU renders a subset of the pixels of each frame, and Scene based Composition, in which each GPU renders a subset of the database.
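By way of illustration only, the three ways of dividing work among GPUs can be sketched as follows. The function names, the round-robin frame assignment, the contiguous scanline bands, and the interleaved object split are all assumptions made for this sketch; the methods themselves do not prescribe a particular partitioning.

```python
def time_domain(frame_index, num_gpus):
    """Time Domain: a whole frame goes to one GPU (round-robin is assumed here)."""
    return frame_index % num_gpus


def screen_space(height, num_gpus):
    """Screen Space: each GPU gets a band of scanlines (contiguous bands assumed)."""
    bounds = [height * g // num_gpus for g in range(num_gpus + 1)]
    return [range(bounds[g], bounds[g + 1]) for g in range(num_gpus)]


def scene_based(objects, num_gpus):
    """Scene based: each GPU gets a subset of the database (interleaving assumed)."""
    return [objects[g::num_gpus] for g in range(num_gpus)]
```

In the first two cases every GPU may still need access to the whole database, whereas in the third case each GPU needs only its own subset.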
In Time Domain Composition each GPU renders the next successive frame. A major disadvantage of this method is that each GPU renders an entire frame. Thus, the speed at which each frame is rendered is limited to the rendering rate of a single GPU. While multiple GPUs enable a higher frame rate, Time Domain Composition applications can introduce a delay (i.e., increased latency) in the response time of the system to the user's input. These delays typically occur because at any given time only one GPU is engaged in displaying a rendered frame, while all the other GPUs are in the process of rendering one of a series of frames in a sequence. In order to maintain a steady frame rate, the system delays acting on the user's input until the specific GPU that first received the user's input cycles through the sequence and is again engaged in displaying its rendered frame. In practical applications, this condition serves to limit the number of GPUs that are used in a system.
Another difficulty associated with Time Domain Composition applications is related to the large data sets that each GPU should be able to access, since in these applications each GPU should be able to gain access to the entire data used for the image rendering. This is typically achieved by maintaining multiple copies of large data sets in order to prevent possible conflicts due to multiple attempts to access a single copy.
Screen Space Composition applications have a similar problem in the processing of large data sets, since each GPU must examine the entire database to determine which graphic elements fall within its part of the screen. The system latency in this case is equivalent to the time required for rendering a single frame by a single GPU.
The Scene Composition methods, to which the present invention relates, avoid the aforementioned latency problems, the requirement of maintaining multiple copies of data sets, and the problems involved in handling the entire database by each GPU.
The Scene Composition methods are well suited to applications requiring the rendering of a huge amount of geometrical data. Typically these are CAD applications and comparable visual simulation applications, considered as “viewers,” meaning that the data have been pre-designed such that their three-dimensional positions in space are not under the interactive control of the user. However, the user does have interactive control over the viewer's position, the direction of view, and the scale of the graphic data. The user may also have control over the selection of a subset of the data and the method by which it is rendered. This includes manipulating the effects of image lighting, coloration, transparency and other visual characteristics of the underlying data.
In CAD applications, the data tends to be very complex, as it usually consists of a massive number of geometric entities in the display list or vertex array. Therefore the construction time of a single frame tends to be very long (e.g., typically 0.5 sec for 20 million polygons), which in turn slows down the overall system response.
Scene Composition (e.g., object based decomposition) methods are based on the distribution of data subsets among multiple GPUs. The data subsets are rendered in the GPU pipeline and converted to a Frame Buffer (FB) of fragments (sub-image pixels). The sub-images of the multiple FBs have to be merged to generate the final image to be displayed. As shown in FIG. 1, for each pixel in the X/Y plane of the final image there are various possible values corresponding to different image depths presented by the FBs' sub-images.
Each GPU produces at most one pixel 12 at each screen (X/Y) coordinate. This composed pixel 12 is the result of the removal of hidden surfaces and of the shading and color blending needed for effectuating transparency. Each of the pixels 12 generated by the GPUs holds a different depth measure (Z-value), and these have to be resolved for the highest Z (the closest to the viewer). Only one pixel is finally allowed through. The merging of the sub-images of the FBs is the result of determining which value (10) from the various possible pixel values 12 provided by the FBs represents the closest point that is visible from the viewer's perspective. However, the merging of the partial scene data into a single raster still poses a performance bottleneck in the prior art.
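The per-pixel depth resolution described above can be sketched as follows. This is a minimal illustration, not the hardware of the invention. It assumes that each sub-image is a flat list whose entries are either a hypothetical (z, color) pair or None where the GPU produced no pixel, and it follows the convention stated above that the highest Z is the closest to the viewer.

```python
def composite_pixel(candidates):
    """Pick the candidate with the highest Z among the pixels the GPUs produced.

    candidates: one entry per GPU, either a (z, color) pair or None.
    Returns the winning (z, color) pair, or None if no GPU produced a pixel.
    """
    visible = [c for c in candidates if c is not None]
    if not visible:
        return None
    return max(visible, key=lambda c: c[0])


def composite_image(sub_images):
    """Merge R sub-images of equal length into one raster, pixel by pixel."""
    return [composite_pixel(list(px)) for px in zip(*sub_images)]
```

For example, with three sub-images of two pixels each, `composite_image([[(0.2, "red"), None], [(0.9, "blue"), None], [(0.5, "green"), (0.1, "grey")]])` lets through `(0.9, "blue")` at the first coordinate and `(0.1, "grey")` at the second.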
The level of parallelism in the prior art is limited due to the inadequate composition performance for multiple rasters. The composition of two rasters is usually performed by Z-buffering, which is a hardware technique for performing hidden surface elimination. In the conventional methods of the prior art, Z-buffering allows merging of only two rasters at a time.
Conventional hardware compositing techniques, as exemplified in FIG. 2A, are typically based on an iterative collating process of pairs of rasters (S. Molnar, “Combining Z-buffer Engines for Higher-Speed Rendering,” Eurographics, 1988), or on pipelined techniques (J. Eyles et al., “PixelFlow: The Realization,” ACM Siggraph, 1997). The merging in these techniques is carried out within log2 R steps, of S stages, wherein R is the number of rendering GPUs. In the collating case, the time needed to accomplish the comparison between two depth measures at each such comparator (MX) is log2 Z, where Z is the depth domain of the scene. For example, for typical depth buffers with 24 bits per pixel, the comparison between two Z-buffers is typically performed in 24 clock cycles.
Since the prior art techniques allow the merging of only two Z-buffers at a time, the composition of multiple rasters is performed in a hierarchical fashion. The complexity of these composition structures is O(log2 R), making the performance highly affected by R, the number of graphic pipelines. For growing values of R the compositing time exceeds the time slot allocated for real time animation. In practical applications, this condition serves to limit the number of GPUs that are used in a system. FIG. 2B shows the theoretical improvement of performance by increasing parallelism. The composition time grows by the factor of the complexity, O(log2 R). The aggregated time starts increasing at (e.g.) 16 pipelines. Obviously, in this case there is no advantage in increasing the level of parallelism beyond 16.
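The hierarchical pairwise merging can be sketched as follows, to make the growth in the number of stages concrete. This is a software illustration of the structure only, not of any particular hardware. It assumes that each raster is a list of hypothetical (z, color) pairs with every pixel present, that the highest Z wins, and that a raster left unpaired at a stage is carried over to the next stage.

```python
def merge_pair(a, b):
    """Z-merge two rasters of equal length; the pixel with the higher Z wins."""
    return [pa if pa[0] >= pb[0] else pb for pa, pb in zip(a, b)]


def tree_composite(rasters):
    """Merge R rasters two at a time; return (final raster, number of stages)."""
    level = list(rasters)
    stages = 0
    while len(level) > 1:
        # Pair up neighbours; each pair is merged by one two-input comparator.
        merged = [merge_pair(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd raster out passes through unchanged
        level = merged
        stages += 1
    return level[0], stages
```

With 8 rasters the loop runs 3 stages, and with 16 rasters it runs 4, so every doubling of R adds one more stage of two-input merging to the compositing path.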
Software techniques are usually based on compositing the output of R GPUs by utilizing P general purpose processors (E. Reinhard and C. Hansen, “A Comparison of Parallel Compositing Techniques on Shared Memory Architectures,” Eurographics Workshop on Parallel Graphics and Visualisation, Girona, 2000). However, these solutions typically require utilizing (i) binary swap, (ii) parallel pipeline, and (iii) shared memory compositor, which significantly increase the complexity and cost of such implementations.
The most efficient implementation among the software techniques is the Shared Memory Compositor method (known also as “Direct Send” on distributed memory architectures). In this method the computation effort for rendering the sub-images is increased by utilizing additional GPUs (renderers), as shown in the block diagram of FIG. 3A and the pseudo code shown in FIG. 3B. In the system illustrated in FIG. 3A, 2 compositors (CPUs, p0 and p1) operate concurrently on the same sub-images, which are generated by 3 renderers (GPUs, B0, B1, and B2). The computation task is distributed between the CPUs, each performing composition of one half of the same image. It is well known that for any given number of GPUs one can speed up the compositing by increasing the number of parallel compositors.
However, an increased number of renderers slows down the performance severely. The complexity of this method is O(N*R/P), where N is the number of pixels in a raster (image), R is the number of GPUs, and P is the number of compositing units (CPUs, Pi). The compositing process in this technique is completed within R−1 iterations. In the implementation of this technique on SGI's Origin 2000 Supercomputer, the compositing was carried out utilizing CPUs. The results of the compositing performed by this system are shown in FIG. 4. FIG. 4 demonstrates the overhead of this method: the compositing time required for this system is over 6 times the time required for the rendering.
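The structure of such a compositor can be sketched as follows. This is not the pseudo code of FIG. 3B; it is an illustrative reconstruction under stated assumptions: each sub-image is a list of hypothetical (z, color) pairs, the highest Z wins, the image is split into P contiguous slices, and Python threads stand in for the compositing CPUs (so the sketch shows the division of work rather than an actual speed-up).

```python
from concurrent.futures import ThreadPoolExecutor


def composite_slice(sub_images, start, stop):
    """One compositor: merge pixels [start, stop) of all R sub-images."""
    result = list(sub_images[0][start:stop])
    for raster in sub_images[1:]:  # R-1 iterations over the remaining renderers
        for i, px in enumerate(raster[start:stop]):
            if px[0] > result[i][0]:
                result[i] = px
    return result


def shared_memory_composite(sub_images, num_compositors):
    """Split the N pixels among P compositors that all read the same sub-images."""
    n = len(sub_images[0])
    bounds = [n * p // num_compositors for p in range(num_compositors + 1)]
    with ThreadPoolExecutor(max_workers=num_compositors) as pool:
        parts = list(pool.map(
            lambda p: composite_slice(sub_images, bounds[p], bounds[p + 1]),
            range(num_compositors)))
    return [px for part in parts for px in part]
```

Each compositor touches about N/P pixels in each of the R−1 iterations, which is consistent with the O(N*R/P) complexity stated above: adding compositors shortens each slice, while adding renderers lengthens the loop over sub-images.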
None of the methods described above has yet provided a satisfactory solution to the problem of compositing large quantities of sub-image data into one image.
It is an object of the present invention to provide a method and system for rendering in parallel a plurality of sub-image frames with close to real time viewing.
It is another object of the present invention to provide a method and system for concurrently composing large amounts of sub-image data into a single image.
It is a further object of the present invention to provide a method and system which substantially reduce the amount of time required for composing sub-image data into a single image.
It is still another object of the present invention to provide a method and apparatus for concurrently composing large amounts of sub-image data into a single image that can be implemented efficiently as a semiconductor based device.
It is a still further object of the present invention to provide a method and apparatus for composing sub-image data based on presenting a competition between the multiple sources of the sub-image data.
Other objects and advantages of the invention will become apparent as the description proceeds.