1. Field of Invention
The present invention relates to new and improved ways of and means for carrying out the object division method of parallel graphics rendering on multiple GPU-based graphics platforms associated with diverse types of computing machinery.
2. Brief Description of the State of the Knowledge in the Art
There is a great demand for high performance three-dimensional (3D) computer graphics systems in the fields of product design, simulation, virtual-reality, video-gaming, scientific research, and personal computing (PC). Clearly a major goal of the computer graphics industry is to realize real-time photo-realistic 3D imagery on PC-based workstations, desktops, laptops, and mobile computing devices.
In general, there are two fundamentally different classes of machines in the 3D computer graphics field, namely: (1) Graphical Display List (GDL) based systems, wherein 3D scenes and objects are represented as a complex of geometric models (primitives) in 3D continuous geometric space, and 2D views or images of such 3D scenes are computed using geometrical projection, ray tracing, and light scattering/reflection/absorption modeling techniques, typically based upon laws of physics; and (2) VOlume ELement (VOXEL) based systems, wherein 3D scenes and objects are represented as a complex of voxels (x,y,z volume elements) represented in 3D Cartesian Space, and 2D views or images of such 3D voxel-based scenes are also computed using geometrical projection, ray tracing, and light scattering/reflection/absorption modeling techniques, again typically based upon laws of physics. Examples of early GDL-based graphics systems are disclosed in U.S. Pat. No. 4,862,155, whereas examples of early voxel-based 3D graphics systems are disclosed in U.S. Pat. No. 4,985,856, each incorporated herein by reference in its entirety.
In the contemporary period, most PC-based computing systems include a 3D graphics subsystem based on the “graphics display list (GDL)” system design. In such graphics system design, “objects” within a 3D scene are represented by 3D geometrical models, and these geometrical models are typically constructed from continuous-type 3D geometric representations including, for example, 3D straight line segments, planar polygons, polyhedra, cubic polynomial curves, surfaces, volumes, circles, and quadratic objects such as spheres, cones, and cylinders. These 3D geometrical representations are used to model various parts of the 3D scene or object, and are expressed in the form of mathematical functions evaluated over particular values of coordinates in continuous Cartesian space. Typically, the 3D geometrical representations of the 3D geometric model are stored in the format of a graphical display list (i.e. a structured collection of 2D and 3D geometric primitives). Currently, planar polygons, mathematically described by a set of vertices, are the most popular form of 3D geometric representation.
Once modeled using continuous 3D geometrical representations, the 3D scene is graphically displayed (as a 2D view of the 3D geometrical model) along a particular viewing direction, by repeatedly scan-converting the graphical display list. At the current state of the art, the scan-conversion process can be viewed as a “computational geometry” process which involves the use of (i) a geometry processor (i.e. geometry processing subsystem or engine) and (ii) a pixel processor (i.e. pixel processing subsystem or engine), which together transform (i.e. project, shade and color) the display-list objects and bit-mapped textures, respectively, into an unstructured matrix of pixels. The composed set of pixel data is stored within a 2D frame buffer (i.e. Z buffer) before being transmitted to and displayed on the surface of a display screen.
A video processor/engine refreshes the display screen using the pixel data stored in the 2D frame buffer. Any change in the 3D scene requires that the geometry and pixel processors repeat the whole computationally-intensive pixel-generation pipeline process, again and again, to meet the requirements of the graphics application at hand. For every small change or modification in viewing direction of the human system user, the graphical display list must be manipulated and repeatedly scan-converted. This, in turn, causes both computational and buffer contention challenges which slow down the working rate of the graphics system. To accelerate this computationally-intensive pipeline process, custom hardware, including geometry, pixel and video engines, has been developed and incorporated into most conventional “graphics display-list” system designs.
In high-performance graphics applications, the number of computations required to render a 3D scene (from its underlying graphical display lists) and produce high-resolution graphical projections greatly exceeds the capabilities of systems employing a single graphics processing unit (GPU). Consequently, the use of parallel graphics pipelines, and multiple graphics processing units (GPUs), has become the rule for high-performance graphics system architecture and design.
In order to distribute the computational workload associated with interactive parallel graphics rendering processes, three different methods of graphics rendering have been developed over the years. These three basic methods of parallel graphics rendering are illustrated in FIGS. 1A through 1C. While these three methods of parallel graphics rendering are different in ways which will be described below, they each have five (5) basic stages or phases in common, namely:
(1) the Decomposition Phase, wherein the 3D scene or object is analyzed and its corresponding graphics display list data and commands are assigned to particular graphics pipelines available on the parallel multiple GPU-based graphics platform;
(2) the Distribution Phase, wherein the graphics display list data and commands are distributed to particular available graphics pipelines determined during the Decomposition Phase;
(3) the Rendering Phase, wherein the geometry processing subsystem/engine and the pixel processing subsystem/engine along each graphics pipeline of the parallel graphics platform use the graphics display list data and commands distributed to that pipeline, and transform (i.e. project, shade and color) the display-list objects and bit-mapped textures into a subset of the unstructured matrix of pixels;
(4) the Recomposition Phase, wherein the parallel graphics platform uses the multiple sets of pixel data generated by each graphics pipeline to synthesize (or compose) a final set of pixels that are representative of the 3D scene (taken along the specified viewing direction), and this final set of pixel data is then stored in a frame buffer; and
(5) the Display Phase, wherein the final set of pixel data is retrieved from the frame buffer and provided to the screen of the display device of the system. As will be explained below with reference to FIGS. 1A through 1C, each of these methods of parallel graphics rendering has both advantages and disadvantages.
Image Division Method of Parallel Graphics Rendering
As illustrated in FIG. 1A, the Image Division (Sort-First) Method of Parallel Graphics Rendering distributes all graphics display list data and commands to each of the graphics pipelines, and decomposes the final view (i.e. projected 2D image) in Screen Space, so that each graphical contributor (e.g. graphics pipeline and GPU) renders a 2D tile of the final view. This mode has limited scalability due to the parallel overhead caused by objects rendered on multiple tiles.
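The tile overhead mentioned above can be illustrated with a minimal sketch. The screen size, the 2x1 tile grid, the screen-space bounding boxes, and the helper names (`make_tiles`, `overlaps`, `assign_objects`) are all hypothetical and chosen only for illustration; they are not taken from FIG. 1A.

```python
from typing import List, Tuple

# Hypothetical screen-space rectangle: (x0, y0, x1, y1), half-open.
Rect = Tuple[int, int, int, int]


def make_tiles(width: int, height: int, cols: int, rows: int) -> List[Rect]:
    """Split a width x height screen into cols x rows rectangular tiles."""
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            tiles.append((x0, y0, x1, y1))
    return tiles


def overlaps(a: Rect, b: Rect) -> bool:
    """True when two half-open rectangles share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def assign_objects(tiles: List[Rect], boxes: List[Rect]) -> List[List[int]]:
    """For each tile, list the indices of objects whose box touches that tile."""
    return [[i for i, box in enumerate(boxes) if overlaps(tile, box)]
            for tile in tiles]


# Two side-by-side tiles, one per pipeline (assumed layout).
tiles = make_tiles(640, 480, cols=2, rows=1)
# Assumed bounding boxes; the second one straddles the tile boundary at x = 320.
boxes = [(10, 10, 100, 100), (300, 200, 340, 240), (500, 50, 600, 150)]
per_tile = assign_objects(tiles, boxes)
total_submissions = sum(len(objs) for objs in per_tile)
```

In this example three objects produce four object-to-tile assignments, because the straddling object has to be processed by both pipelines. As tiles get smaller, more objects straddle boundaries, which is one way to picture the scalability limit.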
Time Division (DPlex) Method of Parallel Graphics Rendering
As illustrated in FIG. 1B, the Time Division (DPlex) Method of Parallel Graphics Rendering distributes all graphics display list data and commands associated with a first scene to the first graphics pipeline, and all graphics display list data and commands associated with a second/subsequent scene to the second graphics pipeline, so that each graphics pipeline (and its individual rendering node or GPU) handles the processing of a full, alternating image frame. Notably, while this method scales very well, the latency between user input and final display increases with scale, which is often irritating for the user.
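The alternating-frame assignment can be sketched as a round-robin schedule. This is a simplified model under stated assumptions: every frame takes the same `render_ms` on every pipeline, and the function names and timing values are hypothetical. In this model a displayed frame always reflects input sampled `render_ms` earlier, while the interval between frames shrinks as pipelines are added, so the latency measured in displayed frames grows with the pipeline count. That is one reading of the latency remark above, not a measurement.

```python
def schedule(num_frames, num_pipelines, render_ms):
    """Round-robin plan: a list of (frame, pipeline, start_ms, done_ms)."""
    interval_ms = render_ms / num_pipelines  # a new frame starts this often
    plan = []
    for frame in range(num_frames):
        pipeline = frame % num_pipelines     # alternate frames across pipelines
        start = frame * interval_ms
        plan.append((frame, pipeline, start, start + render_ms))
    return plan


def frames_in_flight(plan, at_ms):
    """Count frames that have started but are not yet finished at time at_ms."""
    return sum(1 for _, _, start, done in plan if start <= at_ms < done)


# Assumed 40 ms per frame; compare one pipeline against four.
single = schedule(num_frames=8, num_pipelines=1, render_ms=40)
quad = schedule(num_frames=8, num_pipelines=4, render_ms=40)
```

With these assumed numbers, one frame is in flight at t = 35 ms on a single pipeline, and four are in flight on four pipelines.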
Object Division (Sort-Last) Method of Parallel Graphics Rendering
As illustrated in FIG. 1C, the Object Division (Sort-last) Method of Parallel Graphics Rendering decomposes the 3D scene (i.e. rendered database) and distributes graphics display list data and commands associated with a portion of the scene to the particular graphics pipeline (i.e. rendering unit), and recombines the partially rendered pixel frames, during recomposition. This mode scales the rendering process very well, but implementation of the recomposition step is very expensive due to the amount of pixel data processing required during recomposition. Consequently, the practice of the Object Division Method of Parallel Graphics Rendering has not been commercially feasible in the affordable PC computing marketplace, while the Image and Time Division Methods of Parallel Graphics Rendering are being widely practiced in commercial PC-based graphics products, as indicated above.
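The decomposition step of object division can be sketched as a partition of scene objects across pipelines. The text above does not say how objects are assigned, so the greedy "largest object to the least-loaded pipeline" heuristic below is an assumption, as are the object names, the polygon counts, and the function name `decompose`.

```python
import heapq


def decompose(objects, num_pipelines):
    """Assign (name, polygon_count) objects to pipelines.

    Assumed heuristic: take objects from largest to smallest and give each
    one to the pipeline with the smallest polygon load so far.
    """
    heap = [(0, p) for p in range(num_pipelines)]  # (current load, pipeline id)
    heapq.heapify(heap)
    parts = [[] for _ in range(num_pipelines)]
    for name, polys in sorted(objects, key=lambda o: -o[1]):
        load, p = heapq.heappop(heap)
        parts[p].append(name)
        heapq.heappush(heap, (load + polys, p))
    return parts


# Hypothetical scene: object names and polygon counts are illustrative only.
scene = [("terrain", 500), ("car", 300), ("tree", 200), ("sign", 100)]
parts = decompose(scene, num_pipelines=2)
```

Each pipeline then renders only its own sub-scene into a full-size frame buffer, which is why a separate recomposition step is needed afterwards.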
A primary and highly desirable advantage associated with the Object Division Method of Parallel Graphics Rendering stems from dividing the stream of graphic display commands and data into partial streams, targeted to different GPUs, thereby removing traditional bottlenecks associated with polygon and texture data processing. Applications with massive polygon data (such as CAD) or massive texture data (such as high-quality video games) are able to take the most advantage of this kind of graphics rendering parallelism. Thus, there is a real need for CAD workers and video gamers, who typically use PC-based computing systems and workstations, to have access to computer graphics subsystems that support the Object Division Method of Parallel Graphics Rendering.
Particular Prior Art Examples of Time Division and Image Division Methods of Parallel Graphics Rendering
In order to increase the level of parallelism and thus rendering performance of conventional PC-based graphics systems (i.e. beyond the converge limitations of a single-core GPU), it is now popular for conventional PC computing platforms to practice the Image and Time Division Methods of Parallel Graphics Rendering using either multiple GPU-based graphics cards, or multiple GPU chips on a graphics card. As shown in FIGS. 2A and 2B, this parallel graphics processing technique is practiced today in a number of commercial products (e.g. the SLI™ product design by Nvidia, and the Crossfire™ product design by ATI), employing a dual-card graphics subsystem, and supporting both Image and Time Division Methods/Modes of Parallel Graphics Rendering.
As shown in FIG. 2A, the PC motherboard is populated with a CPU (201) that is equipped with a memory bridge (i.e. “chipset,” 203) (e.g. nForce 680 by Nvidia). The memory bridge supports two PCI-express buses (207, 208) which are capable of driving two external graphics cards. As shown in FIG. 2A, the primary graphics card (205) and the secondary graphics card (204) are attached to a display device (206) such as an LCD panel. As shown in FIG. 2B, the architecture of a typical Shader-based graphics card (204, 205) comprises a GPU (212) and video memory (213). The GPU comprises a geometry subsystem (which is transform bound) and a pixel subsystem (which is fill bound). The video memory (213) comprises texture memory (218), a frame buffer (216), a command buffer, and a vertex buffer. The stream of graphics (display list) data and commands, originating at the host CPU, describes the 3D scene in terms of polygon vertices and bit-mapped textures. As shown, this data stream is provided to the video memory (213) via PCI-express bus (207 or 208). In the shader-based GPU, the texture memory (218) plays a central role, and is accessible to and from the chip input, the vertex and fragment shaders, FB (216), and the blend & raster ops unit (217). As shown, the shader hardware (214, 215) is realized as a programmable parallel array of processing elements running shader source code written in a graphics-specific programming language. Notably, the Vertex Shader (214) specializes in vertex data processing, whereas the Fragment Shader (215) specializes in pixel data processing.
Particular Prior Art Examples of Object Division Method of Parallel Graphics Rendering
In FIG. 3A1, there is shown a parallel graphics system supporting the Object Division Method of Parallel Graphics Rendering, as illustrated in FIG. 1C, but with further emphasis on the Recomposition Stage which is shown carried out using specialized apparatus. In FIG. 3A2, the basic object division recomposition process carried out by such specialized apparatus is schematically illustrated in the form of a flow chart. As described in FIG. 3A2, the first step of this pixel composition process involves accessing images (pixel data sets) from first and second frame buffers (FB1, FB2), each having a color value buffer and a depth value (Z) buffer. The second step involves performing a relatively simple process for each x,y pixel value in the frame buffers: advance to the next x,y location; for x,y, compare the depth values in the Z buffers and select the lower value, which corresponds to the pixel value closest to the viewer (along the specified viewing direction); move the corresponding pixel value from the color buffer associated with the winning Z buffer to the final FB. Then determine whether or not all x,y values in the image have been processed as described above. If not, then return to the beginning of the processing loop as shown, and continue to process all pixel values in the image until the composition process is completed.
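The depth-compare loop described above can be sketched as follows. This is a minimal illustration, not a reproduction of the FIG. 3A2 flow chart: buffers are assumed to be nested Python lists indexed as `[y][x]`, smaller depth is assumed to mean closer to the viewer, and ties are resolved in favor of the first buffer (the text does not specify tie handling). The function name `compose` and the sample values are hypothetical.

```python
def compose(color1, depth1, color2, depth2):
    """Build the final color buffer from two color/depth buffer pairs.

    For every (x, y), the color whose depth value is lower (closer to the
    viewer) is copied to the final buffer. Ties go to buffer 1 (assumption).
    """
    height, width = len(depth1), len(depth1[0])
    final = [[None] * width for _ in range(height)]
    for y in range(height):          # visit every pixel location in turn
        for x in range(width):
            if depth1[y][x] <= depth2[y][x]:
                final[y][x] = color1[y][x]
            else:
                final[y][x] = color2[y][x]
    return final


# Hypothetical 2x2 frame buffers, standing in for FB1 and FB2.
color1 = [["red", "red"], ["red", "red"]]
depth1 = [[0.2, 0.9], [0.5, 0.5]]
color2 = [["blue", "blue"], ["blue", "blue"]]
depth2 = [[0.6, 0.3], [0.5, 0.1]]
final_fb = compose(color1, depth1, color2, depth2)
```

Every pixel location is read from both buffers and tested once, so the work is proportional to the full image size regardless of how little of the scene each pipeline actually drew.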
In FIGS. 3B1 and 3B2, there is shown a prior art multiple GPU-based graphics subsystem having multiple graphics pipelines with multiple GPUs supporting the Object Division Method of Parallel Graphics Rendering, using dedicated/specialized hardware to perform the basic image recomposition process illustrated in FIG. 3A2. Examples of prior art parallel graphics systems based on this design include: the Chromium™ Parallel Graphics System developed by researchers and engineers of Stanford University, and employing Binaryswap SPU hardware to carry out the image (re)composition process illustrated in FIG. 3A2; and HP Corporation's PixelFlow (following its development at the University of North Carolina at Chapel Hill) employing parallel pipeline hardware, and SGI's Origin 2000 Supercomputer Shared Memory Compositor method (known also as “Direct Send”) on distributed memory architecture.
As shown in FIG. 3B1, the application's rendering code (301), which is representative of a 3D scene to be viewed from a particular viewing direction, is decomposed into two streams of graphics (display list) data and commands (302). These streams of graphics data and commands (302) are distributed (303) to the multiple graphics processing pipelines for rendering (304). Each GPU in its pipeline participates in only a fraction of the overall computational workload. Each frame buffer (FB) holds a full 2D image (i.e. frame of pixel data) of a sub-scene. According to this prior art method of Object Division, the full image of the 3D scene must be then composed from the viewing direction, using these two full 2D images, and this compositing process involves testing each and every pixel location for the pixel that is closest to the eye of the viewer (305). Consequently, recomposition according to this prior art Object Division Method of Parallel Graphics Rendering is expensive due to the amount of pixel data processing required during recomposition. The recomposed final FB is ultimately sent to the display device (306) for display to the human viewer.
As shown in FIG. 3B2, the dedicated/specialized hardware-based recomposition stage/phase of the object division mode of the parallel graphics rendering process of FIG. 3B1 comprises multiple stages of frame buffers (FBs), wherein each graphics pipeline will have at least one FB. In each FB, there is buffered image data comprising pixel color and depth (z) values. These pixel color and depth (z) values are processed according to the basic pixel processing algorithm of FIG. 3A2, so as to ultimately compose the final pixel data set (i.e. image) which is stored in the final frame buffer. The pixel data stored in the final frame buffer is then ultimately used to display the image on the screen of the display device using conventional video processing and refreshing techniques generally known in the art. Notably, the more graphics processing pipelines (GPUs) that are employed in the parallel graphics rendering platform, the more complex and expensive the dedicated hardware becomes to practice this prior art hardware-based recomposition technique during the object division mode of such a parallel graphics rendering platform.
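How the recomposition workload grows with the number of pipelines can be illustrated by merging frame buffers pairwise, stage by stage. The pairwise tree arrangement is an assumption made for this sketch; the text above says only that there are multiple stages of frame buffers. Each pixel is modeled as a hypothetical `(depth, color)` pair, and the names `merge` and `compose_tree` are illustrative.

```python
INF = float("inf")  # depth of a pixel the pipeline did not draw (assumption)


def merge(fb_a, fb_b):
    """Per-pixel depth test over two flat buffers of (depth, color) pairs."""
    return [a if a[0] <= b[0] else b for a, b in zip(fb_a, fb_b)]


def compose_tree(frame_buffers):
    """Merge buffers pairwise until one remains.

    Returns (final_buffer, number_of_stages, number_of_pixel_tests).
    """
    stage = list(frame_buffers)
    stages = 0
    pixel_tests = 0
    while len(stage) > 1:
        nxt = []
        for i in range(0, len(stage) - 1, 2):
            nxt.append(merge(stage[i], stage[i + 1]))
            pixel_tests += len(stage[i])     # one depth test per pixel
        if len(stage) % 2:                   # odd buffer passes through
            nxt.append(stage[-1])
        stage = nxt
        stages += 1
    return stage[0], stages, pixel_tests


# Four hypothetical pipelines, each drawing one pixel of a 4-pixel image.
fbs = [[(k + 1.0, "g%d" % k) if i == k else (INF, None) for i in range(4)]
       for k in range(4)]
final_fb, stages, pixel_tests = compose_tree(fbs)
```

With N frame buffers of P pixels each, this arrangement performs (N - 1) x P depth tests, here 3 x 4 = 12 over two stages, so both the number of merge units and the pixel traffic grow with the number of pipelines.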
In FIGS. 3C1, 3C2 and 3C3, there is shown a prior art multiple GPU-based graphics subsystem having multiple graphics pipelines with multiple GPUs supporting the Object Division Method of Parallel Graphics Rendering, using a dedicated/specialized software solution to perform the basic image recomposition process illustrated in FIG. 3A2. Examples of prior art parallel graphics systems based on this design include: the Onyx® Parallel Graphics System developed by SGI, and employing pseudocode illustrated in FIGS. 3C2 and 3C3, to carry out the image (re)composition process illustrated in FIG. 3A2.
As shown in FIG. 3C1, the application's rendering code (301), which is representative of a 3D scene to be viewed from a particular viewing direction, is decomposed into two streams of graphics (display list) data and commands (302). These streams of graphics data and commands (302) are distributed (303) to the multiple graphics processing pipelines for rendering (304). Each GPU in its pipeline participates in only a fraction of the overall computational workload. Each frame buffer (FB) holds a full 2D image (i.e. frame of pixel data) of a sub-scene. According to this prior art method of Object Division, the full image of the 3D scene must be then composed from the viewing direction, using these two full 2D images, and this compositing process involves testing each and every pixel location for the pixel that is closest to the eye of the viewer (305). Consequently, recomposition according to this prior art Object Division Method of Parallel Graphics Rendering is expensive due to the amount of pixel data processing required during recomposition. The recomposed final FB is ultimately sent to the display device (306) for display to the human viewer.
In FIG. 3C2, the software-based recomposition stage/phase of the object division mode of the parallel graphics rendering process of FIG. 3C1 is schematically illustrated in greater detail. As shown, this prior art image (re)composition process involves using a dedicated/specialized computational platform to implement the basic pixel processing algorithm of FIG. 3A2. In general, this dedicated/specialized computational platform comprises a plurality of CPUs for accessing and composite-processing the pixel color and z depth values of the pixel data sets buffered in the frame buffers (FBs) of each graphics pipeline supported on the parallel graphics platform. In the FB of each graphics pipeline (i.e. GPU), there is buffered image data comprising pixel color and depth (z) values. In FIG. 3C2, there is shown an illustrative example of a dedicated software-based recomposition platform employing two CPUs, and a final frame buffer FB0, to support a dual GPU-based parallel graphics rendering platform. The pixel color and depth (z) values stored in FB1 and FB2 are processed according to the basic pixel processing algorithm of FIG. 3A2, so as to ultimately compose the final pixel data set (i.e. image) which is stored in the final frame buffer FB0. FIG. 3C3 shows pseudocode that is executed by each CPU on the recomposition platform in order to carry out the pixel processing algorithm described in FIG. 3A2. The pixel data stored in the final frame buffer is then ultimately used to display the image on the screen of the display device using conventional video processing and refreshing techniques generally known in the art. Notably, the more graphics processing pipelines (GPUs) that are employed in the parallel graphics rendering platform, the more complex and expensive the software-based recomposition platform becomes to practice this prior art software-based recomposition technique during the object division mode of such a parallel graphics rendering platform.
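A software-based recomposition of this general kind can be sketched with two workers that each compose a share of the image rows into FB0. The pseudocode of FIG. 3C3 is not reproduced in this text, so the following is an assumed arrangement rather than that pseudocode: the split by rows, the use of Python threads to stand in for the two CPUs, the `(depth, color)` pixel layout, and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def compose_rows(fb1, fb2, fb0, row_start, row_end):
    """Depth-test rows [row_start, row_end) of FB1 and FB2 into FB0."""
    for y in range(row_start, row_end):
        for x in range(len(fb0[y])):
            p1, p2 = fb1[y][x], fb2[y][x]
            fb0[y][x] = p1 if p1[0] <= p2[0] else p2  # lower depth wins


def compose_parallel(fb1, fb2, workers=2):
    """Split the image rows among workers; each writes a disjoint part of FB0."""
    height = len(fb1)
    fb0 = [[None] * len(row) for row in fb1]
    bounds = [(w * height // workers, (w + 1) * height // workers)
              for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compose_rows, fb1, fb2, fb0, a, b)
                   for a, b in bounds]
        for f in futures:
            f.result()  # re-raise any worker error
    return fb0


# Hypothetical 2x2 buffers of (depth, color) pixels.
fb1 = [[(0.2, "r"), (0.9, "r")], [(0.5, "r"), (0.7, "r")]]
fb2 = [[(0.6, "b"), (0.3, "b")], [(0.4, "b"), (0.8, "b")]]
fb0 = compose_parallel(fb1, fb2)
```

Because the workers write disjoint rows, no locking is needed in this sketch. The total number of pixel tests is unchanged; the work is only spread across more processors.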
In both prior art parallel graphics systems described in FIGS. 3B1 and 3B2 and 3C1 through 3C3, the image recomposition step requires the use of dedicated or otherwise specialized computational apparatus which, when taken together with the cost associated with computational machinery within the multiple GPUs to support the rendering phase of the parallel graphics process, has put the Object Division Method outside the limits of practicality and feasibility for use in connection with PC-based computing systems.
Thus, there is a great need in the art for a new and improved way of and means for practicing the object division method of parallel graphics rendering in computer graphics systems, while avoiding the shortcomings and drawbacks of such prior art methodologies and apparatus.