1. Field of Invention
The present invention relates generally to the field of computer graphics rendering, and more particularly, ways of and means for improving the performance of parallel graphics rendering processes supported on multiple 3D graphics processing pipeline (GPPL) platforms associated with diverse types of computing machinery, including, but not limited to, PC-level computers, game console systems, graphics-supporting application servers, and the like.
2. Brief Description of the State of Knowledge in the Art
There is a great demand for high-performance three-dimensional (3D) computer graphics systems in the fields of product design, simulation, virtual-reality, video-gaming, scientific research, and personal computing (PC). Clearly, a major goal of the computer graphics industry is to realize real-time photo-realistic 3D imagery on PC-based workstations, desktops, laptops, and mobile computing devices. In general, there are two fundamentally different classes of machines in the 3D computer graphics field, namely: (1) Object-Oriented Graphics Systems, wherein 3D scenes are represented as a complex of geometric objects (primitives) in 3D continuous geometric space, and 2D views or images of such 3D scenes are computed using geometrical projection, ray tracing, and light scattering/reflection/absorption modeling techniques, typically based upon laws of physics; and (2) VOlume ELement (VOXEL) Graphics Systems, wherein 3D scenes and objects are represented as a complex of voxels (x,y,z volume elements) represented in 3D Cartesian Space, and 2D views or images of such 3D voxel-based scenes are also computed using geometrical projection, ray tracing, and light scattering/reflection/absorption modeling techniques, again typically based upon laws of physics. Examples of early GDL-based graphics systems are disclosed in U.S. Pat. No. 4,862,155, whereas examples of early voxel-based 3D graphics systems are disclosed in U.S. Pat. No. 4,985,856, each incorporated herein by reference in its entirety. In the contemporary period, most PC-based computing systems include a 3D graphics subsystem based on the “Object-Oriented Graphics” system design.
In such graphics system design, “objects” within a 3D scene are represented by 3D geometrical models, and these geometrical models are typically constructed from continuous-type 3D geometric representations including, for example, 3D straight line segments, planar polygons, polyhedra, cubic polynomial curves, surfaces, volumes, circles, and quadratic objects such as spheres, cones, and cylinders (i.e. geometrical data and commands). These 3D geometrical representations are used to model various parts of the 3D scene or object, and are expressed in the form of mathematical functions evaluated over particular values of coordinates in continuous Cartesian space. Typically, the 3D geometrical representations of the 3D geometric model are stored in the format of a graphical display list (i.e. a structured collection of 2D and 3D geometric primitives). Currently, planar polygons, mathematically described by a set of vertices, are the most popular form of 3D geometric representation.
Once modeled using continuous 3D geometrical representations, the 3D scene is graphically displayed (as a 2D view of the 3D geometrical model) along a particular viewing direction, by repeatedly scan-converting the stream of graphics commands and data (GCAD). At the current state of the art, the scan-conversion process can be viewed as a “computational geometry” process which involves the use of (i) a geometry processor (i.e. geometry processing subsystem or engine) and (ii) a pixel processor (i.e. pixel processing subsystem or engine), which together transform (i.e. project, shade and color) the graphics objects and bit-mapped textures, respectively, into an unstructured matrix of pixels. The composed set of pixel data is stored within a 2D frame buffer (i.e. Z buffer) before being transmitted to and displayed on the surface of a display screen.
A video processor/engine refreshes the display screen using the pixel data stored in the 2D frame buffer. Any change in the 3D scene requires that the geometry and pixel processors repeat the whole computationally-intensive pixel-generation pipeline process, again and again, to meet the requirements of the graphics application at hand. For every small change or modification in the viewing direction of the human system user, the graphical display list must be manipulated and repeatedly scan-converted. This, in turn, causes both computational and buffer contention challenges which slow down the working rate of the graphics system. To accelerate this computationally-intensive graphics processing pipeline process, custom hardware, including geometry, pixel and video engines, has been developed and incorporated into most conventional graphics system designs.
In order to render a 3D scene (from its underlying graphics commands and data) and produce high-resolution graphical projections for display on a display device, such as an LCD panel, early 3D graphics systems attempted to relieve the host CPU of computational loading by employing a single graphics pipeline comprising a single graphics processing unit (GPU), supported by video memory.
As shown in FIGS. 1A1, 1A2 and 1A3, a typical PC-based graphics architecture has an external graphics card 105 comprising a graphics processing unit (GPU) and video memory. As shown, the graphics card is connected to the display 106 on one side, and to the CPU 101 through a bus (e.g. PCI-Express) 107 and Memory Bridge 103 (termed also “chipset”, e.g. 975 by Intel), on the other side. As shown in FIG. 1A3, the host CPU program/memory space stores the graphics applications, the standard graphics library, and the vendor's GPU drivers.
As shown in FIGS. 1B1, 1B2 and 1B3, a typical prior art PC-based computing system employs a conventional graphics architecture employing a North memory bridge with an integrated graphics device (IGD) 103. The IGD supports a single graphics pipeline process, and is operably coupled to a South bridge, via a PCI-express bus, for supporting the input/output ports of the system. As shown, the IGD includes a video engine, a 2D engine, a 3D engine, and a display engine.
As shown in FIG. 1B4, a prior art PC-based computing system employs a conventional Fusion-type CPU/GPU hybrid architecture, wherein a single GPU implemented on the same die as the CPU is used to support a graphics pipeline that drives an external display device. As shown, the motherboard supports the processor die, memory, a bridge with a display interface for connecting to a display device 106, and a PCI-express bus. As shown, the processor die supports a CPU 1241, a GPU 1242, L2 cache, buffers, an Interconnect (e.g. crossbar switch), a hyper transport mechanism and a memory controller.
As shown in FIG. 1C, the process of rendering three successive frames by a single GPU is graphically illustrated. Notably, this graphical rendering process may be supported using any of the single GPU-based computing systems described above. During operation, the application, assisted by the graphics library, creates a stream of graphics commands and data describing a 3D scene. The stream is then pipelined through the GPU's geometry and pixel subsystems so as to create a bitmap of pixels in the Frame Buffer, and finally a rendered image of the scene is displayed on a display screen. The generation of a sequence of successive frames produces a visual illusion of a dynamic picture.
Although the performance of single-GPU powered computing systems has greatly improved, such systems remain subject to the limitations of a single graphics pipeline. As shown in FIG. 1B5, the structure of a GPU subsystem 124 on a graphics card or in an IGD comprises: a video memory which is external to the GPU, and two 3D engines: (i) a transform bound geometry subsystem 224 for processing 3D graphics primitives; and (ii) a fill bound pixel subsystem 225. The video memory shares its storage resources among the geometry buffer 222, through which all geometric (i.e. polygonal) data is transferred, the commands buffer, the texture buffers 223, and the Frame Buffer 226.
Limitations of a single graphics pipeline arise from its typical bottlenecks. The first potential bottleneck 221 stems from transferring data from CPU to GPU. Two other bottlenecks are video memory related: geometry data memory limits 222, and texture data memory limits 223. There are two additional bottlenecks inside the GPU: transform bound 224 in the geometry subsystem, and fragment rendering 225 in pixel subsystem. These bottlenecks determine overall throughput. In general, the bottlenecks vary over the course of a graphics application.
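By way of illustration only, the effect of these bottlenecks on overall throughput can be sketched in Python as follows. The sketch assumes a simple serial-pipeline model in which the slowest stage sets the frame rate; the stage names and the rates are hypothetical values chosen for the example and are not taken from any figure or measurement.

```python
def pipeline_throughput(stage_rates):
    """Return (frames_per_second, limiting_stage) for a serial pipeline.

    Assumption: every frame passes through every stage, so the slowest
    stage alone sets the achievable frame rate.
    """
    limiting_stage = min(stage_rates, key=stage_rates.get)
    return stage_rates[limiting_stage], limiting_stage


# Hypothetical per-stage rates in frames per second (illustrative only).
rates = {
    "cpu_to_gpu_transfer": 240.0,  # bottleneck 221
    "geometry_memory": 300.0,      # bottleneck 222
    "texture_memory": 180.0,       # bottleneck 223
    "transform": 90.0,             # bottleneck 224
    "fill": 60.0,                  # bottleneck 225
}

fps, stage = pipeline_throughput(rates)
print(fps, stage)  # 60.0 fill
```

Because the limiting stage may change from one scene to the next, re-evaluating such a model with different rates yields a different limiting stage, which is the sense in which the bottlenecks vary over the course of an application.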
In high-performance graphics applications, the number of computations required to render a 3D scene and produce high-resolution graphical projections greatly exceeds the capabilities of systems employing a single GPU graphics subsystem. Consequently, the use of parallel graphics pipelines and multiple graphics processing units (GPUs) has become the rule for high-performance graphics system architecture and design, in order to relieve the overload presented by the different bottlenecks associated with single GPU graphics subsystems.
In FIG. 2A, there is shown an advanced chipset (e.g. Bearlake by Intel) having two buses 107, 108 instead of one, and allowing the interconnection of two external graphics cards in parallel: primary card 105 and secondary card 104, to share the computation load associated with the 3D graphics rendering process. As shown, the display 106 is attached to the primary card 105. It is anticipated that even more advanced commercial chipsets with greater than two buses will appear in the future, allowing the interconnection of more than two graphics cards.
As shown in FIG. 2B, the general software architecture of prior art graphics system 200 comprises: the graphics application 201, the standard graphics library 202, and the vendor's GPU drivers 203. This graphics software environment resides in the “program space” of main memory 102 on the host computer system. As shown, the graphics application 201 runs in the program space (i.e. memory space), building up the 3D scene, typically as a database of polygons, where each polygon is represented as a set of vertices. The vertices and other components of these polygons are transferred to the graphics card(s) for rendering, and displayed as a 2D image on the display screen.
In FIG. 2C, the structure of a GPU subsystem on the graphics card is shown comprising: a video memory disposed external to the GPU, and two 3D engines: (i) a transform bound geometry subsystem 224 for processing 3D graphics primitives; and (ii) a fill bound pixel subsystem 225. The video memory shares its storage resources among the geometry buffer 222, through which all geometric (i.e. polygonal) data is transferred, the commands buffer, the texture buffers 223, and the Frame Buffer FB 226.
As shown in FIG. 2C, the division of graphics data among GPUs reduces (i) the bottleneck 222 posed by the video memory footprint at each GPU, (ii) the transform bound processing bottleneck 224, and (iii) the fill bound processing bottleneck 225.
However, when using a multiple GPU graphics architecture of the type shown in FIGS. 2A through 2C, there is a need to distribute the computational workload associated with interactive parallel graphics rendering processes. To achieve this objective, two different kinds of parallel rendering methods have been applied to PC-based dual GPU graphics systems of the kind illustrated in FIGS. 2A through 2C, namely: the Time Division Method of Parallel Graphics Rendering illustrated in FIG. 2D; and the Image Division Method of Parallel Graphics Rendering illustrated in FIG. 2E.
Notably, a third type of method of parallel graphics rendering, referred to as the Object Division Method, has been developed over the years and practiced exclusively on complex computing platforms requiring expensive hardware for compositing the pixel output of the multiple graphics processing pipelines (GPPLs). The Object Division Method, illustrated in FIG. 3A, can be found applied on conventional graphics platforms of the kind shown in FIG. 3, as well as on specialized graphics computing platforms as described in US Patent Application Publication No. US 2002/0015055, assigned to Silicon Graphics, Inc. (SGI), published on Feb. 7, 2002, and incorporated herein by reference.
While the differences between the Image, Time and Object Division Methods of Parallel Graphics Rendering will be described below, it will be helpful to first briefly describe the five (5) basic stages or phases of the parallel graphics rendering process, which all three such methods of parallel rendering have in common, namely:
(1) the Decomposition Phase, wherein the 3D scene or object is analyzed and its corresponding graphics display list data and commands are assigned to particular graphics pipelines available on the parallel multiple GPU-based graphics platform;
(2) the Distribution Phase, wherein the graphics data and commands are distributed to particular available graphics processing pipelines determined during the Decomposition Phase;
(3) the Rendering Phase, wherein the geometry processing subsystem/engine and the pixel processing subsystem/engine along each graphics processing pipeline of the parallel graphics platform use the graphics data and commands distributed to that pipeline, and transform (i.e. project, shade and color) the graphics objects and bit-mapped textures into a subset of an unstructured matrix of pixels;
(4) the Recomposition Phase, wherein the parallel graphics platform uses the multiple sets of pixel data generated by each graphics pipeline to synthesize (or compose) a final set of pixels that are representative of the 3D scene (taken along the specified viewing direction), and this final set of pixel data is then stored in a frame buffer (FB); and
(5) the Display Phase, wherein the final set of pixel data is retrieved from the frame buffer and provided to the display screen of the system.
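For illustration only, the five phases enumerated above can be sketched in Python as follows. This is a toy model under stated assumptions, not an implementation of any disclosed platform: objects are hypothetical axis-aligned rectangles (`Rect`), the decomposition is a simple round-robin over objects, the parallel pipelines are simulated sequentially, and recomposition keeps the nearest depth at each pixel. All function and type names are invented for the example.

```python
from collections import namedtuple

# Hypothetical scene object: a flat rectangle with a depth and a one-character color.
Rect = namedtuple("Rect", "x0 y0 x1 y1 depth color")


def decompose(scene, n_pipelines):
    """Phase 1: decide which pipeline gets which object (round-robin here)."""
    return [i % n_pipelines for i in range(len(scene))]


def distribute(scene, assignment, n_pipelines):
    """Phase 2: hand each pipeline the objects assigned to it."""
    queues = [[] for _ in range(n_pipelines)]
    for obj, pipeline in zip(scene, assignment):
        queues[pipeline].append(obj)
    return queues


def render(objects):
    """Phase 3: rasterize one pipeline's objects into {(x, y): (depth, color)}."""
    pixels = {}
    for r in objects:
        for y in range(r.y0, r.y1):
            for x in range(r.x0, r.x1):
                if (x, y) not in pixels or r.depth < pixels[(x, y)][0]:
                    pixels[(x, y)] = (r.depth, r.color)
    return pixels


def recompose(partial_frames, width, height, background="."):
    """Phase 4: merge partial frames into one frame buffer; nearest depth wins."""
    nearest = {}
    for partial in partial_frames:
        for xy, (depth, color) in partial.items():
            if xy not in nearest or depth < nearest[xy][0]:
                nearest[xy] = (depth, color)
    return [[nearest.get((x, y), (None, background))[1] for x in range(width)]
            for y in range(height)]


def display(frame_buffer):
    """Phase 5: read the frame buffer out as rows of text."""
    return ["".join(row) for row in frame_buffer]


def render_in_parallel(scene, n_pipelines, width, height):
    assignment = decompose(scene, n_pipelines)
    queues = distribute(scene, assignment, n_pipelines)
    partials = [render(q) for q in queues]  # sequential stand-in for parallel GPUs
    return display(recompose(partials, width, height))


scene = [Rect(0, 0, 3, 2, depth=5, color="A"), Rect(2, 1, 4, 3, depth=1, color="B")]
for row in render_in_parallel(scene, n_pipelines=2, width=4, height=3):
    print(row)
# AAA.
# AABB
# ..BB
```

The three methods differ chiefly in what `decompose` and `recompose` do; the round-robin and depth-merge choices shown here are only one possible pairing.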
As will be explained below with reference to FIGS. 3B through 3D, each of these three different methods of parallel graphics rendering has both advantages and disadvantages.
Image Division (Sort-First) Method of Parallel Graphics Rendering
As illustrated in FIG. 2D, the Image Division (Sort-First) Method of Parallel Graphics Rendering distributes all graphics display list data and commands to each of the graphics pipelines, and decomposes the final view (i.e. projected 2D image) in Screen Space, so that each graphical contributor (e.g. graphics pipeline and GPU) renders a 2D tile of the final view. This mode has limited scalability due to the parallel overhead caused by objects rendered on multiple tiles. There are two image domain modes, both well known in the prior art. They differ in the way the final image is divided among GPUs.
(1) The Split Frame Rendering mode divides up the screen among GPUs into contiguous segments; e.g. with two GPUs, each one handles about one half of the screen. The exact division may change dynamically due to changing load across the screen image. This method is used in nVidia's SLI™ multiple-GPU graphics product.
(2) The Tiled Frame Rendering mode divides up the image into small tiles. Each GPU is assigned tiles that are spread out across the screen, contributing to good load balancing. This method is implemented by ATI's Crossfire™ multiple GPU graphics card solution.
In image division, the entire database is broadcast to each GPU for geometric processing. However, the processing load at each Pixel Subsystem is reduced to about 1/N, where N is the number of GPUs. This form of parallelism relieves the fill bound bottleneck 225. Thus, the image division method ideally suits graphics applications requiring intensive pixel processing.
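For illustration only, the difference between the two image-domain modes can be sketched in Python as follows. The tile grid, the per-tile pixel costs, and the function names are hypothetical; the split-frame assignment shown is static and does not model the dynamic adjustment of the division mentioned above, and the checkerboard interleave is merely one possible way of spreading tiles across GPUs.

```python
def split_frame_owner(tile_y, tiles_high, n_gpus):
    """Split Frame Rendering: one contiguous band of tile rows per GPU."""
    return tile_y * n_gpus // tiles_high


def tiled_owner(tile_x, tile_y, n_gpus):
    """Tiled Frame Rendering: interleave tiles across GPUs (checkerboard-like)."""
    return (tile_x + tile_y) % n_gpus


def per_gpu_load(cost, n_gpus, owner):
    """Sum the hypothetical pixel cost of the tiles owned by each GPU."""
    loads = [0] * n_gpus
    for ty, row in enumerate(cost):
        for tx, c in enumerate(row):
            loads[owner(tx, ty)] += c
    return loads


# Hypothetical 4x4 tile grid: the lower half of the screen is five times costlier.
cost = [[1] * 4, [1] * 4, [5] * 4, [5] * 4]

sfr = per_gpu_load(cost, 2, lambda tx, ty: split_frame_owner(ty, len(cost), 2))
tfr = per_gpu_load(cost, 2, lambda tx, ty: tiled_owner(tx, ty, 2))
print(sfr)  # [8, 40]
print(tfr)  # [24, 24]
```

In this example the static contiguous split leaves one GPU with five times the pixel work of the other, whereas spreading the tiles across the screen equalizes the load, which is the load-balancing benefit attributed to the tiled mode.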
Time Division (DPlex) Method of Parallel Graphics Rendering
As illustrated in FIG. 2F, the Time Division (DPlex) Method of Parallel Graphics Rendering distributes all display list graphics data and commands associated with a first scene to the first graphics pipeline, and all graphics display list data and commands associated with a second/subsequent scene to the second graphics pipeline, so that each graphics pipeline (and its individual rendering node or GPU) handles the processing of a full, alternating image frame. Notably, while this method scales very well, the latency between user input and final display increases with scale, which is often irritating for the user. Each GPU is given an extended time budget of N frame periods (for N parallel GPUs) to process a frame. Referring to FIG. 3, the bottlenecks relieved are those of transform bound 224 at the geometry subsystem, and fill bound 225 at the pixel subsystem. However, with large data sets, each GPU must access all of the data. This requires either maintaining multiple copies of large data sets or creating possible access conflicts to the source copy at the host, aggravating the video memory bottlenecks 222, 223 and the data transfer bottleneck 221.
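For illustration only, the frame assignment and time budget of the time division method can be sketched in Python as follows. The sketch assumes a fixed display frame period and a round-robin assignment of frames to GPUs; the function name, the tuple layout, and the 10 ms period are hypothetical.

```python
def time_division_schedule(n_frames, n_gpus, frame_period_ms):
    """Return (frame, gpu, start_ms, deadline_ms) for each frame.

    Assumptions: frames are issued one per display period, frame f goes to
    GPU f % n_gpus, and each GPU may take up to n_gpus periods per frame.
    """
    plan = []
    for frame in range(n_frames):
        start = frame * frame_period_ms
        deadline = start + n_gpus * frame_period_ms
        plan.append((frame, frame % n_gpus, start, deadline))
    return plan


for entry in time_division_schedule(n_frames=4, n_gpus=2, frame_period_ms=10):
    print(entry)
# (0, 0, 0, 20)
# (1, 1, 10, 30)
# (2, 0, 20, 40)
# (3, 1, 30, 50)
```

Under this model the interval between a frame being issued and its deadline is N frame periods, so doubling the number of GPUs doubles the worst-case delay between input and display even though the frame rate is maintained.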
Object Division (Sort-Last) Method of Parallel Graphics Rendering
As illustrated in FIG. 3B, the Object Division (Sort-Last) Method of Parallel Graphics Rendering decomposes the 3D scene (i.e. rendered database), distributes the graphics display list data and commands associated with each portion of the scene to a particular graphics pipeline (i.e. rendering unit), and recombines the partially rendered pixel frames during recomposition. The geometric database is therefore divided among GPUs, reducing the load on the geometry buffer, the geometry subsystem, and even to some extent, the pixel subsystem. The main concern is how to divide the data in order to keep the load balanced. An exemplary multiple-GPU platform for supporting the object-division method of FIG. 3B is shown in FIG. 3A. The platform requires complex and costly pixel compositing hardware which prevents its current application in a modern PC-based computer architecture.
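For illustration only, the load-balancing concern noted above can be sketched in Python as follows. The sketch uses a generic greedy heuristic (largest object first, each assigned to the least-loaded GPU) and treats polygon count as a proxy for rendering cost; both are assumptions made for the example rather than a method described in the prior art discussed here, and the object names and counts are hypothetical.

```python
import heapq


def balance_objects(polygon_counts, n_gpus):
    """Greedy split of scene objects across GPUs.

    Assumption: an object's rendering cost is proportional to its polygon
    count. Objects are taken largest first and each one is assigned to the
    GPU with the smallest load so far.
    """
    heap = [(0, gpu) for gpu in range(n_gpus)]  # (current load, gpu index)
    heapq.heapify(heap)
    assignment = {}
    for name, count in sorted(polygon_counts.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        assignment[name] = gpu
        heapq.heappush(heap, (load + count, gpu))
    loads = [0] * n_gpus
    for name, gpu in assignment.items():
        loads[gpu] += polygon_counts[name]
    return assignment, loads


# Hypothetical scene: object names and polygon counts are illustrative only.
scene = {"terrain": 9000, "car": 4000, "tree": 3000, "sign": 2000}
assignment, loads = balance_objects(scene, n_gpus=2)
print(assignment)  # {'terrain': 0, 'car': 1, 'tree': 1, 'sign': 1}
print(loads)       # [9000, 9000]
```

Each GPU then renders only its own objects, and the partial frames still have to be merged by depth during recomposition, which is the step that requires the compositing hardware referred to above.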
Today, real-time graphics applications, such as advanced video games, are more demanding than ever, utilizing massive textures, an abundance of polygons, high depth-complexity, anti-aliasing, multi-pass rendering, etc., with such demands growing exponentially over time.
Conventional PC-level dual-mode parallel graphics systems employing multiple GPUs, such as nVidia's SLI™ multiple-GPU graphics platform, support either the Time Division Mode (termed Alternate Frame Rendering) of parallelism, or the Image Division Mode (termed Split Frame Rendering) of parallelism, which is automatically selected during application set-up (e.g. by the vendor's driver). However, once a graphics-based application is set up and the time or image division mode of parallel operation is selected, the selected mode of parallel operation remains fixed during application run-time.
Clearly, conventional PC-based graphics systems fail to address the dynamically changing needs of modern graphics applications. By their very nature, prior art PC-based graphics systems are unable to resolve the variety of bottlenecks (e.g. geometry limited, pixel limited, data transfer limited, and memory limited) summarized in FIG. 3C1, that dynamically arise along 3D graphic pipelines. Consequently, such prior art graphics systems are often unable to maintain a high and steady level of performance throughout a particular graphics application.
Indeed, a given graphics processing pipeline along a parallel graphics rendering system is only as strong as the weakest link of its stages, and thus a single bottleneck determines the overall throughput along the graphics pipelines, resulting in unstable frame-rate, poor scalability, and poor performance.
While each parallelization mode described above and summarized in FIG. 3C2 solves part of the bottleneck dilemma currently existing along the PC-based graphics pipelines, no one parallelization method, in and of itself, is sufficient to resolve all bottlenecks in demanding graphics applications and enable the quantum leaps in graphics performance necessary for the photo-realistic imagery demanded in real-time interactive graphics environments.
Thus, there is a great need in the art for a new and improved way of and means for practicing parallel 3D graphics rendering processes in modern multiple-GPU based computer graphics systems, while avoiding the shortcomings and drawbacks of such prior art methodologies and apparatus.