1. Field of Invention
The present invention relates generally to the field of computer graphics rendering, and more particularly, ways of and means for improving the performance of parallel graphics rendering processes supported on multiple GPU-based 3D graphics platforms associated with diverse types of computing machinery.
2. Brief Description of the State of Knowledge in the Art
There is a great demand for high performance three-dimensional (3D) computer graphics systems in the fields of product design, simulation, virtual-reality, video-gaming, scientific research, and personal computing (PC). Clearly a major goal of the computer graphics industry is to realize real-time photo-realistic 3D imagery on PC-based workstations, desktops, laptops, and mobile computing devices.
In general, there are two fundamentally different classes of machines in the 3D computer graphics field, namely: (1) Object-Oriented Graphics Systems, also known as Graphical Display List (GDL) Graphics Systems, wherein 3D scenes are represented as a complex of geometric objects (primitives) in 3D continuous geometric space, and 2D views or images of such 3D scenes are computed using geometrical projection, ray tracing, and light scattering/reflection/absorption modeling techniques, typically based upon laws of physics; and (2) VOlume ELement (VOXEL) Graphics Systems, wherein 3D scenes and objects are represented as a complex of voxels (x,y,z volume elements) represented in 3D Cartesian Space, and 2D views or images of such 3D voxel-based scenes are also computed using geometrical projection, ray tracing, and light scattering/reflection/absorption modeling techniques, again typically based upon laws of physics. Examples of early GDL-based graphics systems are disclosed in U.S. Pat. No. 4,862,155, whereas examples of early voxel-based 3D graphics systems are disclosed in U.S. Pat. No. 4,985,856, each incorporated herein by reference in its entirety.
In the contemporary period, most PC-based computing systems include a 3D graphics subsystem based the “Object-Orient Graphics” (or Graphical Display List) system design. In such graphics system design, “objects” within a 3D scene are represented by 3D geometrical models, and these geometrical models are typically constructed from continuous-type 3D geometric representations including, for example, 3D straight line segments, planar polygons, polyhedra, cubic polynomial curves, surfaces, volumes, circles, and quadratic objects such as spheres, cones, and cylinders. These 3D geometrical representations are used to model various parts of the 3D scene or object, and are expressed in the form of mathematical functions evaluated over particular values of coordinates in continuous Cartesian space. Typically, the 3D geometrical representations of the 3D geometric model are stored in the format of a graphical display list (i.e. a structured collection of 2D and 3D geometric primitives). Currently, planar polygons, mathematically described by a set of vertices, are the most popular form of 3D geometric representation.
Once modeled using continuous 3D geometrical representations, the 3D scene is graphically displayed (as a 2D view of the 3D geometrical model) along a particular viewing direction, by repeatedly scan-converting the graphical display list. At the current state of the art, the scan-conversion process can be viewed as a “computational geometry” process which involves the use of (i) a geometry processor (i.e. geometry processing subsystem or engine) as well as a pixel processor (i.e. pixel processing subsystem or engine) which together transform (i.e. project, shade and color) the display-list objects and bit-mapped textures, respectively, into an unstructured matrix of pixels. The composed set of pixel data is stored within a 2D frame buffer (i.e. Z buffer) before being transmitted to and displayed on the surface of a display screen.
A video processor/engine refreshes the display screen using the pixel data stored in the 2D frame buffer. Any changes in the 3D scene requires that the geometry and pixel processors repeat the whole computationally-intensive pixel-generation pipeline process, again and again, to meet the requirements of the graphics application at hand. For every small change or modification in viewing direction of the human system user, the graphical display list must be manipulated and repeatedly scan-converted. This, in turn, causes both computational and buffer contention challenges which slow down the working rate of the graphics system. To accelerate this computationally-intensive pipeline process, custom hardware including geometry, pixel and video engines, have been developed and incorporated into most conventional “graphics display-list” system designs.
In order to render a 3D scene (from its underlying graphical display lists) and produce high-resolution graphical projections for display on a display device, such as a LCD panel, early 3D graphics systems attempted to relieve the host CPU of computational loading by employing a single graphics pipeline comprising a single graphics processing unit (GPU), supported by video memory.
As shown in FIG. 1A, a typical PC based graphic architecture has an external graphics card (105). The main components of the graphics card (105) are the graphics processing unit (GPU) and video memory, as shown. As shown, the graphic card is connected to the display (106) on one side, and the CPU (101) through bus (e.g. PCIExpress) (107) and Memory Bridge (103, termed also “chipset”, e.g. 975 by Intel), on the other side.
FIG. 1B illustrates a rendering of three successive frames by a single GPU. The application, assisted by graphics library, creates a stream of graphics commands and data describing a 3D scene. The stream is pipelined through the GPU's geometry and pixel subsystems to create a bitmap of pixels in the Frame Buffer, and finally displayed on a display screen. A sequence of successive frames generates a visual illusion of a dynamic picture.
As shown in FIG. 1B, the structure of a GPU subsystem on a graphic card comprises: a video memory which is external to GPU, and two 3D engines: (i) a transform bound geometry subsystem (224) for processing 3D graphics primitives; (ii) and a fill bound pixel subsystem (225). The video memory shares its storage resources among geometry buffer (222) through which all geometric (i.e. polygonal) data is transferred, commands buffer, texture buffers (223), and Frame Buffer (226).
Limitations of a single graphics pipeline rise from its typical bottlenecks. The first potential bottleneck (221) stems from transferring data from CPU to GPU. Two other bottlenecks are video memory related: geometry data memory limits (222), and texture data memory limits (223). There are two additional bottlenecks inside the GPU: transform bound (224) in the geometry subsystem, and fragment rendering (225) in pixel subsystem. These bottlenecks determine overall throughput. In general, the bottlenecks vary over the course of a graphics application.
In high-performance graphics applications, the number of computations required to render a 3D scene and produce high-resolution graphical projections, greatly exceeds the capabilities of systems employing a single GPU graphics subsystem. Consequently, the use of parallel graphics pipelines, and multiple graphics processing units (GPUs), have become the rule for high-performance graphics system architecture and design, in order to relieve the overload presented by the different bottlenecks associated with single GPU graphics subsystems.
In FIG. 2A, there is shown an advanced chipset (e.g. Bearlake by Intel) having two buses (107, 108) instead of one, and allowing the interconnection of two external graphics cards in parallel: primary card (105) and secondary card (104), to share the computation load associated with the 3D graphics rendering process. As shown, the display (106) is attached to the primary card (105). It is anticipated that even more advanced commercial chipsets with >2 busses will appear in the future, allowing the interconnection of more than two graphic cards.
As shown in FIG. 2B, the general software architecture of prior art graphic system (200) comprises: the graphics application (201), standard graphics library (202), and vendor's GPU driver (203). This graphic software environment resides in the “program space” of main memory (102) on the host computer system. As shown, the graphic application (201) runs in the program space, building up the 3D scene, typically as a data base of polygons, each polygon being represented as a set of vertices. The vertices and others components of these polygons are transferred to the graphic card(s) for rendering, and displayed as a 2D image, on the display screen.
In FIG. 2C, the structure of a GPU subsystem on the graphics card is shown as comprising: a video memory disposed external to the GPU, and two 3D engines: (i) a transform bound geometry subsystem (224) for processing 3D graphics primitives; and (ii) a fill bound pixel subsystem (225). The video memory shares its storage resources among geometry buffer (222), through which all geometric (i.e. polygonal) data is transferred to the commands buffer, texture buffers (223), and Frame Buffer FB (226).
As shown in FIG. 2C, the division of graphics data among GPUs reduces (i) the bottleneck (222) posed by the video memory footprint at each GPU, (ii) the transform bound processing bottleneck (224), and (iii) the fill bound processing bottleneck (225).
However, when using a multiple GPU graphics architecture of the type shown in FIGS. 2A through 2C, there is a need to distribute the computational workload associated with interactive parallel graphics rendering processes. To achieve this objective, two different kind of parallel rendering methods have been applied to PC-based dual GPU graphics systems of the kind illustrated in FIGS. 2A through 2C, namely: the Time Division Method of Parallel Graphics Rendering illustrated in FIG. 2D; and the Image Division Method of Parallel Graphics Rendering illustrated in FIG. 2E.
Notably, a third type of method of parallel graphics rendering, referred to as the Object Division Method, has been developed over the years and practiced exclusively on complex computing platforms requiring complex and expensive hardware platforms for compositing the pixel output of the multiple graphics pipelines. The Object Division Method, illustrated in FIG. 3A, can be found applied on conventional graphics platforms of the kind shown in FIG. 3, as well as specialized graphics computing platforms as described in US Patent Application Publication No. US 2002/0015055, assigned to Silicon Graphics, Inc. (SGI), published on Feb. 7, 2002, and incorporated herein by reference.
While the differences between the Image, Frame and Object Division Methods of Parallel Graphics Rendering will be described below, it will be helpful to first briefly describe the five (5) basic stages or phases of the parallel rendering process, which all three such methods have in common, namely:
(1) the Decomposition Phase, wherein the 3D scene or object is analyzed and its corresponding graphics display list data and commands are assigned to particular graphics pipelines available on the parallel multiple GPU-based graphics platform;
(2) the Distribution Phase, wherein the graphics display list data and commands are distributed to particular available graphics pipelines determined during the Decomposition Phase;
(3) the Rendering Phase, wherein the geometry processing subsystem/engine and the pixel processing subsystem/engine along each graphics pipeline of the parallel graphics platform uses the graphics display list data and commands distributed to its pipeline, and transforms (i.e. projects, shades and colors) the display-list objects and bit-mapped textures into a subset of unstructured matrix of pixels;
(4) the Recomposition Phase, wherein the parallel graphics platform uses the multiple sets of pixel data generated by each graphics pipeline to synthesize (or compose) a final set of pixels that are representative of the 3D scene (taken along the specified viewing direction), and this final set of pixel data is then stored in a frame buffer; and
(5) the Display Phase, wherein the final set of pixel data retrieved from the frame buffer; and provided to the screen of the device of the system. As will be explained below with reference to FIGS. 3B through 3D, each of these methods of parallel graphics rendering has both advantages and disadvantages.
Image Division Method of Parallel Graphics Rendering
As illustrated in FIG. 2D, the Image Division (Sort-First) Method of Parallel Graphics Rendering distributes all graphics display list data and commands to each of the graphics pipelines, and decomposes the final view (i.e. projected 2D image) in Screen Space, so that, each graphical contributor (e.g. graphics pipeline and GPU) renders a 2D tile of the final view. This mode has a limited scalability due to the parallel overhead caused by objects rendered on multiple tiles. There are two image domain modes, all well known in prior art. They differ by the way the final image is divided among GPUs.
(1) The Split Frame Rendering mode divides up the screen among GPUs by continuous segments. e.g. two GPUs each one handles about one half of the screen. The exact division may change dynamically due to changing load across the screen image. This method is used inVidia's SLI™ multiple-GPU graphics product.
(2) Tiled Frame Rendering mode divides up the image into small tiles. Each GPU is assigned tiles that are spread out across the screen, contributing to good load balancing. This method is implemented by ATI's Crossfire™ multiple GPU graphics card solution.
In image division, the entire database is broadcast to each GPU for geometric processing. However, the processing load at each Pixel Subsystem is reduced to about 1/N. This way of parallelism relieves the fill bound bottleneck (225). Thus, the image division method ideally suits graphics applications requiring intensive pixel processing.
Time Division (DPlex) Method of Parallel Graphics Rendering
As illustrated in FIG. 2F, the Time Division (DPlex) Method of Parallel Graphics Rendering distributes all display list graphics data and commands associated with a first scene to the first graphics pipeline, and all graphics display list data and commands associated with a second/subsequent scene to the second graphics pipeline, so that each graphics pipeline (and its individual rendering node or GPU) handles the processing of a full, alternating image frame. Notably, while this method scales very well, the latency between user input and final display increases with scale, which is often irritating for the user. Each GPU is give extra time of N time frames (for N parallel GPUs) to process a frame. Referring to FIG. 3, the released bottlenecks are those of transform bound (224) at geometry subsystem, and fill bound (225) at pixel subsystem. Though, with large data sets, each GPU must access all of the data. This requires either maintaining multiple copies of large data sets or creating possible access conflicts to the source copy at the host swelling up the video memory bottlenecks (222, 223) and data transfer bottleneck (221).
Object Division (Sort-Last) Method of Parallel Graphics Rendering
As illustrated in FIG. 3B, the Object Division (Sort-last) Method of Parallel Graphics Rendering decomposes the 3D scene (i.e. rendered database) and distributes graphics display list data and commands associated with a portion of the scene to the particular graphics pipeline (i.e. rendering unit), and recombines the partially rendered pixel frames, during recomposition. The geometric database is therefore shared among GPUs, offloading the geometry buffer and geometry subsystem, and even to some extend the pixel subsystem. The main concern is how to divide the data in order to keep load balance. An exemplary multiple-GPU platform of FIG. 3B for supporting the object-division method is shown in FIG. 3A. The platform requires complex and costly pixel compositing hardware which prevents its current application in a modern PC-based computer architecture.
Today, real-time graphics applications, such as advanced video games, are more demanding than ever, utilizing massive textures, abundance of polygons, high depth-complexity, anti-aliasing, multipass rendering, etc., with such robustness growing exponentially over time.
Clearly, conventional PC-based graphics system fail to address the dynamically changing needs of modern graphics applications. By their vary nature, prior art PC-based graphics systems are unable to resolve the variety of bottlenecks that dynamically arise along graphics applications. Consequently, such prior art graphics systems are often unable to maintain a high and steady level of performance throughout a particular graphics application.
Indeed, a given pipeline along a parallel graphics system is only as strong as the weakest link of it stages, and thus a single bottleneck determines the overall throughput along the graphics pipelines, resulting in unstable frame-rate, poor scalability, and poor performance.
While each parallelization mode described above solves only part of the bottleneck dilemma, currently existing along the PC-based graphics pipelines, no one parallelization method, in and of itself, is sufficient to resolve all bottlenecks in demanding graphics applications.
Thus, there is a great need in the art for a new and improved way of and means for practicing parallel 3D graphics rendering processes in modern multiple-GPU based computer graphics systems, while avoiding the shortcomings and drawbacks of such prior art methodologies and apparatus.