Field of the Invention
Over the past few decades, much of the research and development in the graphics architecture field has been concerned the ways to improve the performance of three-dimensional (3D) computer graphics rendering. Graphics architecture is driven by the same advances in semiconductor technology that have driven general-purpose computer architecture. Many of the same acceleration techniques have been used in this field, including pipelining and parallelism. The graphics rendering application, however, imposes special demands and makes available new opportunities. For example, since image display generally involves a large number of repetitive calculations, it can more easily exploit massive parallelism than can general-purpose computations.
In high-performance graphics systems, the number of computations highly exceeds the capabilities of a single processing unit, so parallel systems have become the rule of graphics architectures. A very high-level of parallelism is applied today in silicon-based graphics processing units (GPU), to perform graphics computations.
Typically these computations are performed by graphics pipeline, supported by video memory, which are part of a graphic system. FIG. 1A shows a block diagram of a conventional graphic system as part of PC architecture, comprising of CPU (111), system memory (112), I/O chipset (113), high speed CPU-GPU bus (114) (e.g. PCI express 16×), video (graphic) card (115) based on a single GPU, and display (116). The single GPU graphic pipeline, as shown in FIG. 1B, decomposes into two major parts: a geometry subsystem for processing 3D graphics primitives (e.g. polygons) and a pixel subsystem for computing pixel values. These two parts are consistently designed for increased parallelism.
In the geometry subsystem, the graphics databases are regular, typically consisting of a large number of primitives that receive nearly identical processing; therefore the natural concurrency is to partition the data into separate streams and to process them independently. In the pixel subsystem, image parallelism has long been an attractive approach for high-speed rasterization architectures, since pixels can be generated in parallel in many ways. An example of a highly parallel Graphic Processing Unit chip (GPU) in prior art is depicted in FIG. 2A (taken from 3D Architecture White Paper, by ATI). The geometry subsystem consists of six (6) parallel pipes while the pixel subsystem has sixteen (16) parallel pipes.
However, as shown in FIG. 2B, the “converge stage” 221 between these two subsystems is very problematic as it must handle the full data stream bandwidth. In the pixel subsystem, the multiple streams of transformed and clipped primitives must be directed to the processors doing rasterization. This can require sorting primitives based on spatial information while different processors are assigned to different screen regions. A second difficulty in the parallel pixel stage is that ordering of data may change as those data pass through parallel processors. For example, one processor may transform two small primitives before another processor transforms a single, large one. Certain global commands, such as commands to update one window instead of another, or to switch between double buffers, require that data be synchronized before and after command. This converge stage between the geometry and pixel stages, restricts the parallelism in a single GPU.
A typical technology increasing the level of parallelism employs multiple GPU-cards, or multiple GPU chips on a card, where the rendering performance is additionally improved, beyond the converge limitation in a single core GPU. This technique is practiced today by several academic researches (e.g. Chromium parallel graphics system by Stanford University) and commercial products (e.g. SLI—a dual GPU system by Nvidia, Crossfire—a dual GPU by ATI). FIG. 3 shows a commercial dual GPU system, Asus A8N-SLI, based on Nvidia SLI technology.
Parallelization is capable of increasing performance by releasing bottlenecks in graphic systems. FIG. 2C indicates typical bottlenecks in a graphic pipeline that breaks-down into segmented stages of bus transfer, geometric processing and fragment fill bound processing. A given pipeline is only as strong as the weakest link of one of the above stages, thus the main bottleneck determines overall throughput. As indicated in FIG. 2C, pipeline bottlenecks stem from: (231) geometry, texture, animation and meta data transfer, (232) geometry data memory limits, (233) texture data memory limits, (234) geometry transformations, and (235) fragment rendering.
There are different ways to parallelize the GPUs, such as: time-division (each GPU renders the next successive frame); image-division (each GPU renders a subset of the pixels of each frame); and object-division (each GPU renders a subset of the whole data, including geometry and textures), and derivatives and combinations of thereof. Although promising, this approach of parallelizing cluster of GPU chips suffers from some inherent problems, such as: restricted bandwidth of inter-GPU communication; mechanical complexity (e.g. size, power, and heat); redundancy of components; and high cost.
Thus, there is a great need in the art for an improved method of and apparatus for high-speed graphics processing and display, which avoids the shortcomings and drawbacks of such prior art apparatus and methodologies.