Field of the Invention
Over the past few decades, much of the research and development in the graphics architecture field has been concerned with ways to improve the performance of three-dimensional (3D) computer graphics rendering. Graphics architecture is driven by the same advances in semiconductor technology that have driven general-purpose computer architecture. Many of the same acceleration techniques have been used in this field, including pipelining and parallelism. The graphics rendering application, however, imposes special demands and makes available new opportunities. For example, since image display generally involves a large number of repetitive calculations, it can more easily exploit massive parallelism than can general-purpose computations.
In high-performance graphics systems, the number of computations far exceeds the capabilities of a single processing unit, so parallel systems have become the rule in graphics architectures. A very high level of parallelism is applied today in silicon-based graphics processing units (GPUs) to perform graphics computations. Typically these computations are performed by a graphics pipeline, supported by video memory, both of which are part of a graphic system.
FIG. 1A1 shows a conventional graphic system as part of a PC architecture, comprising: a CPU (111), system memory (112), chipset (113, 117), high speed CPU-GPU bus (114) (e.g. PCI Express ×16), video (graphic) card (115) based on a single GPU, and display (116). FIG. 1A2 shows prior art chipset 113 and 117 realized using Intel's chipsets 82915G (i.e. Graphics and Memory Controller Hub (GMCH), also called “North Bridge”) and ICH6, called the I/O hub. In FIG. 1A3, prior art chipset 113, 117 is realized using Intel's chipsets 82915PL (i.e. the Memory Controller Hub (MCH)) and ICH6× (i.e. the I/O hub).
In addition to driving the system memory (123), the GMCH 113′ provides an integrated graphics device (IGD) that is capable of driving up to three displays (116′, 116″, 116′″). Notably, the GMCH 113′ does not support a dedicated local graphics memory; instead it uses part of the system memory 112. The GMCH 113′ is also capable of supporting external graphics accelerators (115) via the PCI Express Graphics port, but such accelerators cannot work concurrently with the integrated graphics device (IGD). As shown in FIG. 1A3, the Memory Controller Hub (MCH) (i.e. 82915PL) 113″ supports external graphics (115, 116) only, and provides no integrated graphics device (IGD) support of the kind provided by GMCH 113′ in FIG. 1A2. Also, prior art Intel® chipsets 113′ and 113″ lack generic capabilities for driving the GPUs of other major vendors, and are unable to support Nvidia's SLI graphics cards.
As shown in FIG. 2A1, the single GPU graphic pipeline can be decomposed into two major components: a geometry subsystem for processing 3D graphics primitives (e.g. polygons); and a pixel subsystem for computing pixel values. These two components are consistently designed for increased parallelism. As shown in FIG. 2A2, the graphics pipeline of a prior art integrated graphics device (IGD) comprises: a memory controller for feeding a video engine, a 2D engine and a 3D engine, which feeds a display engine, which in turn feeds a Port Mux Controller along the way to an analog or digital display.
In the geometry subsystem, the graphics databases are regular, typically consisting of a large number of primitives that receive nearly identical processing; therefore the natural concurrency is to partition the data into separate streams and to process them independently. In the pixel subsystem, image parallelism has long been an attractive approach for high-speed rasterization architectures, since pixels can be generated in parallel in many ways. An example of a highly parallel Graphic Processing Unit chip (GPU) in prior art is depicted in FIG. 2B1 (taken from 3D Architecture White Paper, by ATI). The geometry subsystem consists of six (6) parallel pipes while the pixel subsystem has sixteen (16) parallel pipes.
However, as shown in FIG. 2B2, the “converge stage” 221 between these two subsystems is very problematic, as it must handle the full data stream bandwidth. In the pixel subsystem, the multiple streams of transformed and clipped primitives must be directed to the processors doing rasterization. This can require sorting primitives based on spatial information while different processors are assigned to different screen regions. A second difficulty in the parallel pixel stage is that the ordering of data may change as those data pass through parallel processors. For example, one processor may transform two small primitives before another processor transforms a single, large one. Certain global commands, such as commands to update one window instead of another, or to switch between double buffers, require that data be synchronized before and after the command. This converge stage between the geometry and pixel stages restricts the parallelism in a single GPU.
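By way of illustration only, the spatial sorting of primitives described above can be sketched as follows. The sketch assumes, hypothetically, that the screen is divided into equal-width vertical strips with one rasterizing processor per strip, and that a primitive is routed to every strip its bounding box overlaps. The function name and the strip layout are assumptions made for this example and are not taken from any particular prior art device.

```python
def regions_for_primitive(vertices, screen_width, num_regions):
    """Return indices of the vertical strips overlapped by a primitive.

    vertices: list of (x, y) screen-space points of one primitive.
    The screen is assumed (hypothetically) to be split into num_regions
    equal-width vertical strips, one per rasterizing processor.
    """
    strip_width = screen_width / num_regions
    xs = [x for x, _ in vertices]
    # Clamp the bounding-box extent to the valid strip indices.
    first = max(0, int(min(xs) // strip_width))
    last = min(num_regions - 1, int(max(xs) // strip_width))
    return list(range(first, last + 1))


# A small triangle lands in one strip; a wide one must be sent to several.
small = regions_for_primitive([(10, 10), (50, 20), (30, 60)], 800, 4)
wide = regions_for_primitive([(150, 0), (450, 0), (300, 100)], 800, 4)
```

A primitive that spans several strips is delivered to several processors, which is one reason the converge stage must carry the full data stream bandwidth.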
A typical technique for increasing the level of parallelism employs multiple GPU cards, or multiple GPU chips on a card, so that rendering performance is improved beyond the converge limitation of a single-core GPU. This technique is practiced today in several academic research projects (e.g. the Chromium parallel graphics system by Stanford University) and commercial products (e.g. SLI, a dual GPU system by Nvidia, and Crossfire, a dual GPU system by ATI). FIG. 3 shows a commercial dual GPU system, Asus A8N-SLI, based on Nvidia SLI technology.
Parallelization is capable of increasing performance by releasing bottlenecks in graphic systems. FIG. 2C indicates typical bottlenecks in a graphic pipeline, which breaks down into segmented stages of bus transfer, geometric processing and fragment fill bound processing. A given pipeline is only as strong as the weakest of the above stages; thus the main bottleneck determines overall throughput. As indicated in FIG. 2C, pipeline bottlenecks stem from: (231) geometry, texture, animation and meta data transfer; (232) geometry data memory limits; (233) texture data memory limits; (234) geometry transformations; and (235) fragment rendering.
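The weakest-link relationship described above can be illustrated with a minimal sketch in which the sustained throughput of a pipeline equals the rate of its slowest stage. The stage names follow the three stages named above; the rates are hypothetical figures chosen only for the example.

```python
def pipeline_throughput(stage_rates):
    """Return (bottleneck_stage, sustained_rate) for a serial pipeline.

    stage_rates maps a stage name to the rate that stage could sustain
    on its own (here in frames per second). The pipeline as a whole can
    run no faster than its slowest stage.
    """
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


# Hypothetical per-stage rates, in frames per second.
rates = {"bus_transfer": 240.0, "geometry": 90.0, "fragment_fill": 150.0}
bottleneck, fps = pipeline_throughput(rates)
```

In this example the geometry stage limits the pipeline, so speeding up fragment fill alone would not raise overall throughput.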
There are different ways to parallelize the GPUs, such as: time-division (each GPU renders the next successive frame); image-division (each GPU renders a subset of the pixels of each frame); and object-division (each GPU renders a subset of the whole data, including geometry and textures), as well as derivatives and combinations thereof. Although promising, this approach of parallelizing a cluster of GPU chips suffers from some inherent problems, such as: restricted bandwidth of inter-GPU communication; mechanical complexity (e.g. size, power, and heat); redundancy of components; and high cost.
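The three modes of parallelization named above can be contrasted with a short sketch of how work might be assigned to GPUs in each case. The round-robin and horizontal-band policies shown here are assumptions made for illustration; actual multi-GPU systems may partition work differently.

```python
def time_division(frame_index, num_gpus):
    """Time-division: successive frames go to successive GPUs."""
    return frame_index % num_gpus


def image_division(frame_height, num_gpus):
    """Image-division: split each frame into horizontal bands.

    Returns one (first_row, end_row) pair per GPU, end_row exclusive.
    Leftover rows are spread over the first GPUs.
    """
    base, extra = divmod(frame_height, num_gpus)
    bands, start = [], 0
    for gpu in range(num_gpus):
        rows = base + (1 if gpu < extra else 0)
        bands.append((start, start + rows))
        start += rows
    return bands


def object_division(objects, num_gpus):
    """Object-division: each GPU receives a subset of the scene data."""
    return [objects[gpu::num_gpus] for gpu in range(num_gpus)]


gpu_for_frame_5 = time_division(5, 2)
bands = image_division(1080, 4)
subsets = object_division(["a", "b", "c", "d", "e"], 2)
```

In the image-division and object-division cases, the partial results of the GPUs must still be combined into one frame, which is where the inter-GPU communication bandwidth noted above becomes a constraint.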
Thus, there is a great need in the art for an improved method of and apparatus for high-speed graphics processing and display, which avoids the shortcomings and drawbacks of such prior art apparatus and methodologies.