1. Field of the Invention
This invention relates to the field of computer graphics, specifically 3d graphics hardware accelerators.
2. Description of the Related Art
Most conventional general purpose computers have some form of hardware sub-system that can couple information stored or computed within the computer to some form of physical image display device as interactive visual feedback to the human user(s). While decades ago these physical image display devices and the special electronics that coupled the computer to them were very primitive, e.g., blinking lights, “glass ttys”, or oscilloscopes, over time the sophistication has grown to the point where the hardware sub-systems, or graphics systems, dedicated to driving the physical image display devices are quite complex, specialized computational systems in their own right. Indeed, many of the current “graphics chips” that are used to build conventional graphics systems contain more transistors than the powerful single chip CPUs in the general purpose computers themselves.
Specifically, a graphics system does more than connect a host computer to a physical image display device. It also offloads from the host computer more and more complex rendering operations, including both 2d rendering and 3d rendering. A hardware accelerator dedicated to a specialized task will usually have a performance and/or price advantage over performing the same task entirely in software on a general purpose computer. This, of course, assumes that there is sufficient customer demand for frequently performing the specialized task, which is the case for 2d and 3d computer graphics in many market segments, including both industrial and consumer home entertainment.
While early graphics systems might only take on the simple job of drawing 2d lines or text, more advanced high performance graphics systems are responsible for taking high level representations of three dimensional objects from the host computer, and performing much of the job of approximately computing a simulation of how photons in the real world would illuminate the group of objects, and how images of these objects would be formed within the image plane of a physical camera, or the physical human eye. In other words, modern graphics systems are capable of performing 3d rendering. Thus, rather than the generic term “graphics systems” they will be referred to as “3d graphics hardware accelerators”. A final synthetic “image plane” becomes the video output signal that is sent from the 3d graphics hardware accelerator to various physical image display devices for viewing by the human user(s). These physical image display devices include, but are not restricted to: direct view CRTs, direct view LCD panels, direct view plasma panels, direct view electroluminescent displays, LED based displays, CRT based projectors, LCD based projectors, LCOS based projectors, DMD based projectors, laser based projectors, as well as head mounted displays (HMDs).
The recent pace of development of more and more powerful 3d graphics hardware accelerators has spurred the need to continuously develop new architectural concepts to build 3d graphics hardware accelerators capable of generating much richer images of 3d objects than was possible with previous architectural concepts. The architectural concepts that were used to build the then highest performance 3d graphics hardware accelerators may no longer apply when new building blocks based on ever more powerful semiconductor chips are to be used even a few years later. At the same time, given the increasing costs of developing individual chips, it is also desirable to find 3d graphics hardware accelerator architectures that are highly scalable, that is, architectures that allow a wide range of commercially viable products at many different price/performance points to be constructed from the same small set of chips.
Two features in particular that are highly desirable to support in the next decade's worth of high performance 3d graphics hardware accelerator products are fully programmable shading and high quality antialiasing. High quality antialiasing produces more realistic looking images by reducing or eliminating the so-called “jaggies” produced by most current 3d graphics hardware accelerators. To achieve this high quality, the 3d graphics hardware accelerator must be able to support more complex frame buffers, in which a large number of samples must be kept for each pixel in an image that is being rendered. The architecture must also support powerful antialiasing filtering of these samples at some point before the video output signal is generated.
Most conventional 3d graphics hardware accelerators for real-time interaction either provide no support for keeping multiple samples per pixel, or support only very limited sample densities, e.g., 2 or 4, and occasionally 8. These systems also support only the most limited forms of antialiasing filtering of these samples during video output signal generation. For example, generally the antialiasing filter is limited to only a one pixel by one pixel box filter. For future systems, it is highly beneficial to support 16 samples per pixel, and 32, 48, or even 64 samples per pixel or more in advanced cases. These sample densities must be supported not only for low resolution video signal formats, e.g., NTSC, but also for high definition resolution formats, e.g., HDTV and 2 megapixel computer video signal formats. The desired signal processing is to support at least four pixel by four pixel cubic filter antialiasing filters with negative lobes, and larger area antialiasing filters, e.g., eight by eight pixels or more, in advanced cases.
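The difference between a one pixel by one pixel box filter and a four pixel by four pixel cubic filter with negative lobes can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: the Mitchell-Netravali cubic (with B = C = 1/3) is chosen here as one well known cubic with negative lobes, not because the text above names it, and the regular 4 by 4 sample grid (16 samples per pixel), the function names, and the test image are all hypothetical.

```python
def box_weight(x):
    """1x1 pixel box filter: samples within half a pixel of the center count equally."""
    return 1.0 if abs(x) <= 0.5 else 0.0


def cubic_weight(x, b=1.0 / 3.0, c=1.0 / 3.0):
    """Mitchell-Netravali cubic (assumed choice). Support is 2 pixels on each side,
    so the separable 2d filter covers a 4x4 pixel area and has negative lobes."""
    x = abs(x)
    if x < 1.0:
        return ((12 - 9 * b - 6 * c) * x**3
                + (-18 + 12 * b + 6 * c) * x**2
                + (6 - 2 * b)) / 6.0
    if x < 2.0:
        return ((-b - 6 * c) * x**3
                + (6 * b + 30 * c) * x**2
                + (-12 * b - 48 * c) * x
                + (8 * b + 24 * c)) / 6.0
    return 0.0


def make_samples(width, height, n_per_axis, shade):
    """Place n_per_axis * n_per_axis samples in every pixel on a regular sub-grid
    (a simplification; real hardware would typically use irregular positions)."""
    samples = []
    for py in range(height):
        for px in range(width):
            for j in range(n_per_axis):
                for i in range(n_per_axis):
                    sx = px + (i + 0.5) / n_per_axis
                    sy = py + (j + 0.5) / n_per_axis
                    samples.append((sx, sy, shade(sx, sy)))
    return samples


def filter_pixel(samples, cx, cy, weight):
    """Normalized, separable weighted sum of samples around pixel center (cx, cy)."""
    num = 0.0
    den = 0.0
    for sx, sy, value in samples:
        w = weight(sx - cx) * weight(sy - cy)
        num += w * value
        den += w
    return num / den if den != 0.0 else 0.0


# Hypothetical test image: a vertical edge at x = 4.3 in an 8x8 pixel region.
edge = make_samples(8, 8, 4, lambda x, y: 1.0 if x < 4.3 else 0.0)
box_value = filter_pixel(edge, 4.5, 4.5, box_weight)
cubic_value = filter_pixel(edge, 4.5, 4.5, cubic_weight)
print("box filter:", box_value)
print("cubic filter:", cubic_value)
```

The box filter looks only at the 16 samples inside the pixel, while the cubic filter draws on samples from the surrounding 4 by 4 pixel area, some with negative weights. This is why the wider filter needs the video output path to read many more samples per output pixel.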
Programmable shading is a technique used for decades by 3d software rendering systems, where a general purpose computer works for hours or days at a time to produce a single final rendered image. These are the systems that produce the most realistic 3d computer graphics images, and whose use is now essential in the creation of special effects of many movies. The idea is that while much of the so-called “graphics pipeline” has fixed functionality that cannot be modified, at certain “key” points in the pipeline there is the option for application specific graphics algorithms to be used. This supports more realism in the final rendered image. For example, for disaster training of police, firefighters, and paramedics, it can be very important to accurately model the effects of smoke and dust in reducing visibility for emergency workers during training scenarios. Programmable shaders have emerged as a good technique for customizing the visual output of 3d graphics hardware accelerators.
Conventional 3d graphics hardware accelerators for real-time interaction have only just started to provide very limited support for programmable shading. The most sophisticated 3d graphics hardware accelerator chip on the market today can only support eight instruction steps at the most important point in the graphics pipeline, the pixel shader, and does not allow any conditional instruction steps. This is nowhere near sufficient to give end-users the flexibility and quality they want. For future systems, it is highly desirable to be able to support much more general programmable shaders, e.g., on the order of hundreds to thousands of instruction steps, as well as conditional steps.
In conventional low-end 3d graphics hardware accelerators, e.g., those mostly aimed at the consumer home gaming market, issues of system architecture are simplified by confining most of the 3d graphics hardware accelerator to a single chip. Within a chip, issues of buses and bandwidth are less critical than they are between multiple chips, and the overall algorithms used are kept simple. As a result, it has been possible to construct reasonably powerful systems at consumer market prices, albeit limited to only the processing power of a single low cost chip.
In mid range and high end 3d graphics hardware accelerators, e.g., those aimed at the professional markets of automobile and aircraft design, medical visualization, petrochemical visualization, general scientific visualization, flight simulation and training, digital content creation (animation and film editing), video broadcasting, etc., the customer requirements can only be met by building more complex 3d graphics hardware accelerators than will fit on a single chip, i.e., they have to use the computational power of large numbers of chips together in a system. Almost all conventional systems for this market have required a large number of different custom chip types to be built, and generally use multiple different custom interconnects or buses to connect these chips together to build a functioning system. These multiple interconnects or buses are expensive to build, in the cost of incremental pins on the chip's package, in the cost of wires and connectors on the printed circuit boards, and in the cost of designing and testing several different custom crafted interconnect bus protocols. Under normal operating conditions, only a few of these interconnects or buses are operating at their peak rate; the other buses are underutilized. Thus, much of the full aggregate bandwidth of these interconnects or buses is rarely if ever used, and potentially represents wasted product engineering and/or product costs.
The current low end of the 3d graphics hardware accelerator market is very price driven, as most of the market is for home consumer 3d video game applications. These 3d graphics hardware accelerators are either sold as sub $500 PC add-in cards, or as integral parts of sub $400 game consoles. To achieve the low parts costs implied by these price points, most of the 3d graphics hardware accelerator architectures for these markets consist of a single graphics accelerator ASIC, to which is attached a small number of DRAM chips. Other chips, if present, are general purpose processors or audio acceleration chips, and do not directly interface to the DRAM chips containing the frame buffer and texture memory. The best case 3d rendering performance of these single graphics accelerator ASIC based systems is constrained, as described before, by how much bandwidth is available for 3d rendering given the number of pins that can be attached to ASICs in this price range, and by the bandwidth of the DRAM chips that use no more than this number of pins to attach to the ASIC. In these systems, the same attached DRAMs are used for fetching 2d textures, rendering pixels (or samples), and fetching pixels to generate the video output signal through separate analog and/or digital video output interface pins on the same graphics accelerator ASIC.
The current middle range of the 3d graphics hardware accelerator market is still somewhat price sensitive, but is also more feature and performance sensitive. The prices for just the 3d graphics hardware accelerator add-in cards for professional PCs or workstations are in the $1800 to $6000 range. To achieve higher performance, the architecture of these 3d graphics hardware accelerators usually separates the set of DRAM chips used to store 2d and 3d textures from the set of DRAM chips that comprise the frame buffer proper. Because of the limits on how much bandwidth is available for graphics operations between the DRAMs used to store the 2d and 3d textures and a single 3d rendering ASIC, it is common in the mid range to duplicate the entire sub-system of the 3d rendering ASIC and the attached DRAMs. If this sub-system is duplicated n times, then n times more bandwidth to and from the textures is available for rendering. Here, clearly, the trade off of higher cost was accepted in order to obtain higher performance. The bandwidth to and from the frame buffer itself also may need to be higher than that which is supportable by the pins attached to a single ASIC. Several techniques to distribute the frame buffer access across several ASICs have been developed, so that no one ASIC needs to support more than a fraction of the total bandwidth to and from the frame buffer. Varied and complex techniques have been developed to make such multiple ASIC and memory sub-systems all work together to accelerate 3d rendering, and will not be covered in full detail here. The important point is that these architectures have all been driven by the need to distribute the bandwidth consumption of 3d rendering algorithms across multiple ASICs and DRAM local memory sub-systems. The resulting systems usually require several different expensive ASICs to be designed and fabricated.
These systems also generally produce just one product configuration; typically it is not possible to take the same ASICs (with no changes) and build a more expensive but faster product, or a slower but less expensive product.
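The two bandwidth distribution ideas described for mid range systems can be sketched in a few lines. This is a hypothetical illustration only: the text above does not specify how any real product partitions the frame buffer, so the 2d pixel interleave, the function names, and the example bandwidth figure are all assumptions made for the sketch.

```python
def aggregate_texture_bandwidth(per_subsystem_bandwidth, n):
    """Duplicating the rendering ASIC plus its texture DRAMs n times gives
    n times the texture bandwidth (each copy reads its own DRAMs)."""
    return per_subsystem_bandwidth * n


def owning_asic(x, y, asics_x, asics_y):
    """Assumed 2d interleave: pixel (x, y) belongs to one ASIC in an
    asics_x by asics_y repeating tile."""
    return (y % asics_y) * asics_x + (x % asics_x)


def pixels_per_asic(width, height, asics_x, asics_y):
    """Count how many frame buffer pixels each ASIC is responsible for."""
    counts = [0] * (asics_x * asics_y)
    for y in range(height):
        for x in range(width):
            counts[owning_asic(x, y, asics_x, asics_y)] += 1
    return counts


# Hypothetical configuration: 8 ASICs in a 4x2 interleave over a 640x480 frame.
counts = pixels_per_asic(640, 480, 4, 2)
print(counts)  # each ASIC owns 38400 pixels, one eighth of the frame

# Hypothetical figure: 2 units of texture bandwidth per sub-system, 4 copies.
print(aggregate_texture_bandwidth(2, 4))  # 8
```

With an even interleave, each ASIC handles roughly 1/n of the frame buffer traffic, which matches the stated goal that no one ASIC supports more than a fraction of the total frame buffer bandwidth. The cost is that texture data is replicated in every sub-system.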
The current high end of the 3d graphics hardware accelerator market is much more performance and feature driven than price driven. The prices can range from $6000 (the top of the mid-range) to several hundred thousand dollars for the most powerful 3d graphics hardware accelerators. The architectures of the high end systems are related to those of the mid range systems. The same techniques of applying more ASICs and DRAMs in parallel are used, but in more extreme ways. Given the similarity, there is no need to explicitly describe existing high end systems in any more detail here.
While many measures of performance still need to improve in 3d graphics, the desired rendering frame rates are maxing out at 76 Hz, the desired resolutions are maxing out at 1920×1200, depth complexity is only slowly growing past 6, and sample densities will likely stop growing at 16. What this means is that pixel fill rate is only slowly growing past 1 billion pixels per second (with a sample fill rate of 16 billion samples per second). So a scalable graphics architecture can treat pixel fill rate as a constant, rather than something to be scaled.
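The fill rate figures above can be checked with simple arithmetic. The sketch below assumes the usual relationship that pixel fill rate is resolution times frame rate times depth complexity, and that sample fill rate is pixel fill rate times sample density; the text implies but does not spell out this formula, and the function names are illustrative.

```python
def pixel_fill_rate(width, height, frame_rate_hz, depth_complexity):
    """Pixels written per second: every screen pixel is touched
    depth_complexity times per frame, on average."""
    return width * height * frame_rate_hz * depth_complexity


def sample_fill_rate(pixel_rate, samples_per_pixel):
    """Samples written per second for a given sample density."""
    return pixel_rate * samples_per_pixel


pixels_per_second = pixel_fill_rate(1920, 1200, 76, 6)
samples_per_second = sample_fill_rate(pixels_per_second, 16)

print(pixels_per_second)   # 1050624000, about 1 billion pixels per second
print(samples_per_second)  # 16809984000, about 16 billion samples per second
```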
Additionally, while frame buffer storage that cannot be written at a pixel fill rate of 6× the pixel rate of the video output signal format, and read out at the same 6× rate, is still unusable as frame buffer storage, it is not unusable for texture storage. Applications want all sorts of textures to be available for immediate use during rendering, but on any given frame only a small sub-set of the textures is actually accessed. So if a high end architecture can do what happened by coincidence in low end architectures, i.e., arrange to have both the texture storage and frame buffer storage in the same memory bank, DRAM could be used efficiently.