The present invention relates in general to memory management, and in particular to on-demand memory synchronization for a peripheral system with multiple parallel processors.
Graphics processing subsystems are designed to render realistic animated images in real time, e.g., at 30 or more frames per second. These subsystems are most often implemented on expansion cards that can be inserted into appropriately configured slots on a motherboard of a computer system and generally include one or more dedicated graphics processing units (GPUs) and dedicated graphics memory. The typical GPU is a highly complex integrated circuit device optimized to perform graphics computations (e.g., matrix transformations, scan-conversion and/or other rasterization techniques, texture blending, etc.), write the resulting pixels to the graphics memory, and deliver the pixels in real time to a display device. The GPU is a “slave” processor that operates in response to commands received from a driver program executing on a “master” processor, generally the central processing unit (CPU) of the system.
To meet the demands for realism and speed, some GPUs include more transistors than typical CPUs. In addition, graphics memories have become quite large in order to improve speed by reducing traffic on the system bus; some graphics cards now include as much as 256 MB of memory. But despite these advances, a demand for even greater realism and faster rendering persists.
As one approach to meeting this demand, some manufacturers have begun to develop “multi-chip” (or multi-processor) graphics processing subsystems in which two or more GPUs, usually on the same card, operate in parallel. Parallel operation substantially increases the number of rendering operations that can be carried out per second without requiring significant advances in GPU design. To minimize resource conflicts between the GPUs, each GPU is generally provided with its own dedicated memory area, including a display buffer to which the GPU writes pixel data it renders.
In a multi-chip system, two or more GPUs can be operated to render images cooperatively for the same display device; in this “distributed” rendering mode, rendering tasks are distributed among the GPUs. Tasks may be distributed in various ways. For example, in a “split frame rendering” (SFR) mode, each GPU is instructed to render pixel data for a different portion of the displayable image, such as a number of lines of a raster-based display. The image is displayed by scanning out the pixel data from each GPU's display buffer in an appropriate sequence. As another example, in an “alternate frame rendering” (AFR) mode, each GPU is instructed to render pixel data for a different image in a temporal sequence (e.g., different frames of an animated image such as a 3D video game). In this mode, a smooth animation speed of about 30 frames per second can be provided by two GPUs that each render images at 15 frames per second.
Multi-chip graphics systems present a variety of problems, among which is high demand for memory bandwidth. In a multiprocessor system, each GPU is usually provided with a separate memory in order to avoid memory access conflicts between the GPUs. Each memory must be loaded with the scene data (geometry, lighting, textures, etc.) and rendering commands for the image. In conventional systems, the CPU communicates rendering commands and data to each GPU (or its memory) separately over a CPU bus. Thus, the bandwidth required to operate multiple GPUs scales linearly with the number of GPUs. For complex scenes requiring extensive vertex data, operating even two GPUs in parallel would saturate the CPU bus, and no performance advantage would be obtained from having multiple GPUs. In addition, requiring the CPU to communicate the same rendering commands and data multiple times wastes CPU cycles.
One solution, described in above-referenced application Ser. No. 10/789,248, is to provide a “broadcast bridge” and memory-mapped I/O (input/output) operations. As is known in the art, in memory-mapped I/O, commands and data written by the CPU to a predetermined range of memory addresses are communicated to a peripheral device (e.g., the graphics subsystem), and read operations by the CPU that access the predetermined range of memory addresses result in reading data from the peripheral device. The broadcast bridge is configured to recognize a certain range of memory addresses as a “broadcast aperture” and to distribute to all of the GPUs any commands and data written by the CPU to an address in the broadcast aperture.
The broadcast bridge solves the problem of duplicative CPU communications, provided that suitable supporting hardware is available, e.g., in one or more of the GPUs, in the system chipset, or elsewhere in a communication path between the CPU and the GPUs. In some systems, however, suitable supporting hardware is not present, and an efficient technique for distributing data and commands to multiple GPUs in the absence of a broadcast bridge is therefore desirable.