The demand for ever higher performance in computer graphics has lead to the continued development of more and more powerful graphics processing subsystems and graphics processing units (GPUs). However, it may be desirable to achieve performance increases by modifying and/or otherwise utilizing existing graphics subsystems and GPUs. For example, it may be more cost effective to obtain performance increases by utilizing existing equipment, instead of developing new equipment. As another example, development time associated with obtaining performance increases by utilizing existing equipment may be significantly less, as compared to designing and building new equipment. Moreover, techniques for increasing performance utilizing existing equipment may be applied to newer, more powerful graphics equipment when it become available, to achieve further increases in performance.
On approach for obtaining performance gains by modifying or otherwise utilizing existing graphics equipment relates to the use of multiple GPUs to distribute the processing of images that would otherwise be processed using a single GPU. While the use of multiple GPUs to distribute processing load and thereby increase overall performance is a theoretically appealing approach, a wide variety of challenges must be overcome in order to effectively implement such a system. To better illustrate the context of the present invention, description of a typical computer system employing a graphics processing subsystem and a GPU is provided below.
FIG. 1 is a block diagram of a computer system 100 that includes a central processing unit (CPU) 102 and a system memory 104 communicating via a bus 106. User input is received from one or more user input devices 108 (e.g., keyboard, mouse) coupled to bus 106. Visual output is provided on a pixel based display device 110 (e.g., a conventional CRT or LCD based monitor) operating under control of a graphics processing subsystem 112 coupled to system bus 106. A system disk 107 and other components, such as one or more removable storage devices 109 (e.g., floppy disk drive, compact disk (CD) drive, and/or DVD drive), may also be coupled to system bus 106. System bus 106 may be implemented using one or more of various bus protocols including PCI (Peripheral Component Interconnect), AGP (Advanced Graphics Processing) and/or PCI-Express (PCI-E); appropriate “bridge” chips such as a north bridge and south bridge (not shown) may be provided to interconnect various components and/or buses.
Graphics processing subsystem 112 includes a graphics processing unit (GPU) 114 and a graphics memory 116, which may be implemented, e.g., using one or more integrated circuit devices such as programmable processors, application specific integrated circuits (ASICs), and memory devices. GPU 114 includes a rendering module 120, a memory interface module 122, and a scanout module 124. Rendering module 120 may be configured to perform various tasks related to generating pixel data from graphics data supplied via system bus 106 (e.g., implementing various 2-D and or 3-D rendering algorithms), interacting with graphics memory 116 to store and update pixel data, and the like. Rendering module 120 is advantageously configured to generate pixel data from 2-D or 3-D scene data provided by various programs executing on CPU 102. Operation of rendering module 120 is described further below.
Memory interface module 122, which communicates with rendering module 120 and scanout control logic 124, manages interactions with graphics memory 116. Memory interface module 122 may also include pathways for writing pixel data received from system bus 106 to graphics memory 116 without processing by rendering module 120. The particular configuration of memory interface module 122 may be varied as desired, and a detailed description is omitted as not being critical to understanding the present invention.
Graphics memory 116, which may be implemented using one or more integrated circuit memory devices of generally conventional design, may contain various physical or logical subdivisions, such as a pixel buffer 126 and a command buffer 128. Pixel buffer 126 stores pixel data for an image (or for a part of an image) that is read and processed by scanout module 124 and transmitted to display device 110 for display. This pixel data may be generated, e.g., from 2-D or 3-D scene data provided to rendering module 120 of GPU 114 via system bus 106 or generated by various processes executing on CPU 102 and provided to pixel buffer 126 via system bus 106. In some implementations, pixel buffer 126 can be double buffered so that while data for a first image is being read for display from a “front” buffer, data for a second image can be written to a “back” buffer without affecting the currently displayed image. Command buffer 128 is used to queue commands received via system bus 106 for execution by rendering module 120 and/or scanout module 124, as described below. Other portions of graphics memory 116 may be used to store data required by GPU 114 (such as texture data, color lookup tables, etc.), executable program code for GPU 114 and so on.
Scanout module 124, which may be integrated in a single chip with GPU 114 or implemented in a separate chip, reads pixel color data from pixel buffer 118 and transfers the data to display device 110 to be displayed. In one implementation, scanout module 124 operates isochronously, scanning out frames of pixel data at a prescribed refresh rate (e.g., 80 Hz) regardless of any other activity that may be occurring in GPU 114 or elsewhere in system 100. Thus, the same pixel data corresponding to a particular image may be repeatedly scanned out at the prescribed refresh rate. The refresh rate can be a user selectable parameter, and the scanout order may be varied as appropriate to the display format (e.g., interlaced or progressive scan). Scanout module 124 may also perform other operations, such as adjusting color values for particular display hardware and/or generating composite screen images by combining the pixel data from pixel buffer 126 with data for a video or cursor overlay image or the like, which may be obtained, e.g., from graphics memory 116, system memory 104, or another data source (not shown). Operation of scanout module 124 is described further below.
During operation of system 100, CPU 102 executes various programs that are (temporarily) resident in system memory 104. These programs may include one or more operating system (OS) programs 132, one or more application programs 134, and one or more driver programs 136 for graphics processing subsystem 112. It is to be understood that, although these programs are shown as residing in system memory 104, the invention is not limited to any particular mechanism for supplying program instructions for execution by CPU 102. For instance, at any given time some or all of the program instructions for any of these programs may be present within CPU 102 (e.g., in an on-chip instruction cache and/or various buffers and registers), in a page file or memory mapped file on system disk 128, and/or in other storage space.
Operating system programs 132 and/or application programs 134 may be of conventional design. An application program 134 may be, for instance, a video game program that generates graphics data and invokes appropriate rendering functions of GPU 114 (e.g., rendering module 120) to transform the graphics data to pixel data. Another application program 134 may generate pixel data and provide the pixel data to graphics processing subsystem 112 for display. It is to be understood that any number of application programs that generate pixel and/or graphics data may be executing concurrently on CPU 102. Operating system programs 132 (e.g., the Graphical Device Interface (GDI) component of the Microsoft Windows operating system) may also generate pixel and/or graphics data to be processed by graphics card 112.
Driver program 136 enables communication with graphics processing subsystem 112, including both rendering module 120 and scanout module 124. Driver program 136 advantageously implements one or more standard application program interfaces (APIs), such as Open GL, Microsoft DirectX, or D3D for communication with graphics processing subsystem 112; any number or combination of APIs may be supported, and in some implementations, separate driver programs 136 are provided to implement different APIs. By invoking appropriate API function calls, operating system programs 132 and/or application programs 134 are able to instruct driver program 136 to transfer geometry data or pixel data to graphics card 112 via system bus 106, to control operations of rendering module 120, to modify state parameters for scanout module 124 and so on. The specific commands and/or data transmitted to graphics card 112 by driver program 136 in response to an API function call may vary depending on the implementation of GPU 114, and driver program 136 may also transmit commands and/or data implementing additional functionality (e.g., special visual effects) not controlled by operating system programs 132 or application programs 134.
In some implementations, command buffer 128 queues the commands received via system bus 106 for execution by GPU 114. More specifically, driver program 136 may write one or more command streams to command buffer 128. A command stream may include rendering commands, data, and/or state commands, directed to rendering module 120 and/or scanout module 124. In some implementations, command buffer 128 may include logically or physically separate sections for commands directed to rendering module 120 and commands directed to display pipeline 124; in other implementations, the commands may be intermixed in command buffer 128 and directed to the appropriate pipeline by suitable control circuitry within GPU 114.
Command buffer 128 (or each section thereof) is advantageously implemented as a first in, first out buffer (FIFO) that is written by CPU 102 and read by GPU 114. Reading and writing can occur asynchronously. In one implementation, CPU 102 periodically writes new commands and data to command buffer 128 at a location determined by a “put” pointer, which CPU 102 increments after each write. Asynchronously, GPU 114 may continuously read and process commands and data sets previously stored in command buffer 128. GPU 114 maintains a “get” pointer to identify the read location in command buffer 128, and the get pointer is incremented after each read. Provided that CPU 102 stays sufficiently far ahead of GPU 114, GPU 114 is able to render images without incurring idle time waiting for CPU 102. In some implementations, depending on the size of the command buffer and the complexity of a scene, CPU 102 may write commands and data sets for frames several frames ahead of a frame being rendered by GPU 114. Command buffer 128 may be of fixed size (e.g., 5 megabytes) and may be written and read in a wraparound fashion (e.g., after writing to the last location, CPU 102 may reset the “put” pointer to the first location).
In some implementations, execution of rendering commands by rendering module 120 and operation of scanout module 124 need not occur sequentially. For example, where pixel buffer 126 is double buffered as mentioned previously, rendering module 120 can freely overwrite the back buffer while scanout module 124 reads from the front buffer. Thus, rendering module 120 may read and process commands as they are received. Flipping of the back and front buffers can be synchronized with the end of a scanout frame as is known in the art. For example, when rendering module 120 has completed a new image in the back buffer, operation of rendering module 120 may be paused until the end of scanout for the current frame, at which point the buffers may be flipped. Various techniques for implementing such synchronization features are known in the art, and a detailed description is omitted as not being critical to understanding the present invention.
The system described above is illustrative, and variations and modifications are possible. A GPU may be implemented using any suitable technologies, e.g., as one or more integrated circuit devices. The GPU may be mounted on an expansion card, mounted directly on a system motherboard, or integrated into a system chipset component (e.g., into the north bridge chip of one commonly used PC system architecture). The graphics processing subsystem may include any amount of dedicated graphics memory (some implementations may have no dedicated graphics memory) and may use system memory and dedicated graphics memory in any combination. In particular, the pixel buffer may be implemented in dedicated graphics memory or system memory as desired. The scanout circuitry may be integrated with a GPU or provided on a separate chip and may be implemented, e.g., using one or more ASICs, programmable processor elements, other integrated circuit technologies, or any combination thereof. In addition, GPUs embodying the present invention may be incorporated into a variety of devices, including general purpose computer systems, video game consoles and other special purpose computer systems, DVD players, handheld devices such as mobile phones or personal digital assistants, and so on.
While a modern GPU such as the one described above may efficiently process images with remarkable speed, there continues to be a demand for ever higher graphics performance. By using multiple GPUs to distribute processing load, overall performance may be significantly improved. However, implementation of a system employing multiple GPUs relates to significant challenges. Of particular concern is the coordination of the operations performed by various GPUs. The present invention provides innovative techniques related to the timing of GPU operations relevant in a multiple GPU system.