1. Field of the Invention
The present invention relates generally to the field of computer graphics processing, and more particularly to the field of computer graphics command transport.
2. Description of Background Art
Recent advances in computer performance have enabled graphics systems to provide more realistic graphical images using personal computers and home video game computers. Such computers include a central processing unit (CPU), memory, and graphics processors. Application programs--such as games, CAD packages, and visualization systems, for example--are executed by the CPU, which communicates with the graphics processors through special software commonly referred to as "device drivers" (or "drivers") that direct the graphics processor to perform specific functions. The driver translates the application's abstract notion of a device function--such as drawing a shape, e.g., a triangle, on the screen at a particular location with particular attributes such as color and texture--into a sequence of words and stores these words from the CPU to the device over a data bus. These words are frequently sent over the data bus as a sequence of (address, data) pairs where the address indicates which device (and possibly a particular location inside of the device) the data is intended for. When the words that form a complete command have been received by the device, the device carries out (or "executes") the corresponding function. There are many different ways for drivers to transmit and for devices to accept (address, data) pairs representing commands and/or data. Complex devices such as three-dimensional (3D) graphics subsystems implement commands that require more than a single word of data to describe. For example, a 3D graphics device could implement a triangle rendering command that includes sending the following sequence of words:
______________________________________
Address    Data
______________________________________
A0         Draw Triangle command (DT)
A1         X coordinate of vertex #1 (X1)
A2         Y coordinate of vertex #1 (Y1)
A3         X coordinate of vertex #2 (X2)
A4         Y coordinate of vertex #2 (Y2)
A5         X coordinate of vertex #3 (X3)
A6         Y coordinate of vertex #3 (Y3)
______________________________________
For the graphics processor (GP), e.g., the graphics engine, to correctly execute this function, it must first recognize the (A0, DT) word. This word instructs the graphics processor to execute a draw triangle function using the three pairs of (X,Y) data words sent via addresses A1 through A6. The graphics processor must provide some way for the vertex data to be correctly interpreted--e.g., the graphics processor must not mistakenly use Y2 for the value of the first vertex's X coordinate even if out-of-order writes from the CPU 102 deliver Y2 ahead of either X1 or X2.
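The requirement that vertex data be interpreted by address rather than by arrival order can be illustrated with a minimal sketch. The base address value `A0 = 0x100`, the string `"DT"` standing in for the command word, and the helper names `triangle_words` and `decode_triangle` are all assumptions made for illustration and do not describe any particular device.

```python
# Hypothetical address value and command code, chosen only for illustration.
A0 = 0x100             # assumed address of the command word; A1..A6 follow it
DRAW_TRIANGLE = "DT"   # stand-in for the draw triangle command word


def triangle_words(vertices):
    """Return the seven (address, data) pairs for one draw triangle command."""
    words = [(A0, DRAW_TRIANGLE)]
    address = A0 + 1
    for x, y in vertices:
        words.append((address, x))
        words.append((address + 1, y))
        address += 2
    return words


def decode_triangle(words):
    """Recover the vertices by address, independent of arrival order."""
    by_address = dict(words)
    if by_address[A0] != DRAW_TRIANGLE:
        raise ValueError("not a draw triangle command")
    return [(by_address[A0 + 1 + 2 * i], by_address[A0 + 2 + 2 * i])
            for i in range(3)]


pairs = triangle_words([(10, 20), (30, 40), (50, 60)])
# Even if the pairs arrive in reverse order, decoding by address is unaffected.
vertices = decode_triangle(list(reversed(pairs)))
```

In this sketch the address alone determines which coordinate a data word represents, so Y2 cannot be mistaken for X1 regardless of the order of delivery.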
The particular address values used in the (address, data) pairs accepted by the graphics processor, together with the method by which the driver transmits these (address, data) pairs to the graphics processor, is one example of the command and data transport method of the graphics processor. Graphics processors typically use one of two different transport methods.
The first method is the fixed addresses/hidden buffer method. In this method of transport, data is sent to one of potentially multiple different addresses, depending on how the graphics processor is meant to interpret the data. For the above example, the graphics processor would have unique addresses assigned to each of the first vertex's x coordinate (V1X), the first vertex's y coordinate (V1Y), the second vertex's x coordinate (V2X), the second vertex's y coordinate (V2Y), the third vertex's x coordinate (V3X), the third vertex's y coordinate (V3Y), and a "command" address (CMD). These separate words of memory, each used for a unique purpose, are held on the graphics processor in storage locations called "registers." When the graphics processor receives a word in the command register whose data value indicates the draw triangle function, it immediately executes a draw triangle function using the data values already contained in the (V1X, V1Y, . . . , V3Y) registers. Thus, in the above example, the order in which the graphics processor receives the vertex words is unimportant. However, they must all be received before the command word, because the graphics processor will only use the values present in the vertex registers at the time the command word is received. For increased performance between CPU and graphics processor, the graphics processor may buffer the (address, data) pairs, e.g., in a first-in-first-out buffer, so that the graphics processor can simultaneously accept new commands from the CPU while executing a drawing command. Typically, this buffer is not directly accessible by the CPU.
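The behavior of the first method can be sketched with a toy software model. The class name `FixedAddressGP`, the use of register names as addresses, and the `drawn` list are assumptions made for illustration. The hidden first-in-first-out buffer is omitted for brevity.

```python
class FixedAddressGP:
    """Toy model of a graphics processor with one register per vertex word."""

    VERTEX_REGISTERS = ("V1X", "V1Y", "V2X", "V2Y", "V3X", "V3Y")

    def __init__(self):
        self.registers = {name: 0 for name in self.VERTEX_REGISTERS}
        self.drawn = []  # triangles executed so far

    def write(self, address, data):
        """Accept one (address, data) pair from the driver."""
        if address == "CMD":
            if data == "DT":
                # Execute immediately with whatever the vertex registers hold.
                r = self.registers
                self.drawn.append(((r["V1X"], r["V1Y"]),
                                   (r["V2X"], r["V2Y"]),
                                   (r["V3X"], r["V3Y"])))
        else:
            self.registers[address] = data


gp = FixedAddressGP()
# Vertex words may arrive in any order, as long as the command word is last.
for address, data in [("V3Y", 60), ("V1X", 10), ("V2Y", 40),
                      ("V1Y", 20), ("V3X", 50), ("V2X", 30)]:
    gp.write(address, data)
gp.write("CMD", "DT")
```

The model draws the intended triangle here only because the write to CMD is the final write. The vertex writes themselves are deliberately shuffled.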
The second method is the no fixed addresses/exposed buffer method. In this method, unlike the first method, no fixed addresses are used. For increased performance, the first method uses a memory buffer to buffer the commands, which allows the CPU to queue new commands into the front of the buffer while the graphics processor is simultaneously reading and processing commands from the back of the buffer. In the exposed memory buffer method, the entire buffer (that was hidden from direct CPU access in the first method) can be written to directly by the CPU. This buffer begins at some address A, and continues with addresses A+1, A+2, . . . , A+N-1, where "N" is the size in words of the buffer. The graphics processor interprets words in increasing order of address, reading each new command from the buffer address following the last word of the previous command. For example, if the previous command ended at address B, the graphics processor reads the words describing the next command starting at address B+1. Since the graphics processor may implement commands of differing word sizes, any given command could begin anywhere in the buffer. When the graphics processor reads the last word in the buffer, the graphics processor starts reading from the beginning of the buffer again. For example, after the word at address A+N-1 is read, the next address the graphics processor reads from is A, then A+1, A+2, etc. Thus, in the above example, the driver would transmit and the graphics processor would interpret the (address, data) pairs like this: (B, DT), (B+1, X1), (B+2, Y1), (B+3, X2), (B+4, Y2), (B+5, X3), (B+6, Y3), where B-1 was the address of the last word of the previous command. The graphics processor uses the first word (or words) of the command (B, DT) to determine the number and interpretation of the words that follow that make up the data for the command.
Typically, the CPU writes all of the words describing a command to the buffer, then directs the graphics processor to execute that command.
Both of these transport methods work reasonably well on CPUs that do not perform write reordering. On CPUs that do perform write reordering, however, both methods encounter problems. Some CPUs, e.g., the Pentium Pro Processor that is commercially available from Intel Corp., Santa Clara, Calif., use write reordering. A description of write reordering is now set forth.
In the course of running a software application program, the software running on the CPU issues instructions for the CPU to read and write data to various addresses in memory. These memory access instructions are issued in a particular order by the software, and on CPUs with no reordering these instructions are carried out in order. For example, if an application program issues these memory accesses (in top to bottom order):
write "1" to address A
write "2" to address B
then the CPU will first store the value "1" to address A, and then will store the value "2" to address B. However, some CPUs allow a range or ranges of memory to be marked such that writes to addresses within that range issued from the application may actually occur in the memory system in an order different from that issued by software. For example, the Intel Pentium Pro and Pentium II CPUs offer the ability to mark a range of addresses as "USWC" (Uncacheable Speculative Write Combining). Stores to memory addresses marked USWC occur in an undefined order. In the above example, this type of CPU could either store to address A then address B, or first to address B and then to address A. Without taking extra measures to prevent this reordering, as described below, the CPU offers the application software no guarantee as to the order in which the writes will actually occur. By rearranging the order of writes to memory, the CPU can optimize memory performance by issuing the writes to memory such that they will occur in the least amount of time. Normally such write reordering by the CPU is not a problem for software that writes to regular system RAM, but it causes significant problems for driver software storing commands to a graphics processor.
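The reason write reordering is normally harmless for regular system RAM can be illustrated with a short sketch. Memory is modeled here as a simple dictionary, and the name `apply_writes` is an assumption made for illustration. The model ignores caches and other observers of memory.

```python
from itertools import permutations


def apply_writes(writes):
    """Apply (address, value) writes in the given order; return final memory."""
    memory = {}
    for address, value in writes:
        memory[address] = value
    return memory


program_order = [("A", 1), ("B", 2)]
# With reordering permitted, either order of the two writes may occur.
possible_orders = list(permutations(program_order))
final_states = [apply_writes(order) for order in possible_orders]
```

Both orders leave address A holding "1" and address B holding "2". The difference between the orders matters only to an observer, such as a graphics processor, that acts on the writes as they arrive.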
For example, the draw triangle command, described above, which is issued using the first transport method may not operate properly if the memory writes are performed out of order. For example, in the first transport method the driver sends all of the vertex data to the vertex coordinate addresses, followed by a write to the command address to start the graphics processor processing the triangle command. If the CPU reorders this sequence of writes to store to the graphics processor's command address before any of the stores to the graphics processor's vertex data addresses, the graphics processor will process the draw triangle command using partially (or totally, depending upon how the CPU reordered the writes) incorrect vertex data, thus producing an incorrect image.
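This failure can be reproduced in a small software model of the first transport method. The function name `run_fixed_address`, the coordinate values, and the particular reordering shown are assumptions made for illustration. The registers are assumed to hold zero before any write.

```python
VERTEX_REGISTERS = ("V1X", "V1Y", "V2X", "V2Y", "V3X", "V3Y")


def run_fixed_address(writes):
    """Replay (address, data) writes; return the register values used per draw."""
    registers = {name: 0 for name in VERTEX_REGISTERS}
    draws = []
    for address, data in writes:
        if address == "CMD":
            if data == "DT":
                # The command executes with the registers as they are right now.
                draws.append(tuple(registers[n] for n in VERTEX_REGISTERS))
        else:
            registers[address] = data
    return draws


in_order = [("V1X", 10), ("V1Y", 20), ("V2X", 30), ("V2Y", 40),
            ("V3X", 50), ("V3Y", 60), ("CMD", "DT")]
# One possible reordering: the command write lands before the third vertex.
reordered = in_order[:4] + [in_order[6]] + in_order[4:6]

correct = run_fixed_address(in_order)
wrong = run_fixed_address(reordered)
```

In the reordered case the triangle is drawn with stale values for the third vertex, which corresponds to the partially incorrect vertex data described above.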
Similarly, in the second transport method, the driver sends the draw triangle command word first, followed by the vertex data, to successive addresses in the command buffer, followed by a write to a "begin processing" address that tells the graphics processor that a new command is available at the next location in the command buffer. If the write to the graphics processor's "begin processing" address is reordered by the CPU to occur before any of the other words are written to the command buffer, the graphics processor will immediately begin reading and processing the values contained in the command buffer. Since the CPU has not yet written some or all of the actual command and data words for the draw triangle command before the graphics processor begins processing the command, the graphics processor will read and process "old" values that had been previously stored at those command buffer addresses on the previous pass through the command buffer. Even worse, if the graphics processor reads and processes an "old" data word that does not represent a legal graphics processor command, it's likely that the graphics processor will end up "hanging" or not responding to new commands, making the entire computer system appear to be "stuck."
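The stale-data hazard in the second transport method can be sketched as follows. The eight-word buffer, the helper name `read_command`, the data values, and the point at which the reordered "begin processing" write lands are assumptions made for illustration.

```python
def read_command(buffer, start, length):
    """Read `length` words from a circular buffer beginning at offset `start`."""
    return [buffer[(start + i) % len(buffer)] for i in range(length)]


# Contents left over from a previous pass: an old triangle at offsets 0..6.
buffer = ["DT", 1, 2, 3, 4, 5, 6, None]
new_command = ["DT", 11, 12, 13, 14, 15, 16]

# Assume the "begin processing" write was reordered ahead of the last four
# data words, so only the first three words are in the buffer at that moment.
for offset in range(3):
    buffer[offset] = new_command[offset]
seen_by_gp = read_command(buffer, 0, 7)  # the graphics processor reads now

# The remaining writes arrive after the graphics processor has already read.
for offset in range(3, 7):
    buffer[offset] = new_command[offset]
```

The graphics processor in this sketch sees a mix of new and old words even though the buffer eventually holds the complete new command.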
Having the graphics processor produce bad images or hang the computer system are unacceptable behaviors. The driver software on the CPU and the graphics processor must together guarantee completely correct functionality. Some conventional solutions to these problems are described below.
A first solution is to disable CPU write reordering. This solution requires no graphics processor changes to support. However, it is an unacceptable solution for several reasons. First, on CPUs like the Pentium Pro and Pentium II processors, not using write reordering substantially reduces effective graphics processor performance (on the Pentium Pro, the maximum data throughput to a graphics processor using write reordering can be more than three times faster than when write reordering is disabled). Second, there are CPUs that cannot disable write reordering (e.g., the early "Alpha" processors from Digital Equipment Corp.), rendering this solution useless.
A second solution is to use CPU write reordering but with memory synchronization instructions or techniques. Again, this solution does not require costly graphics processor additions. Both the Intel and the DEC CPUs that perform write reordering include special instructions and methods known as "fences" or "memory barriers" (MBs) that, when used, will cause the CPU to guarantee that all memory writes that are issued before the memory barrier instruction will occur before any memory write issued after the memory barrier instruction. For example, in the following sequence:
write to address A
write to address B
MB--memory barrier
write to address C
write to address D
the CPU guarantees that both of addresses A and B will be written before either of addresses C or D (but A could still be written before or after B, and C can be written before or after D). Frequent use of memory barriers solves the above problems by guaranteeing that the CPU issues data to the graphics processor in the correct order. However, invoking a memory barrier on both the DEC and Intel CPUs can take a great many CPU cycles, and is thus extremely expensive in terms of performance. Even if the cost of the memory barrier itself were low in terms of CPU cycles, issuing frequent memory barrier instructions interferes with the CPU's write reordering mechanism, which can significantly reduce performance.
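The ordering guarantee of a memory barrier can be illustrated by enumerating the write orders it permits. The function name `allowed_orders` and the representation of a barrier as the string `"MB"` are assumptions made for illustration. The sketch models only the ordering rule and not the cost of the barrier.

```python
from itertools import permutations, product


def allowed_orders(program):
    """Enumerate the write orders in which no write crosses a barrier ("MB")."""
    groups, current = [], []
    for op in program:
        if op == "MB":
            groups.append(current)
            current = []
        else:
            current.append(op)
    groups.append(current)
    # Writes may be reordered freely within a group but never across groups.
    per_group = [list(permutations(group)) for group in groups]
    return [sum(choice, ()) for choice in product(*per_group)]


with_barrier = allowed_orders(["A", "B", "MB", "C", "D"])
without_barrier = allowed_orders(["A", "B", "C", "D"])
```

With the barrier, only four of the twenty-four possible orders remain, and in each of them both A and B are written before either C or D.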
A third solution is to use command buffers stored in CPU memory and have the graphics processor process the command buffers directly from CPU memory. Having the graphics processor "pull" the data across the bus instead of having the CPU "push" it across the bus to the graphics processor ensures that the graphics processor will receive the data in the correct order. However, this technique has serious drawbacks, including: 1) it requires the graphics processor to implement "bus mastering" logic (the logic that gives the graphics processor the ability to read data from CPU memory) which is expensive both in terms of size and added complexity to the graphics processor design; 2) it only works on bus technology that supports device bus mastering; 3) it does not remove the need for memory barriers when writing to the command buffer in host memory; and 4) performance is reduced due to the increased traffic to the CPU memory system.
A fourth solution is to have the graphics processor sort data words using a temporary, separate sort buffer. While this solution, when used together with the second method, potentially requires fewer memory barriers than the other solutions, it too has serious drawbacks, including: 1) separate sort buffers are expensive to implement in a graphics processor, requiring extra memory and complex logic; 2) the driver must still issue memory barriers after a number of words equal to the size of the sort buffer are written to the graphics processor. The larger the sort buffer, the fewer the memory barriers required, but the graphics processor becomes larger, more complex, and thus more expensive. The smaller the sort buffer, the more frequently the software must perform memory barrier instructions, thereby reducing performance.
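The trade-off between sort buffer size and memory barrier frequency can be sketched with a simple model. The function name `send_with_sort_buffer`, the use of a seeded random shuffle as a stand-in for CPU write reordering, and the assumption that the driver issues one memory barrier after each window of words are made only for illustration.

```python
import random


def send_with_sort_buffer(words, sort_buffer_words, rng):
    """Send (address, data) words in windows no larger than the sort buffer.

    Returns the words as ordered by the graphics processor and the number
    of memory barriers the driver issued.
    """
    received, barriers = [], 0
    for start in range(0, len(words), sort_buffer_words):
        window = words[start:start + sort_buffer_words]
        rng.shuffle(window)              # stand-in for CPU write reordering
        received.extend(sorted(window))  # graphics processor sorts by address
        barriers += 1                    # assumed: one barrier per window
    return received, barriers


words = [(address, f"w{address}") for address in range(14)]
large_result, large_barriers = send_with_sort_buffer(words, 7, random.Random(0))
small_result, small_barriers = send_with_sort_buffer(words, 2, random.Random(0))
```

Under these assumptions both buffer sizes restore the original order, but the two-word sort buffer needs seven memory barriers for fourteen words while the seven-word sort buffer needs only two.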
What is needed is a system and method for enabling a graphics processor to operate with a CPU that reorders write instructions without requiring expensive hardware and which does not significantly reduce the performance of the CPU.