For a variety of reasons, it is desirable for an application program running on a central processing unit (CPU) to be able to rapidly transfer data to or from a graphics processing unit (GPU). For example, the application program may have image data that it wants the GPU to render. Alternatively, the application program may wish to receive pixel data that the GPU has generated. Because a GPU generates pixel data more efficiently than a CPU, this allows the GPU to be used as a co-processor. These data transfers should be executed as rapidly as possible, both for efficient rendering and for efficient use of the GPU as a co-processor.
However, it is conventionally difficult to perform an asynchronous Direct Memory Access (DMA) transfer between the GPU and system memory. Conventionally, time-consuming additional steps are taken to set up the DMA, thus slowing the overall data transfer rate. The GPU typically requires that any DMA transfer between it and system memory at least appear to involve a contiguous range of system memory addresses. However, it is extremely unlikely that system memory will have a contiguous range of memory addresses available for the application program. Conventionally, a graphics aperture re-mapping table (GART) is used to make it appear to the GPU that the data involved in the DMA resides in a contiguous range of memory. FIG. 1A illustrates a memory space 102 including 256 MB of main (physical) memory and a 128 MB Accelerated Graphics Port (AGP) aperture region. The main memory has allocated therein several 4 KB pages, which are allocated during AGP memory initialization and are typically noncontiguous, as illustrated. The GART 104 maps the noncontiguous pages in main memory to contiguous pages in the AGP aperture region. Thus, it appears to the GPU that the data is located in a contiguous range of addresses.
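The mapping performed by the GART can be sketched as a simple page-table lookup. The following is a minimal illustration only; the physical page addresses and table contents are hypothetical and are not taken from FIG. 1A.

```python
PAGE_SIZE = 4096  # 4 KB pages, as allocated during AGP memory initialization

# Hypothetical GART: entry i holds the base physical address of the
# noncontiguous main-memory page that backs the i-th page of the
# contiguous AGP aperture.
gart = [0x0009A000, 0x00213000, 0x00041000, 0x00788000]

def aperture_to_physical(aperture_addr):
    """Translate a contiguous aperture address into the scattered
    physical address it is mapped to."""
    page_index = aperture_addr // PAGE_SIZE
    offset = aperture_addr % PAGE_SIZE
    return gart[page_index] + offset
```

From the GPU's perspective the aperture addresses 0 through `4 * PAGE_SIZE - 1` form one contiguous range, even though the backing pages are scattered throughout main memory.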
However, in order to take advantage of the GART 104, some conventional techniques require a transfer of data from a region of system memory that is allocated to the application program to the region of system memory that was allocated for AGP transfers. This requires several physical data transfers between system memory and the CPU. FIG. 1B illustrates a conventional write to the GPU. First, the application program copies the data from the CPU 110 to a location in system memory 120 that belongs to the application program. The application program then executes an OpenGL® call to initiate the transfer of the data from application space 121 in system memory 120 to the GPU 140 via the northbridge 130. (OpenGL® is a cross-platform standard for 3D rendering and 3D hardware acceleration.) Next, the data is transferred from the application space 121 in system memory 120 to the CPU 110 via the northbridge 130. Then, the data is transferred back to a location 123 in system memory 120 that is accessible to the GART. Finally, an asynchronous DMA is executed to transfer the data from that GART-accessible location 123 in system memory 120 to the GPU 140.
Therefore, this sequence requires four separate data transfers. Moreover, three of the data transfers occur on the front-side bus between the CPU 110 and system memory. For many systems, the front-side bus data transfers are considerably slower than the data transfer between system memory and the GPU 140.
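The conventional write sequence described above can be tallied as follows. This is a hypothetical trace of the path in FIG. 1B; the step names are illustrative and do not correspond to any real driver API.

```python
def conventional_write_trace():
    """Return the four data transfers of the conventional GPU write
    path, each tagged with the bus that carries it."""
    return [
        ("cpu -> app_space", "front-side bus"),    # app copies data out
        ("app_space -> cpu", "front-side bus"),    # data read back to CPU
        ("cpu -> gart_space", "front-side bus"),   # copied to GART-accessible region
        ("gart_space -> gpu", "AGP/DMA"),          # asynchronous DMA to the GPU
    ]

write_trace = conventional_write_trace()
fsb_hops = sum(1 for _, bus in write_trace if bus == "front-side bus")
```

Counting the entries confirms the text: four transfers in total, three of which cross the comparatively slow front-side bus.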
A similar problem occurs if the application program desires to read data from the GPU 140. FIG. 2 illustrates a conventional read of data from the GPU 140 to the CPU 110. The sequence starts with the application program making an OpenGL® call to initiate the reading of the image data from the GPU's memory 150. In response to a command from the northbridge 130, the GPU 140 then transfers the image data 125 to a location 123 in system memory 120 to which it has access. This data transfer may be a PCI (Peripheral Component Interconnect) transfer or an AGP transfer. After the data transfer is completed, the northbridge 130 transfers the image data from system memory 120 to the CPU 110. Then, the northbridge 130 transfers the image data back to system memory 120, into the application space 121. Finally, the application program reads the image data. Thus, there are again three transfers between the CPU 110 and system memory 120.
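The read path of FIG. 2 can be tallied the same way. Again, this is only an illustrative trace; the step names are hypothetical.

```python
def conventional_read_trace():
    """Return the data transfers of the conventional GPU read path,
    each tagged with the bus that carries it."""
    return [
        ("gpu -> gart_space", "PCI/AGP"),        # GPU writes image data out
        ("gart_space -> cpu", "front-side bus"), # northbridge moves it to the CPU
        ("cpu -> app_space", "front-side bus"),  # and back into application space
        ("app_space -> cpu", "front-side bus"),  # application finally reads it
    ]

read_trace = conventional_read_trace()
cpu_memory_hops = sum(1 for _, bus in read_trace if bus == "front-side bus")
```

As with the write path, three of the transfers are between the CPU and system memory, with only the initial GPU-to-memory transfer occurring over PCI or AGP.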
One reason for these additional and time-consuming transfers between the CPU and system memory is that the extra transfers avoid a call to the operating system to lock down the system memory involved in the DMA. While the application program could call the operating system each time a data transfer is desired, such a call consumes substantial time, even relative to the time consumed by the extra data transfers. Thus, it is undesirable for the application program to call the operating system to lock down the application program's memory.
Therefore, it would be advantageous to provide rapid data transfer between an application program and a hardware device, such as a GPU. It would be further advantageous if the method does not require frequent calls from the application program to the operating system to lock down application memory. It would be still further advantageous if the method does not require multiple data transfers between system memory and the CPU.