A graphics processing unit (GPU) may be used for non-graphics related tasks, for example, general purpose computing on a GPU (GPGPU). In the GPGPU scenario, computing tasks may be offloaded from a related central processing unit (CPU) to the GPU. As such, the GPU may need to access memory that is external to the GPU. FIGS. 1A, 1B, and 1C show three different configurations for a GPU and a CPU. It is noted that these configurations are exemplary and that other configurations of a GPU and a CPU sharing a common memory are possible.
FIG. 1A shows a first configuration 100. In the configuration 100, a CPU 102 communicates with a system memory 104 and an input/output memory management unit (IOMMU) 106. The IOMMU 106 communicates with a GPU 108. In the configuration 100, the GPU 108 may access the system memory 104 via the IOMMU 106. The IOMMU 106, with the proper supporting software, additionally allows references to the system memory 104 to be non-resident. Typically, in the configuration 100, a CPU application first registers blocks of virtual memory data. As long as these blocks are resident in the system memory 104, whenever an address within such a block is requested by the GPU 108, the GPU 108 sends an Address Translation Service (ATS) request to the IOMMU 106 to obtain the corresponding physical memory address.
If the blocks are not resident in the system memory 104, dedicated CPU software handles the resulting request (for example, a Peripheral Page Service Request (PPR)), ensures that the corresponding blocks are swapped into the system memory 104, returns control to the IOMMU 106, and finally returns control to the GPU 108. In this case, the relatively long turnaround time may reduce the overall GPU performance, unless PPR requests make up a relatively small proportion of the total number of system memory requests originated by the GPU 108.
For the system memory 104, the IOMMU 106 handles the paging for the GPU 108. At the GPU's request, the IOMMU 106 translates a memory access (e.g., a virtual address) to a physical address. If the translation is successful, the GPU 108 receives the physical memory address from the IOMMU 106 and can then request access to the physical address through the CPU memory controller (not shown in FIG. 1A). But this process incurs considerable overhead, even if there is a hit during the translation (i.e., the data associated with the translated address is immediately available, e.g., in cache). If the requested virtual address is not resident in memory, then the data needs to be fetched from secondary storage, which involves an even longer delay because the memory access must now be serialized through the CPU 102, which eliminates the benefits of the parallel performance of the GPU 108.
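The translation flow described for the configuration 100 can be illustrated with a minimal software sketch. The sketch is a toy model only and does not represent actual IOMMU hardware or driver interfaces; the class `ToyIommu`, its methods, the 4096-byte page size, and the frame allocation scheme are all assumptions introduced for illustration. A resident page is resolved on the fast path (modeled as an ATS hit), while a registered but non-resident page takes the slow path (modeled as a PPR-style fault serviced by CPU software).

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes (not specified in the text)


@dataclass
class ToyIommu:
    """Hypothetical model mapping virtual page numbers (vpn) to frame numbers."""
    page_table: dict = field(default_factory=dict)   # resident pages: vpn -> frame
    backing_store: set = field(default_factory=set)  # registered, swapped-out vpns
    next_free_frame: int = 0
    ats_hits: int = 0     # translations served from resident pages
    ppr_faults: int = 0   # translations that needed a swap-in first

    def _alloc_frame(self):
        frame = self.next_free_frame
        self.next_free_frame += 1
        return frame

    def register(self, vpn, resident):
        """Model the CPU application registering a block of virtual memory."""
        if resident:
            self.page_table[vpn] = self._alloc_frame()
        else:
            self.backing_store.add(vpn)

    def translate(self, vaddr):
        """Model a GPU translation request and return a physical address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.page_table:
            self.ats_hits += 1            # fast path: page is resident
        elif vpn in self.backing_store:
            self.ppr_faults += 1          # slow path: CPU software swaps page in
            self.backing_store.remove(vpn)
            self.page_table[vpn] = self._alloc_frame()
        else:
            raise KeyError(f"virtual page {vpn} was never registered")
        return self.page_table[vpn] * PAGE_SIZE + offset


iommu = ToyIommu()
iommu.register(0, resident=True)
iommu.register(1, resident=False)
paddr_hit = iommu.translate(0x0010)    # page 0 is resident
paddr_fault = iommu.translate(0x1008)  # page 1 must be swapped in first
```

In this model, the ratio of `ppr_faults` to total translations corresponds to the proportion of PPR requests discussed above; the real cost difference between the two paths is not modeled.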
FIG. 1B shows a second configuration 120, in which a motherboard 122 and a video card 124 communicate via a bus 126. The motherboard 122 includes a CPU 130 and system memory 132. The video card 124 includes a GPU 134 and GPU memory 136. The GPU 134 may access the local GPU memory 136 or may access the system memory 132 over the bus 126. In an alternate implementation, the data needed by the GPU 134 may be transferred from the system memory 132 over the bus 126 to the GPU memory 136.
FIG. 1C shows a third configuration, an accelerated processing unit (APU) 140, which includes a CPU core 142 and a GPU core 144 that reside on the same die and share a memory 146. It is noted that while the memory 146 is shown separate from the APU 140, in some implementations, the memory 146 may also be located on the same die as the APU 140.
In a CPU, there is an issue known as the “paging problem.” There is a given amount of memory (either on-chip or off-chip) and occasionally, data from the main memory needs to be swapped out (i.e., paged out) to a secondary storage (disk, etc.). This may occur when an application has been inactive for a period of time. When the application is activated again and needs memory space, the previously swapped out data is swapped back into the memory from the secondary storage. The “problem” is that all of this swapping into and out of memory requires a lot of overhead.
Known algorithms on the CPU attempt to guess which memory area will be used next, so that the data stored in that memory area is not swapped out of memory to the secondary storage, if possible. Alternatively, this prediction may be used to “read ahead,” that is, to load the data into memory from the secondary storage before it is actually needed.
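As a simple illustration of reading ahead, the following sketch assumes a purely sequential predictor: after each access to a page, the next few pages are loaded in advance. The function name `access_with_read_ahead`, the `window` parameter, and the assumption of unlimited physical memory are hypothetical simplifications and are not taken from any particular known algorithm.

```python
def access_with_read_ahead(trace, window=2):
    """Count on-demand fetches versus read-ahead fetches for a page trace.

    Assumes unlimited physical memory, so nothing is ever swapped out,
    and guesses that the pages following the current one are used next.
    """
    resident = set()
    demand_fetches = 0   # fetches that stall the requester
    prefetches = 0       # fetches issued before the page is requested
    for page in trace:
        if page not in resident:
            demand_fetches += 1
            resident.add(page)
        # Read ahead: load the next `window` pages if not already resident.
        for nxt in range(page + 1, page + 1 + window):
            if nxt not in resident:
                prefetches += 1
                resident.add(nxt)
    return demand_fetches, prefetches


sequential_trace = [0, 1, 2, 3, 4, 5]
with_read_ahead = access_with_read_ahead(sequential_trace, window=2)
without_read_ahead = access_with_read_ahead(sequential_trace, window=0)
```

For a sequential trace, only the first access stalls when reading ahead, whereas every access stalls without it. For an irregular trace, the sequential guess would mostly load pages that are never used, which is why the quality of the prediction matters.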
For a CPU and memory configuration, there are a variety of methods for handling such cases, which are generally referred to as paging algorithms. A paging algorithm is characterized by the set of criteria it uses to determine which page blocks will be swapped into or out of the memory at given points in time, whenever the amount of physical memory is expected to be limited relative to the total current or predicted demand. A common paging algorithm is known as “least recently used,” and swaps out page blocks which have not been used in a given time interval. Other paging algorithms are also common.
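A least recently used policy can be sketched as follows. The sketch uses the common capacity-based formulation, in which the page block that has gone longest without use is swapped out when a free frame is needed, rather than a fixed time interval. The function name `simulate_lru` and its parameters are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict


def simulate_lru(trace, num_frames):
    """Replay a page trace against `num_frames` frames under an LRU policy.

    Returns the number of page faults and the list of evicted pages,
    in eviction order.
    """
    frames = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    evicted = []
    for page in trace:
        if page in frames:
            frames.move_to_end(page)      # hit: mark as most recently used
            continue
        faults += 1                       # miss: page must be swapped in
        if len(frames) >= num_frames:
            victim, _ = frames.popitem(last=False)  # least recently used
            evicted.append(victim)
        frames[page] = True
    return faults, evicted


faults, evicted = simulate_lru([1, 2, 3, 1, 4, 2], num_frames=3)
```

In this trace, the re-use of page 1 protects it from eviction, so pages 2 and 3 are swapped out instead when pages 4 and 2 are brought in.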
In a CPU environment, it is generally hard (and often almost impossible) to predict which page blocks are going to be accessed in the next given period of time. On a GPU, this problem is more acute than on a CPU. With a discrete GPU (DGPU; a GPU on a separate card or at a separate location on a motherboard, with the CPU on the motherboard, such as shown in FIG. 1B), there is local memory (on the DGPU card) and system memory (on the motherboard) that may be accessed by the DGPU. With an APU (such as shown in FIG. 1C), both the GPU and the CPU are on the motherboard and use the same physical memory. Some of this memory is reserved for the GPU, but most of the memory is allocated to the CPU. Currently, there is no hardware or hardware-supported memory paging mechanism in the GPU for the local memory.