1. Field of the Invention
The present invention is generally directed to computing operations performed in computing systems.
2. Background Art
A conventional computing system includes a plurality of hardware components, such as a central processing unit (CPU) and a graphics processing unit (GPU). The CPU is an integrated circuit (IC) that coordinates the operations of all the other devices of the computing system. A GPU is an integrated circuit that is adapted to perform data-parallel computing tasks, such as graphics-processing tasks. A GPU may, for example, execute graphics-processing tasks required by an end-user application, such as a video-game application.
A conventional computing system also includes system memory, such as random access memory (RAM). Typically, the CPU and GPU each have access to the system memory. In addition to the system memory, the GPU may also be coupled to a local memory.
Unfortunately, CPU reads to GPU local memory are slow. Specifically, the reads are performed uncached (UC), meaning that the data that is read is not copied into a local cache memory. Also, all uncached reads are 32 or 64 bits wide and serialized, meaning that the CPU issues only one read request at a time and waits for the data from that request to return before issuing another read request. As a result, each read transfers at most 64 bits and incurs the full latency of a round trip to the GPU local memory, so CPU reads to GPU local memory are conventionally slow.
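By way of illustration only, the cost of serialized uncached reads may be approximated with a simple model in which the total read time is the number of read requests multiplied by the round-trip latency of one request. The following is a minimal sketch under stated assumptions: the function name serialized_read_time_ns and the round-trip latency value are hypothetical and are not taken from any particular computing system.

```python
import math


def serialized_read_time_ns(total_bytes, read_width_bits, round_trip_ns):
    """Estimate the time to read total_bytes when reads are serialized.

    Only one read request is outstanding at a time, so the total time is
    (number of read requests) * (round-trip latency of one request).
    """
    bytes_per_read = read_width_bits // 8
    num_reads = math.ceil(total_bytes / bytes_per_read)
    return num_reads * round_trip_ns


if __name__ == "__main__":
    # ASSUMPTION: 1000 ns per round trip is an illustrative placeholder,
    # not a measured value for any real CPU, bus, or GPU.
    ASSUMED_ROUND_TRIP_NS = 1000
    BUFFER_BYTES = 4096

    for width_bits in (32, 64):
        t_ns = serialized_read_time_ns(BUFFER_BYTES, width_bits, ASSUMED_ROUND_TRIP_NS)
        print(f"{width_bits}-bit reads: {t_ns} ns for {BUFFER_BYTES} bytes")
```

In this model, the read time grows linearly with the amount of data and is dominated by the per-request latency; halving the read width doubles the number of round trips.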
What is needed, therefore, are systems, apparatuses, and methods for enabling a first processing unit (e.g., CPU) to quickly read a local memory of a second processing unit (e.g., GPU).