1. Field of the Invention
The present invention is generally directed to computing operations performed in computer systems. More particularly, the present invention is directed to a processing unit, such as a graphics-processing unit (GPU), that performs computing operations and applications thereof.
2. Background
A GPU is a complex integrated circuit that is adapted to perform data-parallel computing tasks, such as graphics-processing tasks. A GPU may, for example, execute graphics-processing tasks required by an end-user application, such as a video-game application. The GPU may be a discrete (i.e., separate) device and/or package or may be included in the same device and/or package as another processor (e.g., a central processing unit (CPU)). For example, GPUs are frequently integrated into routing or bridge devices, such as Northbridge devices.
There are several layers of software between the end-user application and the GPU. The end-user application communicates with an application-programming interface (API). An API allows the end-user application to output graphics data and commands in a standardized format, rather than in a format that is dependent on the GPU. Several types of APIs are commercially available, including DirectX® developed by Microsoft Corporation of Redmond, Wash. and OpenGL® promulgated by the Khronos Group. The API communicates with a driver. The driver translates standard code received from the API into a native format of instructions understood by the GPU. The driver is typically written by the manufacturer of the GPU. The GPU then executes the instructions from the driver.
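By way of illustration only, the layering described above may be sketched as follows. This is a minimal sketch, not the behavior of any actual API or driver: the command names, the opcode values, and the functions `driver_translate` and `gpu_execute` are hypothetical and are introduced solely to show a standardized command stream being translated into a native format before execution.

```python
# Hypothetical mapping from standardized API commands to native GPU opcodes.
# None of these names or values come from a real API, driver, or GPU.
STANDARD_TO_NATIVE = {
    "DRAW_TRIANGLES": 0x10,
    "SET_TEXTURE": 0x20,
    "CLEAR": 0x30,
}


def driver_translate(api_commands):
    """Translate (command, operand) pairs from the API into native instructions."""
    native = []
    for name, operand in api_commands:
        if name not in STANDARD_TO_NATIVE:
            raise ValueError(f"unsupported API command: {name}")
        native.append((STANDARD_TO_NATIVE[name], operand))
    return native


def gpu_execute(native_instructions):
    """Stand-in for the GPU: return the opcodes in the order they were executed."""
    return [opcode for opcode, _ in native_instructions]


# The end-user application emits commands in the standardized format only.
api_stream = [("CLEAR", 0), ("SET_TEXTURE", 7), ("DRAW_TRIANGLES", 12)]
executed = gpu_execute(driver_translate(api_stream))
```

In this sketch the end-user application never sees the native opcodes; only the hypothetical driver layer depends on the GPU's instruction format.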
A graphics-processing task performed by a GPU typically involves complex mathematical computations, such as matrix and vector operations. To perform a single graphics-processing task, a GPU may execute a plurality of different threads (i.e., sequences of instructions). Each thread may comprise a shader program, such as a geometry shader, a pixel shader, a vertex shader, or the like. Each thread (e.g., shader program) is typically associated with a set of state data (such as texture handles, shader constants, transform matrices, or the like) that is locally stored in data-storage units of the GPU. The locally stored state data is called a context.
To efficiently execute the various threads (e.g., shader programs) of a single graphics-processing task, the GPU includes an array of processing elements, called a shader core. The array of processing elements is organized into single-instruction, multiple-data (SIMD) devices. Multiple threads (e.g., shader programs) may be issued to the shader core at the same time, with the data needed to execute each thread (e.g., shader program) being distributed in parallel to different processing elements of the shader core. The different processing elements may then perform operations on the data in parallel. In this way, a GPU can perform the complex mathematical computations required for a graphics-processing task more quickly than a typical CPU. As a result, if a computing system includes a GPU, graphics-processing tasks (and other types of data-parallel processing tasks) are typically passed to the GPU, rather than the CPU.
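The SIMD arrangement described above may be illustrated, under simplifying assumptions, by the following sketch. The function names and the four-lane width are hypothetical. The loop merely emulates in sequence the result that the processing elements of a SIMD device would produce in parallel.

```python
def simd_execute(instruction, lane_data):
    """Apply one instruction to the data element held by each lane.

    A SIMD device performs this across its processing elements in parallel;
    this sequential loop only reproduces the result, not the timing.
    """
    return [instruction(element) for element in lane_data]


def scale_by_two(value):
    """Example instruction: the same operation is issued to every lane."""
    return 2 * value


# Four lanes (an assumed width), each holding a different data element.
lanes = [1, 2, 3, 4]
result = simd_execute(scale_by_two, lanes)
```

A single instruction is issued once, while each lane operates on its own data element, which is the property that suits such an array to matrix and vector operations.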
To pass tasks to the GPU, an operating-system (OS) scheduler stores the tasks in a command buffer. A conventional GPU processes one command buffer at a time. The OS scheduler serially places tasks in the command buffer, and the GPU typically processes the tasks in the order in which they are placed in the command buffer. In some instances, however, the GPU may process tasks out of the order in which they were placed in the command buffer. For example, the GPU may interrupt the execution of a first task to execute a more-important (e.g., low-latency) task that was placed in the command buffer after the first task.
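The in-order behavior of a conventional command buffer may be sketched as a simple first-in, first-out queue. The class `CommandBuffer`, the function `gpu_drain`, and the task names below are hypothetical and serve only to illustrate serial placement by the OS scheduler and in-order processing by the GPU.

```python
from collections import deque


class CommandBuffer:
    """Toy command buffer: tasks leave in the order in which they were placed."""

    def __init__(self):
        self._tasks = deque()

    def place(self, task):
        # The OS scheduler serially places tasks at the tail of the buffer.
        self._tasks.append(task)

    def next_task(self):
        # The GPU takes the oldest task, or None when the buffer is empty.
        return self._tasks.popleft() if self._tasks else None


def gpu_drain(buffer):
    """Process every task in the buffer and return the processing order."""
    order = []
    task = buffer.next_task()
    while task is not None:
        order.append(task)
        task = buffer.next_task()
    return order


buffer = CommandBuffer()
for name in ["render_frame", "physics", "decode_video"]:
    buffer.place(name)
processed = gpu_drain(buffer)
```

The sketch covers only the typical in-order case. Processing a later-placed task ahead of an earlier one requires the GPU to interrupt the task already executing.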
To perform the more-important (e.g., low-latency) task before the first task has finished executing in the shader core of the GPU, a conventional GPU performs a context switch. That is, the state data associated with the threads of the first task are swapped into back-up storage units maintained by the conventional GPU, and new state data associated with the threads (e.g., shader programs) of the more-important (e.g., low-latency) task are retrieved and placed in the data-storage units of the shader core. The shader core then executes the threads (e.g., shader programs) of the more-important (e.g., low-latency) task based on the new state data in the data-storage units. After the more-important (e.g., low-latency) task finishes executing, the state data associated with the threads of the more-important (e.g., low-latency) task are flushed from the data-storage units, and the state data from the threads of the first task are swapped back into the data-storage units of the shader core. Then, the shader core can resume executing the threads of the first task.
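The sequence of steps in a conventional context switch may be sketched as follows. The class `ShaderCore`, its method names, and the dictionary representation of a context are hypothetical simplifications. An actual GPU stores considerably more state, and the sketch models only the ordering of the save, load, flush, and restore steps.

```python
class ShaderCore:
    """Toy shader core with one set of data-storage units and one back-up."""

    def __init__(self, context):
        self.data_storage = context   # context of the task currently executing
        self.backup_storage = None    # holds the context of a preempted task
        self.log = []                 # records each step for inspection

    def run(self):
        self.log.append(f"run {self.data_storage['task']}")

    def switch_to(self, new_context):
        # Swap the running task's state data into back-up storage, then place
        # the new task's state data in the data-storage units.
        self.backup_storage = self.data_storage
        self.log.append(f"save {self.backup_storage['task']}")
        self.data_storage = new_context
        self.log.append(f"load {new_context['task']}")

    def switch_back(self):
        # Flush the finished task's state data and restore the preempted context.
        self.log.append(f"flush {self.data_storage['task']}")
        self.data_storage = self.backup_storage
        self.backup_storage = None
        self.log.append(f"restore {self.data_storage['task']}")


core = ShaderCore({"task": "first", "constants": [1.0, 2.0]})
core.run()                                              # first task begins
core.switch_to({"task": "urgent", "constants": [9.0]})  # preempt the first task
core.run()                                              # low-latency task executes
core.switch_back()                                      # flush and restore
core.run()                                              # first task resumes
```

Each preemption in this sketch costs a save and a load before the urgent task can run, plus a flush and a restore afterward, and the back-up storage must be large enough to hold a complete context.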
Although context switching allows a GPU to process tasks out of the order in which they were placed in the command buffer, context switching is problematic for several reasons. As an initial matter, a substantial amount of time is required to perform a context switch, thereby limiting the performance of the GPU. Also, context switching requires additional local memory (e.g., back-up storage units) to store the context that is being switched. The additional local memory takes up precious chip area, resulting in a larger GPU.
In addition to requiring a substantial amount of time and area, context switching makes the GPU ineffective at processing low-latency, high-priority tasks. To prepare the shader core for executing a low-latency, high-priority task, a conventional GPU must perform a context switch. The time associated with the context switch (e.g., hundreds of clock cycles) makes the effective time for executing the low-latency, high-priority task relatively long, even though the actual time for executing the low-latency, high-priority task may be relatively short (e.g., tens of clock cycles).
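The effect of the context-switch time on effective latency may be illustrated with a short calculation. The cycle counts below are assumed values chosen only to match the orders of magnitude mentioned above (hundreds of clock cycles for the context switch, tens for the task itself). They are not measurements of any particular GPU, and the function name is hypothetical.

```python
def effective_cycles(switch_cycles, execute_cycles):
    """Cycles from the request until the low-latency task finishes executing."""
    return switch_cycles + execute_cycles


SWITCH_CYCLES = 300   # assumed: "hundreds of clock cycles" for the context switch
EXECUTE_CYCLES = 30   # assumed: "tens of clock cycles" for the task itself

total = effective_cycles(SWITCH_CYCLES, EXECUTE_CYCLES)
overhead_share = SWITCH_CYCLES / total   # fraction of the total spent switching
```

With these assumed values, the task completes after 330 clock cycles, of which roughly 91 percent is context-switch overhead rather than useful execution.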
Given the foregoing, what is needed is a processing unit that can efficiently process important (e.g., low-latency) tasks without context switching.