1. Field of the Invention
The present invention is generally directed to computing operations performed in computer systems. More particularly, the present invention is directed to processing units that perform computing operations in computer systems.
2. Background
A graphics-processing unit (GPU) is a complex integrated circuit that is adapted to perform graphics-processing tasks. A GPU may, for example, execute graphics-processing tasks required by an end-user application, such as a video-game application. The GPU may be a discrete (i.e., separate) device and/or package or may be included in the same device and/or package as another processor (e.g., a CPU). For example, GPUs are frequently integrated into routing or bridge devices, such as Northbridge devices.
There are several layers of software between the end-user application and the GPU. The end-user application communicates with an application-programming interface (API). An API allows the end-user application to output graphics data and commands in a standardized format, rather than in a format that is dependent on the GPU. Several types of APIs are commercially available, including DirectX® developed by Microsoft Corporation of Redmond, Washington and OpenGL® promulgated by Khronos Group. The API communicates with a driver. The driver translates standard code received from the API into a native format of instructions understood by the GPU. The driver is typically written by the manufacturer of the GPU. The GPU then executes the instructions from the driver.
The graphics-processing tasks performed by GPUs typically involve complex mathematical computations, such as matrix and vector operations. To efficiently perform these computations, GPUs typically include an array of processing elements, called a shader engine. The array of processing elements is organized into single-instruction, multiple-data (SIMD) devices. A shader engine executes a sequence of instructions, called a shader program. The data needed to execute the shader program is distributed in parallel to different processing elements of the shader engine. The different processing elements may then perform the same operation on different pieces of the data. In this way, a GPU can perform the complex mathematical computations required for graphics-processing tasks more quickly than a typical central-processing unit (CPU).
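By way of a non-limiting illustration, the SIMD behavior described above may be sketched in Python as follows. The sketch is a software simulation only; the function name simd_execute, the four-lane width, and the sample data are hypothetical and are chosen solely for illustration. An actual shader engine performs this operation in hardware across its processing elements.

```python
from typing import Callable, List


def simd_execute(op: Callable[[float], float], lanes: List[float]) -> List[float]:
    """Apply one operation to every lane, modeling a single SIMD instruction."""
    # Each list element stands in for the data held by one processing element.
    return [op(x) for x in lanes]


# Hypothetical 4-lane SIMD device scaling four vertex coordinates by 2.
vertex_x = [1.0, 2.5, -3.0, 0.5]
scaled = simd_execute(lambda x: x * 2.0, vertex_x)
print(scaled)
```

In this sketch, a single operation (multiplication by two) is issued once and applied to four different pieces of data, which corresponds to the different processing elements performing the same operation on different pieces of the data.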
In the past, GPUs may have included different shader engines to execute the different shader programs required to complete a single graphics-processing task. For example, a single graphics-processing task may require the execution of at least two different shader programs: a vertex shader to manipulate vertices of a triangle; and a pixel shader to determine pixels to be displayed on a display device (e.g., computer screen). To perform these two sets of computations, a typical GPU may have included two different shader engines: (i) a first shader engine to perform the vertex shader; and (ii) a second shader engine to perform the pixel shader.
Recently, GPUs have been designed to include a unified shader engine. A unified shader engine includes an array of processing elements capable of performing several different types of shader programs. A unified shader engine may execute, for example, a vertex shader, a geometry shader, and a pixel shader—with each shader recirculating through the array of processing elements of the unified shader, rather than progressing to different shader engines in a pipeline. In addition to the typical graphics-processing tasks (e.g., vertex shaders, geometry shaders, pixel shaders, etc.), unified shader engines have also been used more recently to perform general-compute operations (e.g., mathematical algorithms, physics simulations, etc.).
For a GPU to remain competitive, its compute power should continually increase to keep pace with consumer demand and with advances in the requirements of end-user applications and APIs. One way to increase the compute capability of a GPU is to increase the number of processing elements in the array of the shader engine. However, to provide workloads and data to the increased number of processing elements, the input/output buses feeding the processing elements would need to increase correspondingly just to maintain the presently available capabilities of a GPU.
A potential solution for increasing the compute power of a GPU is to increase the width of the SIMDs included in the shader engine. However, this solution would have problems with SIMD divergence. SIMD divergence occurs when different threads running on a SIMD device take different paths through a branch instruction of a shader program. For example, a shader program may have a branch instruction as illustrated in Table 1. SIMD divergence would occur, for example, if a first thread running on a SIMD device enters the “if” section (i.e., operation 1) of the branch instruction and a second thread running on the SIMD device enters the “else” section (i.e., operation 2) of the branch instruction. In this scenario, the second thread (which entered the “else” section) would have to wait for the first thread (which entered the “if” section). The waiting associated with SIMD divergence costs a shader program additional time to execute. Due to the potential for SIMD divergence, simply increasing the width of the SIMDs may not be a viable option for increasing the compute power of a GPU.
TABLE 1
if(condition){
   operation 1;
}
else{
   operation 2;
}
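By way of a non-limiting illustration, the cost of SIMD divergence may be sketched in Python as follows. The sketch assumes a simplified masking model in which a SIMD device makes one pass over each section of the branch that at least one thread enters, with the remaining threads idle during that pass. This model, the function name simd_branch_passes, and the four-thread examples are assumptions made solely for illustration and do not describe any particular GPU.

```python
from typing import List


def simd_branch_passes(conditions: List[bool]) -> int:
    """Count the serialized passes one SIMD device needs for an if/else branch.

    Assumed model: the device runs the "if" section with only the threads
    that took it enabled, then the "else" section with the remaining threads.
    """
    takes_if = any(conditions)        # at least one thread enters "if"
    takes_else = not all(conditions)  # at least one thread enters "else"
    # Each section entered by at least one thread costs one pass.
    return int(takes_if) + int(takes_else)


# All four threads evaluate the condition the same way: no divergence.
uniform = simd_branch_passes([True, True, True, True])

# One thread takes the "else" section: the device must run both sections.
divergent = simd_branch_passes([True, False, True, True])

print(uniform, divergent)
```

Under this assumed model, the uniform case requires one pass and the divergent case requires two. A wider SIMD carries more threads per branch instruction, so it becomes more likely that at least one thread diverges from the others.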
Another potential solution for increasing the compute power of a GPU is to increase the stack of processing elements (e.g., SIMDs) in the array of the shader engine. However, this solution is problematic for several reasons. As an initial matter, increasing the stack of processing elements could result in an elongated chip, potentially creating manufacturing issues. In addition, increasing the stack of processing elements creates an increased input latency associated with providing workloads to the stack and an increased output latency associated with routing the results from the stack. Moreover, there would be an increased latency for providing data (e.g., state data) to the stack. Thus, simply increasing the depth of the stack of the processing elements may not be a viable option for increasing the compute power of a GPU.
Given the foregoing, what is needed is a GPU with increased compute power and applications thereof.