1. Field of the Invention
The present invention generally relates to parallel computation systems and, more specifically, to efficient placement of texture barrier instructions.
2. Description of the Related Art
In conventional computing systems having both a central processing unit (CPU) and a graphics processing unit (GPU), the CPU assigns computational tasks to the GPU and provides the GPU with executable kernels for performing the computational tasks. The CPU may implement a thread program compiler configured to generate the executable kernels. Each executable kernel includes instructions that the GPU may execute in parallel. Each instance of an executing kernel program is referred to as a thread, and a set of threads is typically organized to execute as a thread group or “warp” of threads. A given computational task performed by the GPU typically comprises a set of multiple threads configured to perform highly parallel operations, such as generating and texturing pixels within an image or sequence of images.
During the course of performing these highly parallel operations, the GPU typically generates multiple memory requests that may remain simultaneously outstanding for significant spans of time. Each thread may generate memory requests in a specific order, and each memory request may require a significant number of machine cycles to complete. Each destination register for a memory request is typically configured to operate in conjunction with a scoreboard circuit, which tracks memory request completion to enforce proper instruction execution semantics regardless of when a particular memory request is actually fulfilled. For example, when a memory read request is generated to a particular address in memory, a target register is named as storage for the data to be read from memory. At the same time, a scoreboard register is configured to wait for the memory read request to be fulfilled before allowing any instructions that use the target register to execute.
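By way of illustration only, the scoreboard behavior described above may be modeled in software as a set of registers with reads still outstanding. The following is a minimal sketch under stated assumptions, not a description of any actual GPU circuit: the class name `Scoreboard`, its methods, and the register names `r1`, `r2`, and `r3` are hypothetical and are introduced solely for this example.

```python
class Scoreboard:
    """Hypothetical software model of a scoreboard (illustration only)."""

    def __init__(self):
        # Registers whose memory read request has not yet been fulfilled.
        self.pending = set()

    def issue_load(self, reg):
        # A read request is generated; mark its target register as pending.
        self.pending.add(reg)

    def complete_load(self, reg):
        # Memory has returned the data; the target register is now valid.
        self.pending.discard(reg)

    def can_issue(self, source_regs):
        # An instruction may execute only if none of the registers it
        # uses is still waiting on an outstanding read.
        return self.pending.isdisjoint(source_regs)


sb = Scoreboard()
sb.issue_load("r1")                          # read request targets r1
blocked = not sb.can_issue(["r1", "r2"])     # instruction uses r1: must wait
independent_ok = sb.can_issue(["r3"])        # unrelated instruction may run
sb.complete_load("r1")                       # request fulfilled, any time later
ready = sb.can_issue(["r1", "r2"])           # dependent instruction may now run
```

In this sketch, an instruction that uses the target register is held until the read completes, while an unrelated instruction is free to proceed, regardless of how many cycles the memory request takes.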
While scoreboard techniques are known in the art to produce semantically correct results, a scoreboard system that manages large numbers of pending operations destined for a large number of storage registers within a modern GPU occupies significant die area and consumes significant power. Power consumption is an important factor in GPU viability for many mobile devices, which rely primarily on battery power for operation. Reducing power consumption in mobile devices enables longer battery life and more efficient design. Power consumption is also an important factor in high-throughput server systems, as power consumption has a direct impact on operating costs and computational density of the server systems.
Accordingly, what is needed in the art is a technique for reducing GPU power consumption in scenarios involving multiple outstanding memory requests.