1. Field of the Invention
Embodiments of the present invention generally relate to a parallel and pipelined graphics architecture and, more specifically, to bandwidth compression for shader engine store operations.
2. Description of the Related Art
Graphics systems may adopt a highly parallel and pipelined architecture to meet the ever-increasing demands for realism, quality, and real-time interactivity of displayed images and videos. To that end, some graphics processing units (“GPUs”) may include a plurality of streaming multiprocessors (also commonly referred to as “shader engines”), each of which is configured with a single-instruction, multiple-data (“SIMD”) architecture and is capable of executing a single instruction across different data being processed within multiple streaming processors included within the streaming multiprocessor. The series of instructions issued to a single streaming processor is referred to herein as a “thread,” and a collection of concurrently executing threads among the streaming processors within the streaming multiprocessor is referred to herein as a “thread group.” Additionally, several thread groups may be active (in different phases of execution) at the same time on a given streaming multiprocessor, resulting in a large number of threads executed in one streaming multiprocessor every clock cycle. For example, a streaming multiprocessor supporting 24 active thread groups at the same time, with each thread group including 32 threads, may execute 768 threads per clock cycle. Thus, for a GPU that includes several streaming multiprocessors, the number of threads executing every clock cycle may quickly rise to several thousand.
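By way of illustration only, the thread-count arithmetic in the foregoing example may be sketched as follows. The function name `threads_in_flight` and the count of eight streaming multiprocessors are hypothetical and are chosen solely for this sketch; the figures of 24 active thread groups and 32 threads per thread group are taken from the example above.

```python
def threads_in_flight(active_thread_groups: int,
                      threads_per_group: int,
                      num_multiprocessors: int = 1) -> int:
    """Return the number of threads active across the given multiprocessors.

    Hypothetical helper: it simply multiplies the three quantities and
    models no scheduling behavior.
    """
    return active_thread_groups * threads_per_group * num_multiprocessors


# One streaming multiprocessor: 24 thread groups of 32 threads each.
per_multiprocessor = threads_in_flight(24, 32)
print(per_multiprocessor)  # 768

# Assumed GPU with eight streaming multiprocessors (illustrative value only).
gpu_total = threads_in_flight(24, 32, num_multiprocessors=8)
print(gpu_total)  # 6144
```

Under these assumed values, a GPU with only eight streaming multiprocessors would already have more than six thousand threads active at once.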
Each instruction executed by a streaming multiprocessor may perform arithmetic, logical and/or memory operations, including write (“STORE”) and read (“LOAD”) operations to and from the frame buffer memory. The streaming multiprocessors and the frame buffer memory for the GPU commonly reside in different chips, requiring all frame buffer accesses to be conducted over a chip-to-chip interconnect. The interface between the graphics rendering engine and the external memory devices making up the frame buffer is referred to herein as the “frame buffer interface,” and the data bandwidth of the frame buffer interface is referred to herein as the “memory bandwidth.” Depending on the design, 4 bytes (“B”) of memory bandwidth may be required for each thread to carry out a STORE or LOAD operation where scalar data is written to or read from the frame buffer memory. In such a case, a thread group that includes 32 threads would require 128B of memory bandwidth to support 32 simultaneous write or read operations to the frame buffer memory. Further, depending on the design, a single chip-to-chip interconnect may transmit 16B of data per clock cycle. Thus, eight clock cycles (128B divided by 16B per clock cycle) would be required for all 32 threads to write or read their respective scalar data to or from the frame buffer memory. Even though the frame buffer interface may include several chip-to-chip interconnects connecting streaming multiprocessors with different partitions of the frame buffer memory, the data bandwidth necessary to accommodate the read/write requests from the potentially thousands of threads executing across the different streaming multiprocessors is usually much greater than the available memory bandwidth. As a result, the unprocessed read/write requests may accumulate at the frame buffer interface, substantially limiting the memory-related performance of the graphics system.
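By way of illustration only, the bandwidth arithmetic in the foregoing example may be sketched as follows. The function name `store_cycles` is hypothetical; the figures of 32 threads, 4B per thread, and 16B per clock cycle are taken from the example above, and the sketch assumes a single chip-to-chip interconnect with no overlap between transfers.

```python
def store_cycles(threads: int,
                 bytes_per_thread: int,
                 interconnect_bytes_per_cycle: int) -> tuple[int, int]:
    """Return (total bytes, clock cycles) for one STORE or LOAD by a thread group.

    Hypothetical helper: assumes a single interconnect that moves a fixed
    number of bytes per clock cycle, with partial transfers rounded up.
    """
    total_bytes = threads * bytes_per_thread
    # Ceiling division: a partially filled transfer still costs a full cycle.
    cycles = -(-total_bytes // interconnect_bytes_per_cycle)
    return total_bytes, cycles


# 32 threads, 4B of scalar data per thread, 16B per clock cycle.
total_bytes, cycles = store_cycles(32, 4, 16)
print(total_bytes, cycles)  # 128 8
```

Under these assumed values, a single scalar STORE or LOAD by one thread group occupies the interconnect for eight clock cycles, which indicates why requests from many thread groups may accumulate at the frame buffer interface.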
As the foregoing illustrates, what is needed in the art is a technique that effectively increases memory bandwidth for performing STORE and LOAD operations for a thread group.