1. Field of the Invention
Embodiments of the present invention generally relate to graphics processing. More particularly, embodiments of the present invention relate to improving a graphical shader by using delay.
2. Related Art
A graphics processing unit (GPU) is a semiconductor device that specializes in processing graphical data more rapidly than a typical central processing unit (CPU). The GPU is also utilized for general purpose processing tasks typically performed by the CPU. Within the GPU, there is a graphical shader that performs numerous operations on graphical data to obtain desired optical features and that interfaces with a texture unit. The texture unit further modifies the graphical data to have desired texture and optical features. In some implementations, the texture unit is implemented as part of the graphical shader. Generally, the fabricated GPU utilizes most of the semiconductor area available on the semiconductor chip die. In particular, the graphical shader uses a significant portion of the semiconductor area. Moreover, the processing speed of the GPU is measured by the amount of graphical data that is processed in a given time period. Further, the amount of graphical data that can be processed by the graphical shader substantially affects the processing speed of the GPU. Hence, improvements in processing by the graphical shader lead to performance enhancements for the GPU.
Typically, the graphical shader processes or operates on groups of graphical data. The size of these groups depends on various factors. For example, the group may be a pixel quad (or quad), which is a 2×2 pixel block. However, the group may have a different number of pixels. In general, the group may have from 1 to M pixels, where M is an integer.
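As an illustration of the grouping described above, the following minimal sketch partitions a pixel grid into 2×2 quads. The function name `split_into_quads` and the handling of odd dimensions (smaller edge groups) are assumptions made for this example and are not taken from any particular GPU.

```python
def split_into_quads(width, height):
    """Return a list of 2x2 pixel blocks covering a width x height grid.

    Each quad is a list of (x, y) coordinates in row-major order.
    Groups on the right or bottom edge are smaller when a dimension is odd
    (an assumption of this sketch).
    """
    quads = []
    for y0 in range(0, height, 2):
        for x0 in range(0, width, 2):
            quad = [(x, y)
                    for y in range(y0, min(y0 + 2, height))
                    for x in range(x0, min(x0 + 2, width))]
            quads.append(quad)
    return quads


# A 4x2 pixel grid yields two quads of four pixels each.
quads = split_into_quads(4, 2)
```

Other group sizes, from 1 to M pixels, could be modeled by replacing the fixed block edge of 2 with a parameter.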
Recently, improvements in graphical processing by the graphical shader have been achieved through the use of a graphical shader architecture that provides for multiple shader pipelines, enabling parallel processing of the graphical data. Each shader pipeline is programmable to perform graphical processing operations on graphical data. For example, multiple quads are distributed to a shader pipeline for processing. The set of quads distributed to the shader pipeline is called a “segment”. The segment may loop through the shader pipeline several times before the shader pipeline completes processing the segment. Each loop through the shader pipeline is called a “shader pass” and executes one or more instructions of a shader program. Further, the shader pipelines have access to texture and/or memory data, which are often cached for improved performance.
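The relationship between a segment, a shader program, and shader passes can be sketched as follows. This is a simplified model under stated assumptions: quads are represented by plain integers, instructions by Python functions, and the hypothetical parameter `instructions_per_pass` fixes how many instructions one pass executes. None of these details come from an actual shader pipeline.

```python
def run_segment(segment, program, instructions_per_pass):
    """Loop a segment through a pipeline until the program is finished.

    segment: list of values standing in for quads.
    program: list of functions, each standing in for one shader instruction.
    Returns the processed segment and the number of shader passes taken.
    """
    passes = 0
    pc = 0  # index of the next instruction to execute
    while pc < len(program):
        # One shader pass: apply the next group of instructions to every quad.
        for instruction in program[pc:pc + instructions_per_pass]:
            segment = [instruction(quad) for quad in segment]
        pc += instructions_per_pass
        passes += 1
    return segment, passes


# Three instructions at two instructions per pass require two passes.
program = [lambda q: q + 1, lambda q: q * 2, lambda q: q - 3]
result, passes = run_segment([1, 2, 3], program, instructions_per_pass=2)
```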
Conventionally, a segment is distributed to one of the shader pipelines as soon as the segment is available. If the graphical shader has four operable shader pipelines and four segments are available, each segment will be distributed to a respective shader pipeline, usually, but not necessarily, in sequential order. That is, the four shader pipelines operate concurrently and may require concurrent use of the graphical shader's and GPU's resources. This causes several problems. Some of these problems are shown in FIG. 1.
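The conventional distribution described above may be sketched as a simple sequential assignment of available segments to pipelines. The function name `distribute_sequentially` and the wrap-around behavior when there are more segments than pipelines are assumptions of this example.

```python
def distribute_sequentially(segments, num_pipelines):
    """Assign each available segment to the next pipeline in order.

    Returns a dict mapping pipeline index to the list of segments it receives.
    Wrapping around after the last pipeline is an assumption of this sketch.
    """
    assignment = {p: [] for p in range(num_pipelines)}
    for i, segment in enumerate(segments):
        assignment[i % num_pipelines].append(segment)
    return assignment


# Four available segments and four pipelines: one segment per pipeline,
# so all four pipelines are busy at the same time.
assignment = distribute_sequentially(["seg0", "seg1", "seg2", "seg3"], 4)
```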
FIG. 1 illustrates a timing scheme 100 for a conventional graphical shader having multiple shader pipelines. As shown in FIG. 1, four shader pipelines (e.g., SP1, SP2, SP3, and SP4) have received a respective segment sequentially. Each shader pipeline performs a first shader pass (Pass 1) and a second shader pass (Pass 2). During the first shader pass (Pass 1), each shader pipeline (e.g., SP1, SP2, SP3, and SP4) typically accesses the cache of the graphical shader. For example, the cache access may be needed to obtain a particular texture for the respective segment being processed.
Between time T=Ta and T=Tb, the four shader pipelines (e.g., SP1, SP2, SP3, and SP4) are performing the first shader pass (Pass 1) on their respective segments. That is, up to four shader pipelines may simultaneously request access to the cache of the graphical shader. This situation creates bursts in cache accesses that overwhelm the cache bandwidth. Moreover, this situation creates cache memory coherency problems: data requests issued at the same time by different shader pipelines often have poor correlation, since their graphical pixels are often temporally and spatially disjoint from each other. Cache memory coherency problems arise when, due to the finite number of cache lines in the cache, cache lines are overwritten by one shader pipeline while they are still in use by another shader pipeline. As described above, the graphical shader uses a significant portion of the semiconductor area, so an increase in the size or number of caches may be expensive and impractical. The net effect of these problems is degradation in graphical processing performance. These problems are amplified as the number of shader pipelines is increased.
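The effect of uncorrelated, simultaneous requests on a small cache can be illustrated with a toy simulation. The sketch below assumes a fully associative cache with least-recently-used replacement, a capacity of four lines, and four pipelines that each re-read their own four lines three times. All of these parameters are hypothetical and are chosen only to make the effect visible.

```python
from collections import OrderedDict


def count_misses(accesses, capacity):
    """Count misses for an access sequence on an LRU cache of `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def pipeline_accesses(pipeline_id, lines=4, repeats=3):
    """Each pipeline repeatedly reads its own disjoint set of cache lines."""
    return [(pipeline_id, i) for _ in range(repeats) for i in range(lines)]


streams = [pipeline_accesses(p) for p in range(4)]

# Concurrent operation: one request from each pipeline in turn.
interleaved = [access for group in zip(*streams) for access in group]
# For comparison: the same requests with each pipeline running alone.
one_at_a_time = [access for stream in streams for access in stream]

concurrent_misses = count_misses(interleaved, capacity=4)
isolated_misses = count_misses(one_at_a_time, capacity=4)
```

Under these assumptions, every one of the 48 interleaved requests misses, because each pipeline's lines are evicted by the other pipelines before they are reused, whereas the same requests incur only 16 misses when the pipelines do not compete for the cache.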