1. Field of the Invention
This invention relates generally to executing programs that access a cache memory and more specifically to a system and method for subdividing such programs to efficiently utilize the cache memory.
2. Description of the Related Art
A typical computer system includes a general purpose central processing unit (CPU) that executes application program instructions and generates output data that is to be displayed on a monitor. When performing computer graphics functions, the output data generated by the CPU typically consists of a high level description of a scene to be displayed, and other high level information such as from where the scene is to be viewed, and various details aimed at adding realism to the scene, such as textures to be applied to different elements in the scene.
In order to prepare this relatively complex scene, the output data from the CPU is typically processed by a graphics processing unit (GPU). Modern GPUs commonly include a processing unit, referred to as a shader unit, that is generally responsible for defining the final surface properties of objects visible in the displayed scene. For example, the shader unit can define the color, reflectivity, and translucency of object surfaces. To provide developers with flexibility to create sophisticated (e.g., highly realistic) graphics, shader units are often programmable. In other words, the shader units are capable of executing shader programs that consist of a number of program instructions that specify exactly how the final object surfaces are generated.
FIG. 1 illustrates an exemplary GPU shader unit 100 configured to execute multiple threads 110 of shader program instructions. The shader unit 100 may be programmed to operate on surface, primitive, vertex, fragment, pixel, sample or any other data. For simplicity, the term “samples” is hereinafter used to refer to any such surfaces, primitives, vertices, pixels, fragments or the like, operated on by the shader unit 100. The shader unit 100 may perform the instruction threads 110 on multiple samples in a pipelined manner, with different instructions performed on the samples as they travel down the pipeline. As illustrated, the instructions are generally classified as math operations 112 and texture fetch operations 114. The math operations 112 typically include various instructions to blend one or more texture values, fetched with the texture fetch operations 114, from texture memory to generate an output sample color.
One obstacle faced by developers is the general requirement of extremely high bandwidth to memory utilized by the shader program, as there are commonly millions of accesses to memory per second. In order to provide this bandwidth and improve performance when executing the shader program instructions, the GPU often utilizes a relatively small high speed local texture cache 120. As with all cache systems, locality is a major performance issue in texture cache systems. If a desired piece of texture data is not in the texture cache 120, latency penalties are incurred to transfer the texture data from (main) memory.
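The effect of locality on a small cache can be illustrated with a minimal sketch. The model below is an assumption made for illustration only: the class name TextureCache, the least-recently-used replacement policy, the capacity of two blocks, and the access sequences are all hypothetical and are not taken from any particular GPU.

```python
from collections import OrderedDict


class TextureCache:
    """Toy LRU cache keyed by texture block id (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def fetch(self, block_id):
        """Return True on a cache hit, False on a miss."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return True
        # Miss: the block would have to be transferred from main memory.
        self.misses += 1
        self.blocks[block_id] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return False


def count_misses(accesses, capacity):
    """Replay a sequence of block ids and return the number of misses."""
    cache = TextureCache(capacity)
    for block_id in accesses:
        cache.fetch(block_id)
    return cache.misses


# Assumed access patterns: the same four blocks, visited in different orders.
local = [0, 0, 1, 1, 0, 1, 2, 2, 3, 3]      # neighbouring fetches reuse blocks
scattered = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]  # each block is evicted before reuse

print(count_misses(local, capacity=2))      # 4
print(count_misses(scattered, capacity=2))  # 10
```

In this toy model both sequences contain ten fetches over the same four blocks, yet the scattered sequence misses on every fetch because each block is evicted before it is used again.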
According to some shader architectures, execution of instructions in a shader program may be relatively constrained in an effort to make efficient use of the texture cache 120. For example, according to some architectures, several samples may be gathered and common instructions executed for all the samples. In this case, texture data fetched from the same texture ID may be used in instructions on several of the samples. As an example, because the samples may have come from a tile (e.g., an 8×8 tile of samples), instructions may be executed on one sample, and then on adjacent samples, in a very simple order. While a current texture fetch 114 is in process, subsequent instructions may be prohibited from executing, thereby preventing different texture fetches 114 from fighting for the texture cache 120.
Due to this constraint, this architecture may experience a very ordered pattern of memory fetches to place required texture data in the texture cache 120, resulting in generally efficient use of the texture cache 120 (i.e., with relatively few cache misses). Unfortunately, because each instruction is typically performed monolithically in such a constrained architecture, overall performance of the system may suffer. For example, in some cases, some math instructions may be prohibited from executing while waiting for completion of a texture fetch even though these math instructions may not be dependent on the fetched texture.
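The cost of stalling independent math instructions behind a texture fetch can be sketched with a simple cycle count. Everything in the sketch below is an assumption made for illustration: the instruction tuples, the latencies (ten cycles for a fetch, one cycle for a math operation, one cycle to issue a fetch), and the function names are hypothetical and do not describe any particular shader architecture.

```python
# Each instruction is (kind, cost_in_cycles, needs_fetch_result).
# The latencies are assumed values chosen only to make the effect visible.
PROGRAM = [
    ("tex", 10, False),   # texture fetch
    ("math", 1, False),   # math that does not use the fetched texture
    ("math", 1, False),   # math that does not use the fetched texture
    ("math", 1, True),    # math that blends the fetched texture value
]


def blocking_cycles(program):
    """Constrained model: nothing issues until the prior instruction is done."""
    return sum(cost for _kind, cost, _needs_fetch in program)


def overlapped_cycles(program):
    """Relaxed model: independent math may run while a fetch is in flight."""
    clock = 0        # cycle at which the next instruction may issue
    fetch_done = 0   # cycle at which the latest fetch result is available
    for kind, cost, needs_fetch in program:
        if kind == "tex":
            fetch_done = clock + cost  # fetch completes in the background
            clock += 1                 # issuing the fetch takes one cycle
        else:
            # Dependent math must wait for the fetch; independent math need not.
            start = max(clock, fetch_done) if needs_fetch else clock
            clock = start + cost
    return max(clock, fetch_done)


print(blocking_cycles(PROGRAM))    # 13
print(overlapped_cycles(PROGRAM))  # 11
```

Under these assumed latencies, the two independent math instructions are hidden behind the fetch in the relaxed model, which is the opportunity a fully constrained architecture gives up.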
Some architectures may remove these constraints and allow instructions of a shader program to be performed on samples in a somewhat random order, as they are presented. As an example, a first sample may be at a later point in the shader program, while a second sample is at an earlier point (e.g., because the samples may be processed in scan line order), which may result in some advantages. For example, while operations on one sample are stalled while waiting on a dependent texture fetch, operations for another sample that are not dependent may be able to proceed. Unfortunately, texture fetches may occur somewhat randomly when executing instructions in this unconstrained manner, leading to an increased number of cache misses. In other words, gains in processing efficiency by reducing stalls in execution may be offset by increased latencies as texture data is continuously flushed from the texture cache 120 and re-fetched as multiple threads of instructions fight for cache space. While increasing the cache size may reduce this fighting, that approach is typically not desirable, as it would consume precious silicon real estate on the GPU.
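The contrast between ordered and unconstrained fetch patterns can be sketched as follows. The model is an assumption made for illustration: eight samples, a shader program with four texture fetches, a cache holding two blocks with least-recently-used replacement, and a rule that groups of four adjacent samples share a texture block are all hypothetical choices, as are the function names.

```python
from collections import OrderedDict

N_SAMPLES = 8   # assumed number of samples in flight
N_FETCHES = 4   # assumed number of texture fetches in the shader program
CAPACITY = 2    # assumed cache capacity, in texture blocks


def block_for(fetch_index, sample):
    """Assumed mapping: four adjacent samples share one block per texture."""
    return (fetch_index, sample // 4)


def lru_misses(accesses, capacity):
    """Count misses for an access sequence in a toy LRU cache."""
    cache = OrderedDict()
    misses = 0
    for key in accesses:
        if key in cache:
            cache.move_to_end(key)         # refresh on hit
        else:
            misses += 1
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses


def constrained_order():
    """Every sample executes fetch t before any sample executes fetch t + 1."""
    return [block_for(t, s) for t in range(N_FETCHES) for s in range(N_SAMPLES)]


def staggered_order():
    """Sample s starts s steps late, so samples sit at different program points."""
    accesses = []
    for step in range(N_FETCHES + N_SAMPLES - 1):
        for s in range(N_SAMPLES):
            t = step - s
            if 0 <= t < N_FETCHES:
                accesses.append(block_for(t, s))
    return accesses


ordered = lru_misses(constrained_order(), CAPACITY)
staggered = lru_misses(staggered_order(), CAPACITY)
print(ordered, staggered)
```

Both orders issue the same thirty-two fetches over the same eight blocks. In the constrained order each block is loaded once, whereas in the staggered order blocks belonging to different fetch instructions evict one another and must be re-fetched.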
Accordingly, what is needed is a technique for applying limited constraints when executing a shader program, preferably one that allows efficient cache usage while reducing the performance impact of execution stalls.