1. Field of the Invention
Embodiments of the present invention relate generally to computer graphics and, more specifically, to observing shader performance using registers to track event occurrences.
2. Description of the Related Art
A shader engine is a hardware unit within a graphics processing unit (GPU) responsible for the execution of shader programs. Some types of shader engines provide shading functionality that produces “shaded” pixel data. Other types of shader engines process vertex and geometry data, as opposed to pixel data. The output of these shader engines is generally written to a GPU memory. As GPU designs advance and become more complex, the design of the shader engine also becomes more complex. A shader engine commonly comprises two or more individual shader units configured to function in parallel, thereby increasing the throughput of data processed by the shader engine. Furthermore, each shader unit may allow the execution of multiple instances of a shader program (also known as threads) to increase the parallel processing capability of the shader engine. This complex and parallel shader architecture enables relatively large amounts of data to be efficiently processed.
Typical shader unit architectures may include functional blocks (also referred to herein as “functional units”) such as a program controller, an instruction cache, an address generator, and one or more execution pipes. Each functional block may, in turn, have many complex processing stages as well. In operation, the various instructions making up the different threads executing within the shader unit are fetched one by one by the program controller. These instructions are decoded, and then the data associated with each of the threads are processed in the execution pipes in accordance with the decoded instructions.
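By way of illustration only, the fetch, decode, and execute sequence described above may be sketched in software as follows. The sketch is a simplified model and not a description of any actual shader unit: the two-opcode instruction set, the round-robin ordering among threads, and the names decode, execute, and run_shader_unit are all hypothetical and are introduced here solely for explanation.

```python
from collections import deque


def decode(instruction):
    """Split a textual instruction such as 'ADD 2' into (opcode, operand).

    The textual encoding is a hypothetical stand-in for a real decoder.
    """
    opcode, operand = instruction.split()
    return opcode, int(operand)


def execute(opcode, operand, value):
    """Stand-in for an execution pipe: apply one decoded instruction."""
    if opcode == "ADD":
        return value + operand
    if opcode == "MUL":
        return value * operand
    raise ValueError(f"unknown opcode: {opcode}")


def run_shader_unit(threads):
    """Model a program controller that fetches one instruction at a time.

    `threads` maps a thread id to (program, initial_value), where program
    is a list of instruction strings. Round-robin scheduling is an
    assumption of this sketch. Returns the final value for each thread.
    """
    state = {tid: value for tid, (_, value) in threads.items()}
    pc = {tid: 0 for tid in threads}  # per-thread program counter
    ready = deque(threads)            # threads that may still have work
    while ready:
        tid = ready.popleft()
        program = threads[tid][0]
        if pc[tid] >= len(program):
            continue  # thread has finished; do not reschedule it
        opcode, operand = decode(program[pc[tid]])         # fetch and decode
        state[tid] = execute(opcode, operand, state[tid])  # execute
        pc[tid] += 1
        ready.append(tid)
    return state
```

In this model, calling run_shader_unit with two threads that run different programs interleaves their instructions, which mirrors the situation in which several threads share the functional blocks of a single shader unit.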
One drawback to this type of shader architecture is that performance optimization of a given shader unit design or a shader program is difficult. First, as previously mentioned, the shader unit contains many functional blocks, each of which may be configured in multiple ways. For example, there may be several ways to configure a given functional block that provide substantially the same functionality; however, each configuration may have a different data throughput rate. Since there are so many different configurations for the overall shader unit, determining the optimal configuration for the unit (e.g., the configuration that provides the highest pixel throughput) is quite difficult and requires a shader designer to characterize and compare the performance of the many different shader unit configurations during the design process. Second, although the pixel throughput of a given shader unit configuration can be measured, there is no straightforward way to observe what happens within the different functional blocks of a shader unit when one or more threads are executed within the shader unit. This task is made more difficult by the fact that each thread that is executed within the shader unit may be an instance of a different shader program or even a different type of shader program. Thus, acquiring the information necessary to evaluate the performance of any particular shader unit or shader program configuration is also quite difficult.
As the foregoing illustrates, what is needed in the art is an improved means for observing performance within a shader unit to facilitate shader unit and shader program design.