The present invention relates in general to multithreaded microprocessors, and in particular to dispatching instructions for execution in a multithreaded microprocessor without regard to order among threads.
To meet the needs of video gamers, simulation creators, and other program designers, sophisticated graphics co-processors have been developed for a variety of computer systems. These processors, which generally operate under the control of a general-purpose central processing unit or other master processor, are typically optimized to perform transformations of scene data into pixels of an image that can be displayed on a standard raster-based display device. In a common configuration, the graphics processor is provided with “geometry data,” which usually includes a set of primitives (e.g., lines, triangles, or other polygons) representing objects in a scene to be rendered, along with additional data such as textures, lighting models, and the like. The graphics processor performs modeling, viewpoint, perspective, lighting, and similar transformations on the geometry data (this stage is often referred to as “vertex” processing). After these transformations, “pixel” processing begins. During pixel processing, the geometry data is converted to raster data, which generally includes color values and other information for each sample location in an array corresponding to the viewable area; further transformations may be applied to the raster data, including texture blending and downfiltering (reducing the number of sample locations to correspond to the number of pixels in the display device). The end result is a set of color values that can be provided to the display device.
To provide smooth animations and a real-time response, graphics processors are generally required to complete these operations for a new frame of pixel data at a minimum rate of about 30 Hz, i.e., within roughly 33 milliseconds per frame. As images become more realistic, with more primitives, more detailed textures, and so on, the performance demands on graphics processors increase.
To help meet these demands, some existing graphics processors implement a multithreaded architecture that exploits parallelism. As an example, during vertex processing, the same operations are usually performed for each vertex; similarly, during pixel processing, the same operations are usually performed for each sample location or pixel location. Operations on the various vertices (or pixels) tend to be independent of operations on other vertices (pixels); thus, each vertex (pixel) can be processed as a separate thread executing a common program. The common program provides a sequence of instructions to execution units in an execution core of the graphics processor; at a given time, different threads may be at different points in the program sequence. Since the execution time (referred to herein as latency) of an instruction may be longer than one clock cycle, the execution units are generally implemented in a pipelined fashion so that a second instruction can be issued before all preceding instructions have finished, as long as the second instruction does not require data resulting from the execution of an instruction that has not finished.
In such processors, the execution core is generally designed to fetch instructions to be executed for the different active threads in a round-robin fashion (i.e., one instruction from the first thread, then one from the second, and so on) and present each fetched instruction sequentially to an issue control circuit. The issue control circuit holds the fetched instruction until its source data is available and the execution units are ready, then issues it to the execution units. Since the threads are independent, round-robin issue reduces the likelihood that an instruction will depend on a result of a still-executing instruction. Thus, latency of an instruction in one thread can be hidden by fetching and issuing an instruction from another thread. For instance, a typical instruction might have a latency of 20 clock cycles, which could be hidden if the core supports 20 threads.
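The latency-hiding arithmetic in the preceding paragraph can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the processors described above: one instruction is issued per clock cycle, every instruction has the same latency, and each instruction depends on the immediately preceding instruction of the same thread (a worst case). The function name `simulate_round_robin` and its parameters are hypothetical.

```python
def simulate_round_robin(num_threads, latency, instrs_per_thread):
    """Return (total_cycles, idle_cycles) for in-order round-robin issue.

    Assumed model: one issue slot per cycle, uniform latency, and each
    instruction needs the result of the previous instruction in its thread.
    """
    # Cycle at which each thread's most recent result becomes available.
    ready_at = [0] * num_threads
    cycle = 0
    idle = 0
    for _ in range(instrs_per_thread):
        for t in range(num_threads):
            if ready_at[t] > cycle:
                # The issue circuit holds this instruction; nothing else issues.
                idle += ready_at[t] - cycle
                cycle = ready_at[t]
            ready_at[t] = cycle + latency
            cycle += 1  # the issue itself occupies one cycle
    return cycle, idle


if __name__ == "__main__":
    print(simulate_round_robin(20, 20, 5))  # 20 threads, latency 20
    print(simulate_round_robin(10, 20, 3))  # 10 threads, latency 20
```

Under this model, 20 threads fully cover a 20-cycle latency (no idle cycles), whereas 10 threads leave the issue circuit waiting about 10 cycles on each pass through the threads after the first.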
However, round-robin issue does not always hide the latency. For example, pixel processing programs often include instructions to fetch texture data from system memory. Such an instruction may have a very long latency (e.g., over 100 clock cycles). After a texture fetch instruction is issued for a first thread, the issue control circuit may continue to issue instructions (including subsequent instructions from the first thread that do not depend on the texture fetch instruction) until it comes to an instruction from the first thread that requires the texture data. This instruction cannot be issued until the texture fetch instruction completes. Because fetched instructions are presented to the issue control circuit sequentially, instructions from other threads that follow the blocked instruction cannot be issued either, even if their source data is available. Accordingly, the issue control circuit stops issuing instructions and waits for the texture fetch instruction to be completed before beginning to issue instructions again. Thus, “bubbles” can arise in the execution pipeline, leading to idle time for the execution units and inefficiency in the processor.
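The formation of bubbles behind a long-latency texture fetch can be sketched in the same style. The model below is an illustrative assumption, not a description of any particular issue control circuit: all threads run the same short program, instructions are issued strictly in round-robin order at one per cycle, and an instruction whose source data is not yet available blocks everything behind it. The instruction encoding and the name `count_bubbles` are hypothetical, as are the latencies in the example.

```python
from typing import List, Optional, Tuple

# Each instruction: (latency in cycles, index of the earlier instruction in
# the same thread whose result it needs, or None if it has no dependency).
Instr = Tuple[int, Optional[int]]


def count_bubbles(program: List[Instr], num_threads: int) -> Tuple[int, int]:
    """Return (total_cycles, bubble_cycles) for in-order round-robin issue."""
    # done_at[t][i]: cycle at which instruction i of thread t completes.
    done_at = [[0] * len(program) for _ in range(num_threads)]
    cycle = 0
    bubbles = 0
    for pc, (latency, dep) in enumerate(program):
        for t in range(num_threads):
            if dep is not None and done_at[t][dep] > cycle:
                # Source data is missing: the whole pipeline waits.
                bubbles += done_at[t][dep] - cycle
                cycle = done_at[t][dep]
            done_at[t][pc] = cycle + latency
            cycle += 1
    return cycle, bubbles


if __name__ == "__main__":
    # Texture fetch (assumed 100 cycles), an independent operation, then an
    # operation that consumes the fetched texture data.
    program = [(100, None), (4, None), (4, 0)]
    print(count_bubbles(program, num_threads=20))
```

With 20 threads and the assumed 100-cycle fetch, the independent instructions keep the pipeline busy for only 40 cycles, so the first dependent instruction stalls issue for 60 cycles. Shortening the fetch latency to 20 cycles in the same model removes the stall entirely.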
One way to reduce this inefficiency is by increasing the number of threads that can be executed concurrently by the core. This, however, is an expensive solution because each thread requires additional circuitry. For example, to accommodate the frequent thread switching that occurs in this parallel design, each thread is generally provided with its own dedicated set of data registers. Increasing the number of threads increases the number of registers required, which can add significantly to the cost of the processor chip, the complexity of the design, and the overall chip area. Other circuitry for supporting multiple threads, e.g., program counter control logic that maintains a program counter for each thread, also becomes more complex and consumes more area as the number of threads increases.
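The cost argument above amounts to simple arithmetic: if every thread has its own dedicated registers, register storage grows linearly with the thread count. The sketch below makes that explicit. The per-thread register count and register width are placeholder values chosen only for illustration; they are not taken from any actual processor, and `register_file_bits` is a hypothetical helper.

```python
def register_file_bits(num_threads, regs_per_thread=32, bits_per_reg=128):
    """Total register storage, in bits, when each thread has dedicated registers.

    regs_per_thread and bits_per_reg are illustrative assumptions.
    """
    return num_threads * regs_per_thread * bits_per_reg


if __name__ == "__main__":
    small = register_file_bits(20)    # enough threads to cover ~20 cycles
    large = register_file_bits(100)   # enough threads to cover ~100 cycles
    print(small, large, large / small)
```

Under these assumed sizes, covering a 100-cycle latency purely by adding threads would require five times the register storage needed to cover a 20-cycle latency, before counting the additional per-thread control logic.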
It would therefore be desirable to provide an execution core architecture that efficiently and effectively reduces the occurrence of bubbles in the execution pipeline without requiring substantial increases in chip area.