Recent advances in computer performance have enabled graphic systems to provide more realistic graphical images using personal computers and home video game computers. In such graphic systems, a number of procedures are executed to “render” or draw graphic primitives to the screen of the system. A “graphic primitive” is a basic component of a graphic picture, such as a vertex, polygon, or the like. All graphic pictures are formed with combinations of these graphic primitives. Many procedures may be utilized to perform graphic primitive rendering.
Specialized graphics processing units (GPUs) have been developed to optimize the computations required in executing the graphics rendering procedures. The GPUs are configured for high-speed operation and typically incorporate one or more rendering pipelines. The hardware of a typical GPU's rendering pipeline(s) is optimized to support an essentially linear topology, where instructions are fed into the front end of the pipeline and the computed results emerge at the back end of the pipeline. For example, typical prior art linear pipelines tightly couple instruction fetch operations with the resulting calculation operations. Even with parallel instruction fetches, the calculations are tightly coupled with their corresponding instruction fetches.
To maximize throughput and overall rendering speed, the pipeline architecture is such that the execution hardware of the pipeline is non-stallable. This means intermediate results within the pipeline advance step-by-step through the pipeline with successive clock cycles. The pipeline cannot be stalled by means of wait states or the like. Consequently, the front end of the pipeline, or the instruction fetch portion, is similarly a non-stallable instruction fetch pipeline, where once an address is issued, the instruction fetch will occur and the resulting instruction must either be used or thrown away.
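The non-stallable behavior described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not a description of any actual GPU hardware: the names `NonStallableFetch`, `run`, `depth`, and `buffer_capacity` are hypothetical, and the downstream instruction storage is modeled as a simple bounded list. Each slot advances one stage per clock, and an instruction that reaches the end of the fetch pipeline is either stored or thrown away.

```python
from collections import deque


class NonStallableFetch:
    """Toy fetch pipeline (hypothetical): every slot advances each clock."""

    def __init__(self, depth):
        # One slot per pipeline stage; None marks an empty slot (a bubble).
        self.stages = deque([None] * depth, maxlen=depth)

    def clock(self, new_address=None):
        """Advance one cycle and return whatever leaves the last stage."""
        leaving = self.stages[-1]
        # The deque is always full, so appendleft pushes out the last slot.
        self.stages.appendleft(new_address)
        return leaving


def run(addresses, depth, buffer_capacity):
    """Issue the addresses, then report which fetches were used or dropped."""
    pipe = NonStallableFetch(depth)
    buffer, dropped = [], []
    # Extra empty cycles at the end drain the pipeline.
    for addr in list(addresses) + [None] * depth:
        out = pipe.clock(addr)
        if out is None:
            continue
        if len(buffer) < buffer_capacity:
            buffer.append(out)   # the fetched instruction is used
        else:
            dropped.append(out)  # no room and no wait states: thrown away
    return buffer, dropped
```

In this sketch, calling `run([0, 1, 2, 3], depth=2, buffer_capacity=3)` stores the first three fetched instructions and discards the fourth, because the pipeline has no way to hold it back until storage becomes available.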
Furthermore, in a multithreaded execution environment, the issue stage of the pipeline (e.g., the pipeline portion typically just below the fetch stage) has a finite amount of instruction storage per processor thread. This storage is used to keep instructions available for all threads executing at all times.
A problem arises because, for a non-stallable pipeline, the finite amount of instruction storage per thread is often insufficient to avoid starving subsequent stages of instructions unless usable instructions are already in flight from the instruction cache to the storage of the issue stage. For example, modern GPUs support branches and the like in shader programs executing on the GPU. Branching has the effect of flushing all instructions in the pipeline. Similarly, data dependencies and other types of data hazards often result in stalling the instruction fetch stage, having the effect of backing up the pipeline.
One prior art solution was to keep track of how many instructions had been fetched per thread, and then to fetch a new instruction whenever an instruction was issued. This solution can be inefficient because of timing problems: the decision of whether or not to issue (and which thread to issue) can happen late in a cycle, and an instruction fetch state machine typically cannot recover in time to select some other instruction. This solution also had problems when a branch was taken. For example, there was no mechanism to adjust the priority of the flushed thread. Thus, what is required is an efficient mechanism for fetching instructions for non-stalling pipelines.
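The prior art scheme of fetching one instruction per issued instruction can be sketched as follows. This is a simplified model under several assumptions that do not come from the description above: the class name `IssueTriggeredFetcher`, the first-in, first-out fetch queue, the limit of one fetch per clock, and the choice to refill a flushed thread through that same queue are all hypothetical. The sketch shows only how a flushed thread can sit starved behind requests from other threads when nothing raises its priority.

```python
from collections import deque


class IssueTriggeredFetcher:
    """Hypothetical model: each issue triggers one replacement fetch."""

    def __init__(self, num_threads, slots_per_thread):
        self.slots = slots_per_thread
        # Instructions currently buffered per thread; buffers start full.
        self.buffered = [slots_per_thread] * num_threads
        # Thread ids awaiting a fetch, served strictly in request order.
        self.fetch_queue = deque()

    def issue(self, tid):
        """Issue one instruction from thread tid; False means starved."""
        if self.buffered[tid] == 0:
            return False
        self.buffered[tid] -= 1
        self.fetch_queue.append(tid)  # one issue, one replacement fetch
        return True

    def branch_taken(self, tid):
        """Flush thread tid and request a full refill at normal priority."""
        self.buffered[tid] = 0
        # Pending fetches for the old path are no longer useful.
        self.fetch_queue = deque(t for t in self.fetch_queue if t != tid)
        # Assumption: refill requests join the back of the queue.
        self.fetch_queue.extend([tid] * self.slots)

    def clock(self):
        """Service at most one fetch request per cycle."""
        if self.fetch_queue:
            self.buffered[self.fetch_queue.popleft()] += 1
```

With two threads and two slots each, issuing once from each thread and then taking a branch on thread 0 leaves thread 0 unable to issue until the earlier request from thread 1 has been serviced, even though thread 0 is the thread with empty storage.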