1. Field of the Invention
This invention relates to processors and, more particularly, to fine-grained multithreaded execution within a processor.
2. Description of the Related Art
Many conventional processor implementations attempt to increase performance by increasing the number of instructions the processor can concurrently execute from a single execution thread. For example, typical superscalar processor architectures include multiple execution units, such as load/store units, arithmetic logic units, branch processing units, etc. If such a superscalar processor can identify sufficient instruction-level parallelism within a given execution thread, it may correspondingly improve performance by executing those instructions in parallel in the multiple execution units.
However, extracting additional parallelism from a single thread has proven to be a difficult problem. Conditional branches in code make it hard to predict the path from which instructions should be fetched and issued, and speeding instruction execution using superscalar techniques offers little benefit if the instructions executed in parallel were fetched from an incorrectly predicted path. Consequently, considerable design effort and implementation area are often devoted to branch prediction in superscalar architectures in order to keep execution units busy.
Although branches may be predicted successfully at least some of the time, predictors are often considerably less useful in resolving the problem of memory latency. Most superscalar processors include local caches to provide rapid access to instructions and data. However, such caches inevitably miss on some accesses, incurring substantial delays while the processor accesses more distant caches or system memory to satisfy its memory request. Such delays may effectively stall or starve a conventional single-threaded superscalar processor, such that over time, the average utilization of processor resources is poor relative to the processor's peak throughput capability.
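The effect of miss latency on average utilization can be illustrated with a rough back-of-the-envelope model. The sketch below is illustrative only: the function name `average_utilization`, the assumption that the processor stalls completely for the full duration of every miss, and all numeric values are hypothetical and are not taken from any particular processor.

```python
def average_utilization(peak_ipc, misses_per_instruction, miss_penalty_cycles):
    """Fraction of peak throughput sustained by a single-threaded processor.

    Simplifying assumption (hypothetical model): the processor issues at
    peak_ipc whenever it is not waiting on a miss, and it stalls completely
    for miss_penalty_cycles on every miss.
    """
    base_cpi = 1.0 / peak_ipc                                # cycles per instruction with no misses
    stall_cpi = misses_per_instruction * miss_penalty_cycles  # added stall cycles per instruction
    effective_ipc = 1.0 / (base_cpi + stall_cpi)
    return effective_ipc / peak_ipc


# Illustrative numbers only: a 4-wide processor, 1 miss per 100 instructions,
# and a 200-cycle penalty to reach system memory.
utilization = average_utilization(peak_ipc=4, misses_per_instruction=0.01,
                                  miss_penalty_cycles=200)
print(f"average utilization: {utilization:.1%}")  # about 11% of peak
```

Under these assumed values, each instruction costs 0.25 cycles of issue time plus an average of 2 cycles of stall, so the processor sustains only about one ninth of its peak throughput even though misses are relatively rare.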