There are a number of techniques for increasing throughput in a central processing unit (CPU). One is to increase instruction-level parallelism by using a superscalar architecture. This increases the performance of a single thread by allowing more than one instruction from the instruction stream to execute per clock cycle. Another is to increase thread-level parallelism by using a multi-core or simultaneous multi-threaded architecture, which can allow instructions from more than one thread to execute in parallel.
As the width of a superscalar architecture increases (i.e. as the number of instructions that can be executed per clock cycle increases), there are correspondingly more instructions in the pipelines at any one time that can affect program flow (e.g. branches). Moreover, a number of these branches are conditional and it is difficult to know their outcome for certain until preceding instructions have progressed further down the pipeline. Therefore, to maintain increased throughput, the outcome of each branch is predicted using a speculative technique known as branch prediction. Typically, the wider the superscalar processor, the more speculative the predictions. While correct predictions can dramatically increase the instruction throughput, incorrectly predicted instructions not only fail to contribute to the instruction throughput but also tie up valuable resources. Furthermore, achieving good prediction accuracy requires large branch prediction hardware.
Despite these drawbacks, branch prediction and other speculative techniques are important for good single-threaded throughput in a superscalar processor.
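By way of illustration only, the following Python sketch models one well-known form of branch prediction, a table of two-bit saturating counters indexed by the low bits of the branch address. It is not taken from any particular processor. The names TwoBitPredictor, table_bits and accuracy, the table size and the initial counter value are all assumptions chosen for the example.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (hypothetical example model)."""

    def __init__(self, table_bits=4):
        # Assumed table size: 2**table_bits entries, indexed by low PC bits.
        self.mask = (1 << table_bits) - 1
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        # Assumed initial state: 1 (weakly not-taken).
        self.counters = [1] * (1 << table_bits)

    def predict(self, pc):
        """Return True if the branch at address pc is predicted taken."""
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """Fraction of (pc, taken) pairs in trace that are predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# A loop-closing branch: taken nine times, then not taken, repeated ten times.
loop_trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 10
loop_accuracy = accuracy(TwoBitPredictor(), loop_trace)
```

In this sketch the loop exit is mispredicted on every pass through the loop. Each such misprediction corresponds to speculatively fetched instructions that must be discarded. A larger table reduces the chance of two branches sharing a counter, which is one reason the prediction hardware grows as higher accuracy is sought.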
Multi-threaded processors typically execute fewer instructions per thread per clock cycle, but can execute multiple instructions per clock cycle across a number of threads (an approach usually known as simultaneous multi-threading). Such processors can maintain a high overall instruction throughput with lower overall levels of speculation, because each thread is not attempting to run as far ahead, i.e. each thread has fewer instructions in progress at any one time.
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known processors.