As semiconductor technology continues to inch closer to practical limits on increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips or, to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
Modern processor architectures often rely on instruction units to fetch program instructions from a memory, decode the instructions, and then dispatch those instructions to one or more execution units for execution. Pipelining may be used in such instruction units to maximize instruction throughput, and moreover, multithreading may be used to enable multiple instructions (either from multiple instruction streams or from the same instruction stream) to be dispatched to different execution units within a processor during a given cycle. In addition, predictive logic such as branch prediction may be used to guess in advance which instructions will be needed from a given instruction stream, so that those instructions can be fetched earlier. Doing so minimizes the delay that would otherwise occur if those instructions were retrieved from memory only after the condition for a conditional branch was actually tested.
The goal of an instruction unit is to provide instructions to execution units as quickly as possible to maximize instruction throughput, and thus the overall performance of the processor. In this regard, many instruction units incorporate instruction buffers, which are high-speed dedicated memory arrays that temporarily store instructions awaiting execution. Given that a processor typically executes instructions at a much faster rate than they can be retrieved from a memory, an instruction buffer serves to maintain a pool of instructions awaiting execution so that the execution units are starved of instructions as infrequently as possible.
In general, the larger the instruction buffer, the less likely it is that the execution units will be starved of instructions. Assuming branches in instruction streams are correctly predicted, and assuming enough instructions are maintained in the instruction buffer to cover any misses to the instruction cache that feeds the instruction unit, the flow of instructions to the execution units will be maximized.
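By way of illustration only, the relationship between buffer occupancy and starvation during an instruction cache miss may be sketched as follows. The function name, the single-instruction-per-cycle dispatch rate, and the assumption that no new instructions arrive until the miss is resolved are all hypothetical simplifications and are not drawn from any particular processor design.

```python
def starved_cycles_during_miss(buffered: int, miss_latency: int) -> int:
    """Count cycles with nothing to dispatch while a cache miss is outstanding.

    Hypothetical model: one instruction is dispatched per cycle, and no new
    instructions enter the buffer until the miss has been resolved.
    """
    if buffered < 0 or miss_latency < 0:
        raise ValueError("arguments must be non-negative")
    starved = 0
    for _ in range(miss_latency):
        if buffered > 0:
            buffered -= 1   # an instruction is available, so dispatch it
        else:
            starved += 1    # buffer is empty, so the execution unit idles
    return starved


# A buffer holding at least as many instructions as the miss lasts hides it.
print(starved_cycles_during_miss(buffered=8, miss_latency=5))
print(starved_cycles_during_miss(buffered=3, miss_latency=5))
```

Under these assumptions, a miss is fully hidden whenever the number of buffered instructions is at least the miss latency in cycles, and otherwise the shortfall appears as starved cycles.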
A larger instruction buffer, however, does not come without drawbacks. First, larger instruction buffers require more circuitry, and thus increase power consumption and take up valuable real estate on a chip. Second, the performance penalty that occurs whenever an instruction buffer needs to be flushed (e.g., due to a branch misprediction) increases with the size of the instruction buffer. Instruction buffers are fundamentally similar to shift registers, and as such, a new instruction added to an empty instruction buffer may require several clock cycles to progress to the end of the instruction buffer, where it is ready to be dispatched to an execution unit, thereby leaving several cycles in which the execution units are starved of work.
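The flush penalty described above may be illustrated with a minimal sketch that models an instruction buffer as a simple shift register. The function name and the model itself (one instruction entering per cycle, every entry advancing one slot per cycle, dispatch only from the final slot) are assumptions made for illustration and do not represent any specific buffer implementation.

```python
def starved_cycles_after_flush(depth: int) -> int:
    """Cycles with an empty dispatch slot while refilling a flushed buffer.

    Hypothetical model: one instruction enters per cycle, and every entry
    shifts one slot toward the dispatch end per cycle.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    slots = [None] * depth  # slots[0] is the entry end, slots[-1] the dispatch end
    starved = 0
    next_instruction = 0
    while True:
        # Shift all entries one slot toward the dispatch end and insert a new one.
        slots = [next_instruction] + slots[:-1]
        next_instruction += 1
        if slots[-1] is not None:
            return starved  # the first instruction has reached the dispatch slot
        starved += 1        # dispatch slot is still empty this cycle


for depth in (1, 4, 8):
    print(depth, starved_cycles_after_flush(depth))
```

In this simplified model the number of starved cycles following a flush grows linearly with the depth of the buffer, which is one way to see why a deeper shift-register-style buffer is less tolerant of flushes.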
Therefore, a need exists in the art for an improved instruction buffer design that is more efficient and more tolerant of flushes.