Similar to an assembly line for producing a complex product in a manufacturing plant, pipelining is widely used in integrated circuit designs to improve performance. In processors, such performance improvement is achieved through parallelization of sequential computations, generally through separate computational elements. Being simpler, these individual computational elements can run at faster clock speeds, leading to performance gains.
Aggressive pipelining in synchronous systems does not always lead to more efficient designs, since heavily pipelined designs carry certain drawbacks. Deeper pipelines imply higher computation latencies, and scheduling operations through the many pipeline stages of a complex design such as a processor also complicates the controller design.
During execution of a series of operations, a change of program flow due to instructions such as a jump or a function call (or similar branching instructions) leads to loss of performance in pipelined processors. By the time a decision is made to begin executing instructions from a different memory address than the one currently being executed (i.e., a different branch), a number of instructions from the current instruction flow are already executing at various stages within the pipeline. These undesired instructions consume useful processor resources, resulting in a loss of performance generally termed the "branch penalty". Higher numbers of pipeline stages generally lead to higher branch penalties.
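The relationship between pipeline depth and branch penalty described above may be illustrated with a simplified model (the stage numbers and function are illustrative assumptions, not taken from the description): a branch that resolves at a given pipeline stage forces the discard of every speculative instruction occupying the stages behind it.

```python
# Simplified illustrative model: a branch resolving at stage `resolve_stage`
# wastes one cycle for each pipeline stage between fetch and resolution,
# since those stages hold instructions that must be discarded.
def branch_penalty_cycles(resolve_stage: int, fetch_stage: int = 1) -> int:
    """Cycles lost per mispredicted branch in this simple model."""
    return resolve_stage - fetch_stage

# Deeper pipelines resolve branches later, so the penalty grows:
shallow = branch_penalty_cycles(resolve_stage=4)    # 3 wasted cycles
deep = branch_penalty_cycles(resolve_stage=10)      # 9 wasted cycles
```

This is a first-order sketch; real processors may resolve different branch types at different stages.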
To avoid significant loss of processor performance due to branch penalties, most contemporary high performance processors utilize some type of branch prediction. One technique commonly employed involves predicting the result of a conditional branch at the instruction fetch stage based on the previous history of the branch instruction, which adds to the overall complexity of the processor design.
By way of example, a typical processor pipeline generalized for any modern processor is shown in FIG. 3. The elements shown are arranged according to the pipeline sequence (top down) rather than by layout within a processor. Within the pipeline design shown, branch instructions are first detected at the decode stage and subsequently processed by the branch processing unit, which (as shown) is generally found at the same pipeline stage as other execution units. The branch processing unit resolves the branch and computes a new address from which to fetch the next instruction(s). The new fetch address must then be communicated back to the fetch/pre-fetch unit, located at the top or beginning of the pipeline stages.
Due to the large gap between the fetch unit and the branch processing unit, the intermediate pipeline stages are filled with speculative instructions. Depending on the nature of the branch prediction scheme, these speculative instructions comprise some mix of sequential and target instructions. If the branch resolves as not taken, the sequential instructions in the pipeline may continue executing, whereas the target instructions must be dropped. If the branch resolves as taken, on the other hand, the target instructions proceed normally while the sequential instructions are aborted.
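The resolution behavior described above may be sketched as follows (the data representation and function names are hypothetical, chosen only to illustrate the squash decision): once the branch resolves, instructions from the correct path are retained and those from the wrong path are flushed.

```python
# Hypothetical sketch of squashing speculative instructions at branch
# resolution. Each in-flight instruction is labeled with the path it was
# fetched from: "sequential" (fall-through) or "target" (branch target).
def resolve_branch(in_flight: list[dict], taken: bool) -> tuple[list[dict], int]:
    """Keep instructions from the resolved path; count the flushed ones."""
    keep = "target" if taken else "sequential"
    survivors = [insn for insn in in_flight if insn["path"] == keep]
    flushed = len(in_flight) - len(survivors)
    return survivors, flushed

# Example: three fall-through instructions and two target-path instructions
# are in flight when the branch resolves as taken.
pipeline = [{"path": "sequential"}] * 3 + [{"path": "target"}] * 2
kept, dropped = resolve_branch(pipeline, taken=True)   # keeps 2, drops 3
```

The flushed count in this sketch corresponds to the branch penalty paid for that particular branch.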
Depending on the length of the pipeline, the branch penalty, in terms of performance lost by dropping sequential or target instructions from the pipeline, may be significant, requiring effective branch prediction schemes to minimize performance losses.
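The aggregate performance impact may be estimated with the standard effective-CPI relation (the parameter values below are assumed for illustration, not taken from the description): each mispredicted branch adds its penalty cycles, weighted by how often branches occur and how often they are mispredicted.

```python
# Back-of-the-envelope estimate of branch penalty's effect on throughput.
# effective CPI = base CPI + (branch frequency x misprediction rate x penalty)
def effective_cpi(base_cpi: float, branch_freq: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles

# Assumed workload: 20% branches, 10% mispredicted, 9-cycle penalty on a
# deep pipeline -> effective CPI rises from 1.0 to 1.18 (an 18% slowdown).
cpi = effective_cpi(base_cpi=1.0, branch_freq=0.20,
                    mispredict_rate=0.10, penalty_cycles=9)
```

Halving either the misprediction rate or the penalty halves the added term, which is why both deep pipelines and weak predictors are costly.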
While most modern processors achieve high performance through aggressive pipelining and complex designs with attendant high silicon area, packaging and cooling costs, these higher cost penalties for performance improvement are generally considered acceptable for mainstream processors employed in workstations. Other applications, however, may require high performance but also need to limit the size and/or complexity of the processor. For example, a network processing system which employs a cluster of processor cores (e.g., eight or more) may require a high performance processor design while limiting the complexity of the micro-architecture.
There is, therefore, a need in the art for instruction fetch and branch processing techniques that obtain high performance using a simple pipeline design without any branch prediction.