Microprocessors are electronic devices that can perform a variety of computing operations in response to instructions that implement a computer program. A common technique for implementing computer programs is to send multiple instructions in a predetermined sequence to an execution unit of the microprocessor. The execution unit executes the instructions in accordance with the predetermined sequence. The order in which instructions are executed is referred to as control flow. In some circumstances, control flow does not follow a linear path but jumps to an instruction that is not the next instruction in the sequence. Instructions that cause jumps in the control flow are referred to as control instructions; one example of such a control instruction is the branch instruction. When a branch instruction is executed, the execution unit jumps to the target instruction identified by the branch instruction and executes that instruction next, rather than the next sequential instruction.
Microprocessors often process instructions using several processing stages, which can include an instruction fetch stage, in which instructions are fetched from an instruction memory; an instruction decoding stage, in which the fetched instructions are decoded; an instruction execution stage, in which the instructions are executed; and a writeback stage, in which the results of the executed instructions are written back to a register file or other memory. To improve the speed at which multiple instructions are processed through the various stages, some microprocessors use a technique known as pipelining.
In a so-called pipelined processor comprising multiple processing stages, the multiple processing stages concurrently perform different parts of the process on several different instructions. In an exemplary pipelined processor, a first instruction is processed by the first stage of the pipelined processor in a first clock cycle. In a second clock cycle, the first instruction is processed by a second stage of the pipelined processor and a second instruction is processed by the first stage. In a third clock cycle, the first instruction is processed by a third stage of the pipelined processor, the second instruction is processed by the second stage, and a third instruction is processed by the first stage. This procedure continues until every instruction has been processed by all of the processing stages.
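The overlap of stages described above can be sketched as a small schedule computation. This is a hedged illustration only: the three stage names and the stage count are assumptions chosen for the example, not taken from the text.

```python
# Minimal sketch of an in-order pipeline schedule: each clock cycle,
# every instruction in flight advances one stage, so instruction i
# occupies stage s in cycle i + s (using 1-indexed cycles).

def pipeline_schedule(instructions, stages=("fetch", "decode", "execute")):
    """Return a list of (cycle, stage, instruction) events."""
    events = []
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            events.append((i + s + 1, stage, instr))
    return events

schedule = pipeline_schedule(["I1", "I2", "I3"])
# In cycle 3, all three stages are busy with different instructions,
# matching the third-clock-cycle situation described in the text.
```

In cycle 3 of this schedule, "I1" is in the execute stage, "I2" in the decode stage, and "I3" in the fetch stage, which is exactly the concurrency the paragraph above describes.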
In high-performance Central Processing Units (CPUs), pipelines with a large number of processing stages are needed to achieve a high clock frequency. These pipelined processors are efficient at executing linear code, i.e., instructions in a predetermined sequential order, but have difficulty executing branch instructions. The changes in control flow caused by branch instructions can lead to a substantial loss of performance. For example, an error in branch prediction, such as a wrong prediction of the next instruction, may cause the processor to stall. A wrong prediction may also cause the currently pipelined sequence of instructions to be thrown out or “flushed” from the pipelined processor, after which a new sequence of instructions must be loaded into the pipeline. A flush of the pipeline results in stall cycles for the execution stage and therefore degrades the performance of the processor. The number of stalls resulting from a flush is proportional to the pipeline depth and is referred to as the misprediction penalty.
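The proportionality between pipeline depth and misprediction penalty can be made concrete with a toy cycle count. All the numbers and the cost model below are illustrative assumptions (one flush is charged a number of stall cycles equal to the pipeline depth), not figures from the text.

```python
# Toy model of the misprediction penalty. Assumption for illustration:
# each mispredicted branch flushes the pipeline and costs a number of
# stall cycles equal to the pipeline depth.

def total_cycles(n_instructions, pipeline_depth, n_mispredictions):
    fill = pipeline_depth - 1                     # cycles to fill the pipeline once
    penalty = n_mispredictions * pipeline_depth   # stall cycles from flushes
    return n_instructions + fill + penalty

# A deeper pipeline pays a larger penalty for the same misprediction count:
shallow = total_cycles(1000, pipeline_depth=5, n_mispredictions=20)
deep = total_cycles(1000, pipeline_depth=15, n_mispredictions=20)
```

Under these assumed numbers the shallow pipeline spends 1104 cycles and the deep one 1314, showing how the same branches cost more as the pipeline grows deeper.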
To improve the performance of a pipelined processor, the first pipeline stages or “frontend” of the pipelined processor are often made speculative. This means that the frontend of the pipelined processor is configured to predict the next address in the instruction memory at which to continue fetching instructions without knowing the architectural state (i.e. the state of the processor after execution of all previous instructions) or even having fully decoded the already fetched instructions.
One of the main functions of a speculative frontend is the prediction of branches. Branch prediction consists of two tasks: 1) predicting if there is a branch that should be taken, and 2) predicting where to branch to, i.e., predicting the branch target.
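The two prediction tasks can be sketched as two small lookup tables. This is a hedged sketch, not the mechanism of any of the cited references: the use of a 2-bit saturating counter for the direction and a branch target buffer keyed by the branch address are common textbook choices assumed here for illustration.

```python
# Sketch of the two branch prediction tasks:
#  task 1 (taken or not) -> a 2-bit saturating counter per branch address;
#  task 2 (branch target) -> a branch target buffer (BTB) keyed by address.

class BranchPredictor:
    def __init__(self):
        self.counters = {}   # pc -> 0..3; values >= 2 mean "predict taken"
        self.btb = {}        # pc -> last observed target address

    def predict(self, pc):
        taken = self.counters.get(pc, 0) >= 2
        target = self.btb.get(pc)     # None if this branch was never seen
        return taken, target

    def update(self, pc, taken, target):
        # Saturating increment on taken, decrement on not taken.
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)
        if taken:
            self.btb[pc] = target

bp = BranchPredictor()
bp.update(0x40, taken=True, target=0x80)
bp.update(0x40, taken=True, target=0x80)
```

After two taken outcomes, the branch at address 0x40 is predicted taken with target 0x80, while an address never seen before is predicted not taken with no target.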
There are two main attributes of branches that have consequences for the prediction function:
1. The branch condition. Conditional branches are only taken when the condition is fulfilled, whereas branches without a branch condition are always taken.
2. The branch target type. Branches that have the branch target address encoded as part of the branch instruction are referred to as direct branches. When the branch instruction is decoded, the target is known with certainty. Branches that refer to a storage location for the branch target are called indirect branches. The value stored at the storage location that is to be used as the branch target can change up until the last instruction before the branch instruction.
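The two attributes above can be illustrated with a tiny made-up instruction set; the opcode names and encoding are assumptions chosen for this sketch only. A direct branch carries its target inside the instruction, while an indirect branch reads its target from a register whose value may change right up until the branch executes.

```python
# Next-PC computation for a tiny illustrative instruction set:
#  "jmp"  - direct, unconditional: target encoded in the instruction, always taken
#  "beq"  - direct, conditional: taken only when the zero flag is set
#  "jmpr" - indirect, unconditional: target read from a register

def next_pc(pc, instr, regs, flags):
    op = instr["op"]
    if op == "jmp":
        return instr["target"]
    if op == "beq":
        return instr["target"] if flags["zero"] else pc + 1
    if op == "jmpr":
        return regs[instr["reg"]]
    return pc + 1                 # non-branch: fall through sequentially

regs = {"r1": 40}
taken = next_pc(10, {"op": "beq", "target": 99}, regs, {"zero": True})
not_taken = next_pc(10, {"op": "beq", "target": 99}, regs, {"zero": False})
indirect = next_pc(10, {"op": "jmpr", "reg": "r1"}, regs, {"zero": False})
```

Note that for "jmp" and "beq" the target (99) is known as soon as the instruction is decoded, whereas for "jmpr" it depends on the current contents of "r1", which is exactly the distinction between direct and indirect branches.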
In the case of fixed instruction sizes, where a small number of instructions are fetched each cycle, the positions of the instructions in the memory word are known and a limited amount of decoding can therefore be done on this data: a small number of partial decoders are used in parallel. These partial decoders can identify branches very early in the pipeline, so in this case the branch only needs to be predicted for indirect or conditional branches.
In the case of variable instruction sizes, the instructions can start at arbitrary addresses in the instruction memory. Partial decoding then becomes too costly because all possible positions of the instructions in the memory word must be tried simultaneously. This means that even for direct branches the branch target only becomes available in the decode stages, and the branch target therefore has to be predicted for all branch instructions.
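The cost difference between the two cases can be sketched by counting how many positions in a fetched memory word could start an instruction and would therefore each need a partial decoder. The fetch width and instruction sizes below are assumed example values.

```python
# With a fixed instruction size, only the aligned slots in a fetched
# word can start an instruction; with variable sizes, every byte
# offset is a candidate start and would need its own partial decoder.

def decode_candidates(fetch_width_bytes, instr_sizes):
    if len(instr_sizes) == 1:                  # fixed-size instruction set
        return fetch_width_bytes // instr_sizes[0]
    return fetch_width_bytes                   # variable sizes: every byte offset

fixed = decode_candidates(16, [4])             # e.g. 4-byte instructions only
variable = decode_candidates(16, [1, 2, 4, 8])
```

For an assumed 16-byte fetch word, a fixed 4-byte encoding needs only 4 partial decoders, while variable sizes require one per byte offset, i.e. 16, which is why partial decoding is described above as too costly in that case.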
US 2008/0168263 to Park discloses a pipelined processor comprising an instruction fetch unit receiving an instruction from a memory. The instruction fetch unit provides the received instruction to an instruction decoder unit that decodes the instruction and provides the decoded instruction to an execution unit for execution. In addition to providing the received instruction to the decoder unit, the instruction fetch unit provides an instruction portion to a branch prediction unit. The instruction portion comprises an instruction address. By means of the instruction portion the branch prediction unit predicts a next instruction and provides a predicted address to the instruction fetch unit. The instruction fetch unit fetches the instruction corresponding to the predicted address from the memory and loads it into the pipeline. The branch prediction unit also receives a definite next address from the execution unit when the instruction has been executed, and by means of this definite next address the branch prediction unit can determine whether the previously predicted next instruction was the correct next instruction. If the prediction was correct, the pipeline continues forward with the pipelined sequence of instructions. However, if the prediction was incorrect, the processor flushes the pipeline and loads the instruction indicated by the definite next address.
U.S. Pat. No. 7,519,798 to Chung discloses a branch prediction system comprising a processor core, a branch predictor and a branch target buffer. The processor core may output branch information to the branch predictor. The branch information may represent an address of a current instruction and/or may indicate whether a previous instruction is a branch instruction. The branch predictor may predict a branch of the current instruction using the address of the current instruction, and may output a final branch prediction result to the processor core. By means of the final branch prediction result the branch predictor may indicate to the processor core when to fetch the branch address from the branch target buffer.
A drawback of the methods and apparatuses disclosed in US 2008/0168263 and U.S. Pat. No. 7,519,798 is that they operate on a single instruction at a time, i.e., one instruction per clock cycle, which makes these methods and apparatuses both time-consuming and power-consuming.
US 2010/0169625 to Wang et al. discloses a pipeline microprocessor comprising an instruction fetch stage in which a plurality of instructions are fetched from an instruction cache for further processing and execution. Instructions are divided into control flow instructions and non-control flow instructions, and it has been determined that only control flow instructions can be branch instructions. The microprocessor further comprises a branch target buffer, a branch history table and a control instruction identification unit. The control instruction identification unit is configured to identify control instructions in the fetched group of instructions. Once the control instructions have been identified, access to the branch target buffer and the branch history table can be limited so that only the control flow instructions are looked up in the branch target buffer.
A drawback of the method and apparatus disclosed in US 2010/0169625 is that the instruction type, i.e., control flow instruction or non-control flow instruction, has to be known, requiring an extra analysis step to be performed by the control instruction identification unit. This also makes the method and apparatus unnecessarily time- and power-consuming.