1. Field of the Invention
The present invention relates generally to the data processing field. Still more particularly, the present invention relates to a computer implemented method, apparatus, and computer program product for performing out of order instruction folding and out of order instruction retirement.
2. Description of the Related Art
A processor's performance is measured by the number of instructions performed per clock cycle (instructions per cycle, or IPC). An instruction is an order given to a computer processor by a computer program. At the lowest level, each instruction is a sequence of 0s and 1s that describes a physical operation the computer is to perform, such as “Add”. In addition, the instruction may specify the storage areas, called registers, that contain data used in carrying out the instruction, or the location of that data in computer memory. The clock cycle is the time between two adjacent pulses of the oscillator that sets the tempo of the computer processor. The number of pulses per second is known as the clock speed, which is generally measured in MHz (megahertz, or millions of pulses per second) or in GHz (gigahertz, or billions of pulses per second).
Pipelining is an implementation technique for increasing the number of instructions performed per cycle. Pipelining can be thought of as an assembly line for computer instructions. The pipeline is divided into segments called stages, whereby multiple instructions are overlapped in execution. A typical pipeline consists of five stages: an instruction fetch stage, an instruction decode stage, an execution stage, a memory access stage, and a write back stage.
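The overlap described above can be illustrated with a minimal sketch. It assumes an idealized five-stage pipeline in which every stage takes exactly one clock cycle and no instruction ever stalls. The stage abbreviations and the function name `pipeline_schedule` are hypothetical and chosen only for this illustration.

```python
# Idealized five-stage pipeline: fetch, decode, execute, memory access, write back.
# Assumption: one clock cycle per stage and no stalls of any kind.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions):
    """Return, for each clock cycle, a dict mapping stage name to the
    index of the instruction occupying that stage in that cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            # Instruction i enters stage s in cycle i + s.
            instr = cycle - stage_index
            if 0 <= instr < num_instructions:
                row[stage] = instr
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(3)):
        print(cycle, row)
```

Under these assumptions, three instructions finish in seven cycles rather than the fifteen that would be needed if each instruction had to pass through all five stages before the next one could begin.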
In the case of a simple processor architecture, such as a scalar processor, one instruction per clock cycle is executed. In other words, only one instruction at a time can enter the pipeline. The instructions inside the pipeline move to the next stage after the slowest instruction completes its stage. The optimal performance increase for a pipelined instruction set over an unpipelined instruction set would be a factor equal to the number of stages employed in the pipeline. However, most instruction sets have data dependencies that do not allow for full pipelining. Therefore, the optimal performance of the pipelined instruction set is generally not achieved. In addition, other factors limit the performance increase associated with the pipeline, such as pipeline latency, an imbalance among the pipe stages, pipeline hazards, and pipelining overhead.
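The relationship between the number of stages and the achievable performance increase can be sketched numerically. The following is a simplified model, not a description of any particular processor: it assumes every stage takes one cycle, that an unpipelined processor spends one cycle per stage on each instruction, and that all losses from dependencies and hazards are lumped into a single hypothetical `stall_cycles` parameter.

```python
def pipeline_speedup(num_instructions, num_stages, stall_cycles=0):
    """Speedup of a pipelined over an unpipelined processor under a
    simplified model (one cycle per stage; stalls given as a lump sum)."""
    # Unpipelined: each instruction passes through every stage alone.
    unpipelined_cycles = num_instructions * num_stages
    # Pipelined: fill the pipeline once, then complete one instruction
    # per cycle, plus any cycles lost to stalls.
    pipelined_cycles = num_stages + (num_instructions - 1) + stall_cycles
    return unpipelined_cycles / pipelined_cycles


if __name__ == "__main__":
    print(pipeline_speedup(1000, 5))        # approaches, but stays below, 5
    print(pipeline_speedup(1000, 5, 200))   # stalls reduce the speedup further
```

In this model the speedup approaches the number of stages only for long instruction sequences with no stalls, which is consistent with the observation that the optimal performance is generally not achieved.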
Another method of increasing the number of instructions performed per clock cycle is to fold instructions. Instruction folding occurs when two or more instructions are executed in the same clock cycle. Instruction folding may be performed in a superscalar processor having multiple copies of each functional unit to enable execution of more than one instruction in parallel. However, instruction folding is costly because additional logic gates are required to implement data dependency checks and time delays for dependent instructions.
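The kind of data dependency check mentioned above can be sketched in software, although in a processor it would be implemented in logic gates. The sketch below is illustrative only: the `Instr` record, the register names, and the `can_fold` function are hypothetical, and the check covers only two cases, namely the second instruction reading the first instruction's result and both instructions writing the same register.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: operation, destination, sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def can_fold(first, second):
    """Return True if `second` may execute in the same cycle as `first`
    under the simplified dependency rules assumed in this sketch."""
    # Read-after-write: second needs a value that first has not yet produced.
    reads_result = first.dest in second.srcs
    # Write-after-write: both instructions target the same register.
    same_dest = first.dest == second.dest
    return not (reads_result or same_dest)


if __name__ == "__main__":
    add = Instr("add", "r1", ("r2", "r3"))
    independent = Instr("sub", "r4", ("r5", "r6"))
    dependent = Instr("sub", "r4", ("r1", "r6"))
    print(can_fold(add, independent))  # no shared registers
    print(can_fold(add, dependent))    # reads r1, which add writes
```

When the check fails, the dependent instruction must be delayed to a later cycle, which corresponds to the time delays for dependent instructions noted above.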