I. Field of the Invention
This invention relates generally to computer technology, and more particularly, to improving processor performance in a computer system.
II. Background Information
Processors execute a series of program instructions. Some processors achieve high performance by executing multiple instructions per clock cycle. The term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the processor. The term "instruction processing pipeline" refers to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
A branch instruction is an instruction which typically causes subsequent instructions to be fetched from one of at least two addresses: a sequential address identifying an instruction stream beginning with instructions which directly follow the branch instruction; and a target address identifying an instruction stream beginning at another location in memory. A branch is said to be resolved when it is known whether or not the instruction being processed in the pipeline will cause a branch, and to what address it will branch. Branch instructions typically are not resolved until after the execution stage. Waiting for the branch instruction to be resolved would starve the pipeline and severely impact performance, because it is unknown which instructions to load into the pipeline until after the branch is resolved. In order to maintain optimum performance of the processor, it is necessary to predict the instruction subsequent in program order to the branch instruction and dispatch that instruction into the instruction processing pipeline.
A branch prediction mechanism indicates a predicted direction (taken or not-taken) for a branch instruction, allowing subsequent instruction fetching to continue within the predicted instruction stream indicated by the branch prediction. In this way, branch prediction allows program execution to proceed with greater parallelism. When using branch prediction, instructions from the predicted instruction stream may be placed into the instruction processing pipeline prior to execution of the branch instruction.
Branch prediction allows for greater processor performance (and thus greater computer system performance) by preventing the pipeline from being idle until the branch is resolved. That is, branch prediction allows instructions to be fetched, decoded, and executed in the direction of a predicted instruction stream even before the branch is resolved. If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e., one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted instruction stream are discarded from the instruction processing pipeline and the number of instructions executed per clock cycle is decreased.
Well-known techniques for branch prediction exist. Some use static information, such as the direction and the distance of the branch. Others use run-time information, consisting of the prior history of whether branches were taken or not taken, to predict whether future branches will be taken.
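By way of illustration only, the two families of techniques mentioned above can be sketched in Python. The sketch assumes one common static heuristic (backward branches predicted taken, forward branches predicted not taken) and one common run-time scheme (a table of two-bit saturating counters indexed by the low bits of the branch address). The names `static_predict`, `TwoBitPredictor`, and `count_mispredictions`, the table size, and the initial counter value are hypothetical and are not taken from this specification.

```python
def static_predict(branch_addr, target_addr):
    """Static heuristic (assumed): a backward branch, typical of a loop,
    is predicted taken; a forward branch is predicted not taken."""
    return target_addr < branch_addr


class TwoBitPredictor:
    """Run-time predictor: 2-bit saturating counters indexed by the low
    bits of the branch address (table size is an assumption)."""

    def __init__(self, table_bits=4):
        self.mask = (1 << table_bits) - 1
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        # Every counter starts at 1 (weakly not taken), an arbitrary choice.
        self.counters = [1] * (1 << table_bits)

    def predict(self, branch_addr):
        return self.counters[branch_addr & self.mask] >= 2

    def update(self, branch_addr, taken):
        # Called once the branch is resolved, with the actual outcome.
        i = branch_addr & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, branch_addr, outcomes):
    """Replay a sequence of actual outcomes for one branch and count
    how many times the prediction disagreed with the outcome."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(branch_addr) != taken:
            wrong += 1
        predictor.update(branch_addr, taken)
    return wrong


# A loop branch that is taken nine times and then falls through.
loop_outcomes = [True] * 9 + [False]
mispredicted = count_mispredictions(TwoBitPredictor(), 0x40, loop_outcomes)
```

In this sketch the loop branch is mispredicted only while the counter warms up and again at loop exit, which is the behavior that makes history-based prediction attractive for loops.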
As described earlier, branch prediction is one way to improve processor performance. Another technique for improving processor performance is data speculation. Data speculation, among other things, addresses the problem of the growing gap between main memory and processor clock speeds. As a result of this gap, computer system performance is increasingly dominated by the latency of servicing memory accesses, particularly those accesses which are not easily predicted by the temporal and spatial locality captured by conventional cache memory organizations. Temporal locality describes the likelihood that a recently-referenced address will be referenced again soon, while spatial locality describes the likelihood that a close neighbor of a recently-referenced address will be referenced soon. If data can be correctly predicted then the processor is spared the time required for memory access (i.e., access to the cache, main memory, disk drive, etc.) in order to get that data.
Current data speculation methods include load value prediction, in which the results of loads are predicted at dispatch by exploiting the affinity between load instruction addresses and the data the loads produce. This method takes advantage of the fact that memory loads in many programs demonstrate a significant degree of data locality.
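As a minimal sketch of load value prediction, the following Python example assumes the simplest possible policy: a table keyed by the address of the load instruction that predicts the value the same load produced last time. The class `LastValuePredictor`, the helper `count_correct`, and the sample values are hypothetical; a practical predictor would typically add confidence tracking and a bounded table, which are omitted here.

```python
class LastValuePredictor:
    """Predicts that a load returns the same value it returned the
    previous time the same load instruction executed (an assumed policy)."""

    def __init__(self):
        # Maps load instruction address -> last value that load produced.
        self.table = {}

    def predict(self, load_instr_addr):
        # Returns None when there is no history for this load.
        return self.table.get(load_instr_addr)

    def update(self, load_instr_addr, actual_value):
        # Called when the real value arrives from the memory hierarchy.
        self.table[load_instr_addr] = actual_value


def count_correct(predictor, load_instr_addr, actual_values):
    """Replay the values actually produced by one load instruction and
    count how many were predicted correctly at dispatch."""
    correct = 0
    for value in actual_values:
        if predictor.predict(load_instr_addr) == value:
            correct += 1
        predictor.update(load_instr_addr, value)
    return correct


# One load instruction that mostly reloads the same value.
hits = count_correct(LastValuePredictor(), 0x2000, [7, 7, 7, 8, 8])
```

Each correct prediction lets dependent instructions proceed without waiting for the memory access; each incorrect one must be detected and repaired when the actual value arrives.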
Branch prediction gives us insights into data values so that data speculation can be efficiently performed. Further, this data speculation is "safe" because a branch misprediction causes the pipeline to be flushed thus discarding all the instructions involved in the incorrect data speculation. The problem with current methods of data speculation, however, is that they do not exploit the insights provided by branch prediction in order to increase processor performance.
For the foregoing reasons, data dependency collapsing based on control-flow speculation can enhance processor performance.
The present invention is directed to an apparatus and method for collapsing one or more operands. An embodiment of the present invention includes a post-decode unit which upon decoding an instruction that modifies its zero flag when executed, records information in a first entry about the operands for that particular instruction. Upon decoding an instruction that is either a branch if equal instruction and predicted as taken or a branch if not equal instruction and predicted as not taken, the post-decode unit copies the recorded information in the first entry to one of the second entries. The post-decode unit also translates the operands of an instruction if information is recorded about the operands in one of the second entries and that recorded information is valid.
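The behavior summarized above can be illustrated with a small software model. This is a sketch under stated assumptions and not the claimed implementation: it assumes a toy instruction set in which `cmp` is the only instruction that modifies the zero flag, it assumes that "translating" an operand means replacing the second compared register with the first (the two are equal whenever the zero flag is predicted set), and it assumes that a record becomes invalid once either register is overwritten. The names `Entry`, `PostDecodeUnit`, and `decode`, the number of second entries, and the round-robin replacement are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    src_a: str = ""
    src_b: str = ""
    valid: bool = False


class PostDecodeUnit:
    def __init__(self, num_second_entries=4):
        self.first = Entry()  # operands of the latest flag-setting cmp
        self.second = [Entry() for _ in range(num_second_entries)]
        self.next_slot = 0    # round-robin replacement (assumption)

    def _invalidate(self, reg):
        # Assumed rule: a record is stale once a register it names is written.
        for e in [self.first] + self.second:
            if e.valid and reg in (e.src_a, e.src_b):
                e.valid = False

    def _translate(self, reg):
        # Replace reg with the register it is predicted to equal, if any.
        for e in self.second:
            if e.valid and e.src_b == reg:
                return e.src_a
        return reg

    def decode(self, op, dest=None, srcs=(), predicted_taken=None):
        if op in ("beq", "bne"):
            # Zero flag is predicted set for beq/taken and bne/not-taken.
            zf_set = (op == "beq" and predicted_taken is True) or (
                op == "bne" and predicted_taken is False)
            if zf_set and self.first.valid:
                self.second[self.next_slot] = Entry(
                    self.first.src_a, self.first.src_b, True)
                self.next_slot = (self.next_slot + 1) % len(self.second)
            return (op, dest, tuple(srcs))
        new_srcs = tuple(self._translate(r) for r in srcs)
        if op == "cmp":
            self.first = Entry(srcs[0], srcs[1], True)
        elif dest is not None:
            self._invalidate(dest)
        return (op, dest, new_srcs)


pdu = PostDecodeUnit()
pdu.decode("cmp", srcs=("r1", "r2"))         # record r1, r2 in the first entry
pdu.decode("beq", predicted_taken=True)      # predicts r1 == r2; copy the record
collapsed = pdu.decode("add", dest="r3", srcs=("r2", "r4"))
pdu.decode("mov", dest="r1", srcs=("r5",))   # r1 overwritten; record invalidated
untouched = pdu.decode("add", dest="r6", srcs=("r2", "r4"))
```

In this model the first `add` is rewritten to read `r1` instead of `r2`, so it no longer has to wait for whatever instruction produces `r2`. If the branch prediction turns out to be wrong, the pipeline flush discards the rewritten instruction along with the rest of the mispredicted stream.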