The present invention relates to a recording scheme for instruction segments in a processor core in which instructions from instruction segments may be cached in reverse program order.
FIG. 1 is a block diagram illustrating the process of program execution in a conventional processor. Program execution may include three stages: front end 110, execution 120 and memory 130. The front-end stage 110 performs instruction pre-processing. Front-end processing is designed with the goal of supplying valid decoded instructions to an execution core with low latency and high bandwidth. Front-end processing can include branch prediction, decoding and renaming. As the name implies, the execution stage 120 performs instruction execution. The execution stage 120 typically communicates with a memory 130 to operate upon data stored therein.
Conventionally, front-end processing 110 may build instruction segments from stored program instructions to reduce the latency of instruction decoding and to increase front-end bandwidth. Instruction segments are sequences of dynamically executed instructions that are assembled into logical units. The program instructions may have been assembled into the instruction segment from non-contiguous regions of an external memory space but, when they are assembled in the instruction segment, the instructions appear in program order. The instruction segment may include instructions or uops (micro-instructions).
A trace is perhaps the most common type of instruction segment. Typically, a trace may begin with an instruction of any type. Traces have a single-entry, multiple-exit architecture. Instruction flow starts at the first instruction but may exit the trace at multiple points, depending on predictions made at branch instructions embedded within the trace. The trace may end when one of a number of predetermined end conditions occurs, such as a trace size limit, the occurrence of a maximum number of conditional branches or the occurrence of an indirect branch or a return instruction. Traces typically are indexed by the address of the first instruction therein.
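By way of illustration only, the end conditions described above may be sketched as follows. The names `Instr`, `build_trace` and `trace_index`, the instruction kinds and the two limit values are hypothetical and are chosen solely for this example; they do not describe any particular processor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    addr: int
    kind: str  # assumed kinds: "alu", "cond_branch", "indirect_branch", "return"

MAX_TRACE_LEN = 8       # assumed trace size limit
MAX_COND_BRANCHES = 2   # assumed maximum number of conditional branches

def build_trace(stream):
    """Collect dynamically executed instructions until an end condition occurs."""
    trace = []
    cond_branches = 0
    for instr in stream:
        trace.append(instr)
        if instr.kind == "cond_branch":
            cond_branches += 1
        if (len(trace) >= MAX_TRACE_LEN
                or cond_branches >= MAX_COND_BRANCHES
                or instr.kind in ("indirect_branch", "return")):
            break
    return trace

def trace_index(trace):
    """A trace is indexed by the address of its first instruction."""
    return trace[0].addr

stream = [
    Instr(0x100, "alu"),
    Instr(0x104, "cond_branch"),
    Instr(0x200, "alu"),
    Instr(0x204, "return"),
    Instr(0x300, "alu"),
]
trace = build_trace(stream)
```

In this example the trace ends at the return instruction, so the instruction at 0x300 would begin a subsequent segment.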
Other instruction segments are known. The inventors have proposed an instruction segment, which they call an “extended block,” that has a different architecture than the trace. The extended block has a multiple-entry, single-exit architecture. Instruction flow may start at any point within an extended block but, when it enters the extended block, instruction flow must progress to a terminal instruction in the extended block. The extended block may terminate on a conditional branch, a return instruction or a size limit. The extended block may be indexed by the address of the last instruction therein.
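The multiple-entry, single-exit property may be illustrated with the following minimal sketch. The class `ExtendedBlock` and its methods are hypothetical names introduced for this example, and instructions are represented simply by their addresses.

```python
class ExtendedBlock:
    """Hypothetical model: instructions held in program order, terminal last."""

    def __init__(self, addrs):
        self.addrs = list(addrs)

    @property
    def index(self):
        # An extended block is indexed by the address of its last instruction.
        return self.addrs[-1]

    def enter(self, addr):
        # Flow may enter at any instruction but must run to the terminal one.
        start = self.addrs.index(addr)
        return self.addrs[start:]

xb = ExtendedBlock([0x10, 0x14, 0x18, 0x1C])
from_middle = xb.enter(0x14)
```

Whatever the entry point, the returned sequence ends at the same terminal instruction, which is why the terminal address can serve as the index.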
A “basic block” is another example of an instruction segment. It is perhaps the simplest type of instruction segment available. The basic block may terminate on the occurrence of any kind of branch instruction, including an unconditional branch. The basic block may be characterized by a single-entry, single-exit architecture. Typically, the basic block is indexed by the address of the first instruction therein.
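The termination rule for basic blocks may be sketched as follows. The function `split_basic_blocks` and the set of branch kinds are assumptions made for illustration. The sketch cuts a block only after a branch instruction; a fuller treatment would also begin a new block at every branch target in order to preserve the single-entry property, which is omitted here for brevity.

```python
# Assumed branch kinds; any of them terminates a basic block.
BRANCH_KINDS = {"cond_branch", "uncond_branch", "indirect_branch", "return"}

def split_basic_blocks(stream):
    """Split a sequence of (address, kind) pairs after every branch."""
    blocks, current = [], []
    for addr, kind in stream:
        current.append((addr, kind))
        if kind in BRANCH_KINDS:
            blocks.append(current)
            current = []
    if current:  # trailing instructions that have not yet reached a branch
        blocks.append(current)
    return blocks

stream = [
    (0x100, "alu"),
    (0x104, "uncond_branch"),
    (0x200, "alu"),
    (0x204, "alu"),
    (0x208, "cond_branch"),
    (0x20C, "alu"),
]
blocks = split_basic_blocks(stream)
# Each block is indexed by the address of its first instruction.
indices = [block[0][0] for block in blocks]
```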
Regardless of the type of instruction segment used in a processor 100, the instruction segment typically is cached for later use. Reduced latency is achieved when program flow returns to the instruction segment because the instruction segment may store instructions already assembled in program order. The instructions in the cached instruction segment may be furnished to the execution stage 120 faster than they could be furnished from different locations in an ordinary instruction cache.
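The caching behavior described above may be modeled, under simplifying assumptions, as a lookup table keyed by a segment's index address. The class `SegmentCache` and its methods are hypothetical; capacity limits, replacement policy and set associativity are deliberately ignored.

```python
class SegmentCache:
    """Hypothetical model of a segment cache keyed by index address."""

    def __init__(self):
        self._lines = {}

    def lookup(self, index_addr):
        # Returns the cached segment, or None on a miss.
        return self._lines.get(index_addr)

    def fill(self, index_addr, instrs):
        # Stores a copy of the already assembled segment.
        self._lines[index_addr] = list(instrs)

cache = SegmentCache()
miss = cache.lookup(0x100)
cache.fill(0x100, [0x100, 0x104, 0x200])
hit = cache.lookup(0x100)
```

On the hit, the whole segment is available in program order from a single lookup, even though address 0x200 is not contiguous with 0x104 in memory.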
While the use of instruction segments has reduced execution latency, the segments tend to exhibit a high degree of redundancy. A segment cache may store copies of a single instruction in multiple instruction segments, thereby wasting space in the cache. The inventors propose to reduce this redundancy by merging one or more segments into a larger, aggregate segment or by extending one instruction segment to include instructions from another instruction segment with overlapping instructions. However, extension of segments is a non-trivial task, for several reasons.
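The redundancy and the proposed extension may be illustrated with the following sketch, in which segments are lists of instruction addresses. The functions `redundant_slots` and `merge_overlap` are hypothetical, and the merge shown handles only the simple case in which the end of one segment overlaps the beginning of another.

```python
from collections import Counter

def redundant_slots(segments):
    """Count cache slots holding an instruction that is also stored elsewhere."""
    counts = Counter(addr for seg in segments for addr in seg)
    return sum(n - 1 for n in counts.values())

def merge_overlap(first, second):
    """Extend `first` with `second` when a suffix of `first` is a prefix of `second`.

    Returns None when the two segments do not overlap in that way.
    """
    for k in range(min(len(first), len(second)), 0, -1):
        if first[-k:] == second[:k]:
            return first + second[k:]
    return None

segments = [[1, 2, 3, 4], [3, 4, 5], [4, 5, 6]]
wasted = redundant_slots(segments)
merged = merge_overlap(segments[0], segments[1])
```

Here the three segments occupy ten slots for six distinct instructions; merging the first two removes two of the duplicated slots.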
First, instructions typically are cached in program order. Extending an instruction segment at its beginning would require previously stored instructions to be shifted downward through a cache to make room for the new instructions. The instructions may be shifted by varying amounts, depending upon the number of new instructions to be added. This serial shift may consume a great deal of time, which may impair the effectiveness of the front-end stage 110.
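The cost of the shift may be seen in the following sketch, which counts how many previously stored entries must move when new instructions are added at the beginning of a segment. For contrast, it also models storage in reverse program order, as mentioned in the opening paragraph. Both functions and the move-counting model are assumptions made for illustration and say nothing about actual cache timing.

```python
def prepend_program_order(stored, new):
    """Segment stored first-instruction-first: every old entry must shift."""
    moves = len(stored)  # each existing entry moves down by len(new) slots
    return list(new) + list(stored), moves

def prepend_reverse_order(stored_rev, new):
    """Segment stored last-instruction-first: new entries go in free slots at the end."""
    return list(stored_rev) + list(reversed(new)), 0

old = [3, 4, 5]   # instructions already cached, in program order
new = [1, 2]      # earlier instructions to be added at the beginning

forward, forward_moves = prepend_program_order(old, new)
backward, backward_moves = prepend_reverse_order(list(reversed(old)), new)
```

Both layouts end up holding the same five instructions; only the program-order layout had to relocate the three entries that were already stored.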
Additionally, the extension may destroy previously established relationships among the instruction segments. Instruction segments not only are cached, but they also are indexed by the front-end stage 110 to identify relationships among themselves. For example, program flow previously may have exited a first segment and arrived at a second segment. A mapping from the first instruction segment to the second instruction segment may be stored by the front-end stage 110 in addition to the instruction segments themselves. Oftentimes, the mappings simply are pointers from one instruction segment to the first instruction in a second instruction segment.
Extension of instruction segments, however, may cause new instructions to be added to the beginning of the segment. In such a case, an old pointer to the segment must be updated to circumvent the newly added instructions. Otherwise, if the old mapping were used, the front-end stage 110 would furnish an incorrect set of instructions to the execution stage 120, and the processor 100 would execute the wrong instructions.
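The stale-pointer problem may be demonstrated with the following sketch, in which a mapping is modeled as an offset into a segment stored in program order. The names `fetch_from`, `extend_front` and `seg1_exit` are hypothetical and serve only this example.

```python
def fetch_from(segment, entry_offset):
    """Instructions furnished when flow enters the segment at entry_offset."""
    return segment[entry_offset:]

def extend_front(segment, new_instrs, pointers):
    """Add instructions at the beginning and return pointers adjusted to skip them."""
    shift = len(new_instrs)
    extended = list(new_instrs) + list(segment)
    updated = {name: offset + shift for name, offset in pointers.items()}
    return extended, updated

second_segment = ["C", "D", "E"]
pointers = {"seg1_exit": 0}  # first segment exits to the start of the second

extended, updated = extend_front(second_segment, ["A", "B"], pointers)

stale = fetch_from(extended, pointers["seg1_exit"])   # old mapping, not updated
correct = fetch_from(extended, updated["seg1_exit"])  # mapping after update
```

With the old mapping, flow arriving from the first segment would be furnished instructions A and B, which it never should have reached; every stored pointer into the extended segment therefore needs the same adjustment.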
Accordingly, there is a need in the art for a front-end processing system that permits instruction segments to be extended dynamically without disruption to previously stored mappings among the instruction segments.