The present invention relates to a power conservation method in a processor.
FIG. 1 is a block diagram illustrating the process of program execution in a conventional processor. Program execution may include three stages: front end 110, execution 120 and memory 130. The front-end stage 110 performs instruction preprocessing. Front end processing 110 typically is designed with the goal of supplying valid decoded instructions to an execution core with low latency and high bandwidth. Front-end processing 110 can include branch prediction, decoding and renaming. As the name implies, the execution stage 120 performs instruction execution. The execution stage 120 typically communicates with a memory 130 to operate upon data stored therein.
FIG. 2 illustrates high-level processes that may occur in front-end processing. A front-end may store instructions in a memory, called an “instruction cache” 140. A variety of different instruction formats and storage schemes are known. In more complex embodiments, instructions may have variable lengths (say, from 1 to 16 bytes in length) and they need not be aligned to any byte location in a cache line. Thus, a first stage of instruction decoding may involve instruction synchronization 150, that is, identifying the location and length of each instruction found in a line from the instruction cache. Instruction synchronization typically determines the location at which a first instruction begins and determines the locations of other instructions iteratively, by determining the length of a current instruction and identifying the start of a subsequent instruction at the next byte following the conclusion of the current instruction. Once the instruction synchronization is completed, an instruction decoder 160 may generate micro-instructions from the instructions. These micro-instructions, also known as uops, may be provided to the execution unit 120 for execution.
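By way of illustration only, the iterative synchronization described above may be sketched as follows. The length encoding used here (the low four bits of an instruction's first byte hold its length minus one) and the names toy_length and synchronize are hypothetical, chosen only to keep the sketch self-contained; they do not correspond to any real instruction set.

```python
from typing import Callable, List, Tuple


def toy_length(line: bytes, offset: int) -> int:
    """Hypothetical encoding: low 4 bits of the first byte hold (length - 1),
    giving instruction lengths of 1 to 16 bytes."""
    return (line[offset] & 0x0F) + 1


def synchronize(
    line: bytes,
    first_start: int,
    length_of: Callable[[bytes, int], int] = toy_length,
) -> List[Tuple[int, int]]:
    """Return (start, length) for each instruction wholly contained in the line.

    The start of each instruction is found from the start and length of the
    one before it, so the walk is inherently sequential.
    """
    marks = []
    offset = first_start
    while offset < len(line):
        length = length_of(line, offset)
        if offset + length > len(line):
            break  # instruction spills into the next line; left to the caller
        marks.append((offset, length))
        offset += length  # next instruction begins at the following byte
    return marks


line = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC])
boundaries = synchronize(line, first_start=0)
```

Because each start position depends on the preceding length, the sketch suggests why synchronization adds latency that a cache of already-decoded uops can avoid.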
Conventionally, front end processing 110 may build instruction segments from stored program instructions to reduce the latency of instruction decoding and to increase front-end bandwidth. Instruction segments are sequences of dynamically executed instructions that are assembled into logical units. The program instructions may have been assembled into the instruction segment from non-contiguous regions of an external memory space but, when they are assembled in the instruction segment, the instructions appear in program order. The instruction segment may include microinstructions (uops).
A trace is perhaps the most common type of instruction segment. Typically, a trace may begin with an instruction of any type. Traces have a single-entry, multiple-exit architecture. Instruction flow starts at the first instruction but may exit the trace at multiple points, depending on predictions made at branch instructions embedded within the trace. The trace may end when one of a number of predetermined end conditions occurs, such as a trace size limit, the occurrence of a maximum number of conditional branches or the occurrence of an indirect branch or a return instruction. Traces typically are indexed by the address of the first instruction therein.
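The end conditions listed above may be sketched, under stated assumptions, as a simple loop over a stream of decoded uops. The function name build_trace, the uop kind labels and the default limits (8 uops, 2 conditional branches) are hypothetical and are not taken from any particular processor.

```python
def build_trace(stream, max_uops=8, max_cond_branches=2):
    """Collect uops from `stream` until a trace end condition occurs.

    `stream` yields (address, kind) pairs in dynamic execution order.
    Returns (index, trace), where index is the address of the first uop.
    """
    trace = []
    cond_branches = 0
    for address, kind in stream:
        trace.append((address, kind))
        if kind == "cond_branch":
            cond_branches += 1
        if kind in ("indirect_branch", "return"):
            break  # end condition: indirect branch or return
        if cond_branches >= max_cond_branches:
            break  # end condition: maximum number of conditional branches
        if len(trace) >= max_uops:
            break  # end condition: trace size limit
    index = trace[0][0] if trace else None  # indexed by first instruction
    return index, trace


stream = [
    (0x100, "alu"),
    (0x104, "cond_branch"),
    (0x200, "alu"),          # non-contiguous address, still in program order
    (0x204, "cond_branch"),
    (0x300, "alu"),
]
index, trace = build_trace(stream)
```

In this sketch the trace ends at the second conditional branch, so the uop at 0x300 would begin a later segment.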
Other instruction segments are known. Intel engineers have proposed an instruction segment, which they call an “extended block,” that has a different architecture than the trace. The extended block has a multiple-entry, single-exit architecture. Instruction flow may start at any point within an extended block but, once it enters the extended block, instruction flow must progress to a terminal instruction in the extended block. The extended block may terminate on a conditional branch, a return instruction or a size limit. The extended block may be indexed by the address of the last instruction therein. Extended blocks and methods for constructing them are described in Jourdan, et al., “eXtended Block Cache,” HPCA-6 (January 2000).
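For illustration, the multiple-entry, single-exit behavior may be modeled as follows. This is a minimal sketch and not the construction method of the cited paper; the class name ExtendedBlock and the method flow_from are hypothetical.

```python
class ExtendedBlock:
    """Toy model of a multiple-entry, single-exit instruction segment."""

    def __init__(self, uops):
        # uops: (address, name) pairs in program order; the last one is the
        # terminal instruction.
        if not uops:
            raise ValueError("an extended block needs at least one uop")
        self.uops = list(uops)
        self.index = self.uops[-1][0]  # indexed by the last instruction

    def flow_from(self, entry_address):
        """Return the uops executed when flow enters at `entry_address`.

        Flow always runs to the terminal instruction. Returns None if the
        address is not an entry point of this block.
        """
        for position, (address, _) in enumerate(self.uops):
            if address == entry_address:
                return self.uops[position:]
        return None


xb = ExtendedBlock([(0x10, "add"), (0x14, "load"), (0x18, "br_cond")])
tail = xb.flow_from(0x14)
```

Every entry point shares the same terminal instruction, which is why the address of the last instruction can serve as a single index for the block.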
A “basic block” is another example of an instruction segment. It is perhaps the simplest type of instruction segment available. The basic block may terminate on the occurrence of any kind of branch instruction, including an unconditional branch. The basic block may be characterized by a single-entry, single-exit architecture. Typically, the basic block is indexed by the address of the first instruction therein.
Regardless of the type of instruction segment used in a processor 110, the instruction segment typically is stored in a segment cache 170 for later use. Reduced latency is achieved when program flow returns to the instruction segment because the instruction segment may store instructions already decoded into uops and assembled in program order. Uops from the instruction segments in the segment cache 170 may be furnished to the execution stage 120 faster than they could be furnished from different locations in an ordinary instruction cache 140.
Many instruction segments, once built and stored within a segment cache 170, are never used. This may occur, for example, because program flow does not return to the instructions from which the instruction segment was constructed. Some other instruction segments may be reused quite often. However, because a segment cache 170 may have a limited capacity (say, 1024 uops), low segment reuse causes even frequently-used instruction segments to be overwritten by other instruction segments before their useful life otherwise might conclude. Thus, with a high eviction rate in the segment cache 170, the advantages of instruction segments can be lost.
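The eviction effect described above may be illustrated with a small capacity-limited cache. The text does not specify a replacement policy; least-recently-used replacement is assumed here purely for illustration, and the class name SegmentCache and its methods are hypothetical.

```python
from collections import OrderedDict


class SegmentCache:
    """Toy segment cache whose capacity is counted in uops (LRU assumed)."""

    def __init__(self, capacity_uops=1024):
        self.capacity = capacity_uops
        self.used = 0
        self.segments = OrderedDict()  # index address -> list of uops
        self.evictions = 0

    def insert(self, index, uops):
        if len(uops) > self.capacity:
            return False  # segment can never fit
        if index in self.segments:
            self.used -= len(self.segments.pop(index))
        # Evict least-recently-used segments until the new one fits.
        while self.used + len(uops) > self.capacity:
            _, victim = self.segments.popitem(last=False)
            self.used -= len(victim)
            self.evictions += 1
        self.segments[index] = list(uops)
        self.used += len(uops)
        return True

    def lookup(self, index):
        uops = self.segments.get(index)
        if uops is not None:
            self.segments.move_to_end(index)  # mark as recently used
        return uops


cache = SegmentCache(capacity_uops=16)
cache.insert(0x100, ["uop"] * 8)   # a segment that will be wanted again
cache.insert(0x200, ["uop"] * 8)   # cache is now full
cache.insert(0x300, ["uop"] * 8)   # a newly built segment forces an eviction
```

In this sketch the segment at 0x100 is overwritten by a newly built segment before program flow returns to it, so the work spent building it is lost.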
Conventionally, a front end stage 110 may include a segment builder 180 provided in communication with the instruction decoder 160 to capture decoded uops and build instruction segments therefrom. The segment builder 180 typically includes buffer memories to store the uops and a state machine to detect segment start and end conditions and to manage storage of instruction segments within the segment cache 170.
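A segment builder of this general kind may be sketched as a two-state machine with a uop buffer. The sketch below is illustrative only: the state names, the end-condition predicate is_end, the size limit and the use of a plain dictionary to stand in for the segment cache are all assumptions rather than details of any actual design.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    BUILDING = 1


class SegmentBuilder:
    """Toy state machine that captures decoded uops and stores segments."""

    def __init__(self, is_end, cache, max_uops=16):
        self.is_end = is_end        # predicate: does this uop end a segment?
        self.cache = cache          # stand-in for the segment cache (a dict)
        self.max_uops = max_uops    # assumed size limit
        self.state = State.IDLE
        self.buffer = []

    def capture(self, address, uop):
        if self.state is State.IDLE:
            # Assumed start condition: any uop may begin a segment.
            self.state = State.BUILDING
            self.buffer = []
        self.buffer.append((address, uop))
        if self.is_end(uop) or len(self.buffer) >= self.max_uops:
            # End condition detected: store the segment, indexed here by the
            # address of its first uop, and return to the idle state.
            self.cache[self.buffer[0][0]] = list(self.buffer)
            self.buffer = []
            self.state = State.IDLE


segments = {}
builder = SegmentBuilder(lambda uop: uop.startswith("br"), segments)
decoded = [(0, "add"), (4, "br_cond"), (8, "load"), (12, "br_uncond"), (16, "add")]
for address, uop in decoded:
    builder.capture(address, uop)
```

With a predicate that ends a segment on any branch, this sketch yields basic-block-style segments; a different predicate and index choice would be needed for traces or extended blocks.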
The techniques for implementation and management of instruction segments consume tremendous amounts of power. Power must be provided for the segment cache 170 and the segment builder 180. The segment cache 170 must be integrated with other front-end components, such as one or more branch predictors (not shown). And, of course, as implementation of instruction segments becomes more complex, for example, to employ concepts of traces or extended blocks, the power consumed by the circuits that implement them also may increase. By way of example, the front-end system of the IA-32 processors, products commercially available from Intel Corporation of Santa Clara, Calif., consumes about 28% of the overall processor power.
As mobile computing applications and others have evolved, raw processor performance no longer is the paramount consideration for processor designs. Modern designs endeavor to maximize processor performance within a given power envelope. Given the considerable amount of power spent in front-end processing, the inventors perceived a need in the art for a front end unit that employs power control techniques. It is believed that such front end units are unknown in the art.