1. Field of the Invention
The present invention relates generally to the field of high speed processors, and more specifically to a processor including a sub-core operating at a higher frequency than the rest of the execution core, and also to a replay architecture for facilitating data-speculating operation of the sub-core.
2. Background of the Prior Art
FIG. 1 illustrates a microprocessor 100 according to the prior art. The microprocessor includes an I/O ring which operates at a first clock frequency, and an execution core which operates at a second clock frequency. For example, the Intel486 DX2 may run its I/O ring at 33 MHz and its execution core at 66 MHz for a 2:1 ratio (1/2 bus), the IntelDX4 may run its I/O ring at 25 MHz and its execution core at 75 MHz for a 3:1 ratio (1/3 bus), and the Intel Pentium® OverDrive® processor may operate its I/O ring at 33 MHz and its execution core at 82.5 MHz for a 2.5:1 ratio (2/5 bus).
A distinction may be made between “I/O operations” and “execution operations”. For example, in the DX2, the I/O ring performs I/O operations such as buffering, bus driving, receiving, parity checking, and other operations associated with communicating with the off-chip world, while the execution core performs execution operations such as addition, multiplication, address generation, comparisons, rotation and shifting, and other “processing” manipulations.
The processor 100 may optionally include a clock multiplier. In this mode, the processor can automatically set the speed of its execution core according to an external, slower clock provided to its I/O ring. This may reduce the number of pins needed. Alternatively, the processor may include a clock divider, in which case the processor sets the I/O ring speed responsive to an external clock provided to the execution core.
These clock multiply and clock divide functions are logically the same for the purposes of this invention, so the term “clock mult/div” will be used herein to denote either a multiplier or divider as suitable. The skilled reader will comprehend how external clocks may be selected and provided, and from there multiplied or divided. Therefore, specific clock distribution networks, and the details of clock multiplication and division, will not be expressly illustrated. Furthermore, the clock mult/div units need not necessarily be limited to integer multiple clocks, but can perform e.g. 2:5 clocking. Finally, the clock mult/div units need not necessarily even be limited to fractional bus clocking, but can, in some embodiments, be flexible, asynchronous, and/or programmable, such as in providing a P/Q clocking scheme.
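As an illustration only, the arithmetic of a P/Q clock relationship can be sketched as follows. The helper name `derived_clock_mhz` is hypothetical and is not drawn from any described circuit; the frequencies are the example values given in the background above.

```python
from fractions import Fraction

def derived_clock_mhz(reference_mhz, p, q):
    """Return reference_mhz * p / q exactly (hypothetical helper for illustration)."""
    return Fraction(reference_mhz) * Fraction(p, q)

# (I/O clock in MHz, P, Q) for the three prior art examples in the background.
examples = [(33, 2, 1), (25, 3, 1), (33, 5, 2)]

for io_mhz, p, q in examples:
    core_mhz = derived_clock_mhz(io_mhz, p, q)
    print(f"I/O {io_mhz} MHz with {p}/{q} multiplier: core {float(core_mhz)} MHz")
```

Swapping P and Q gives the divider case, in which the slower clock is derived from the faster one.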
The basic motivation for increasing clock frequencies in this manner is to reduce instruction latency. The execution latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction. Suppose that a part of a program contains a sequence of N instructions, I_1, I_2, I_3, . . . , I_N. Suppose that I_{n+1} requires, as part of its inputs, the result of I_n, for all n from 1 to N−1. This part of the program may also contain any other instructions. Then this part of the program cannot be executed in less time than T = L_1 + L_2 + L_3 + . . . + L_N, where L_n is the latency of instruction I_n, for all n from 1 to N. In fact, even if the processor were capable of executing a very large number of instructions in parallel, T would remain a lower bound on the time to execute this part of the program. Hence, to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
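The lower bound can be checked with a small sketch. It assumes an idealized machine with unlimited parallel hardware, in which an instruction starts as soon as all of its producers have finished. The function names, latencies, and dependency lists are hypothetical values chosen for illustration.

```python
def chain_lower_bound(latencies):
    """Sum of latencies along a strictly serial dependency chain."""
    return sum(latencies)

def earliest_finish(latencies, deps):
    """Earliest finish time of each instruction on an idealized machine.

    deps[i] lists the indices (all smaller than i) whose results instruction i needs.
    """
    finish = []
    for i, latency in enumerate(latencies):
        start = max((finish[d] for d in deps[i]), default=0)
        finish.append(start + latency)
    return finish

# Instructions 0..3 form a serial chain; instruction 4 is independent of them.
latencies = [1, 3, 1, 2, 2]
deps = [[], [0], [1], [2], []]

finish = earliest_finish(latencies, deps)
print(finish)                             # finish times per instruction
print(max(finish))                        # total time with unlimited parallelism
print(chain_lower_bound(latencies[:4]))   # T for the serial chain
```

Under these assumptions the independent instruction overlaps with the chain at no cost, but the total time still cannot fall below the chain's summed latency.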
We may look at the same thing from a slightly different point of view. Define an instruction I_n to be “in flight” from the time that it requires its input operands to be ready until the time when its result is ready to be used by another instruction. Instruction I_n is therefore “in flight” for a length of time L_n = A_n*C, where A_n is the latency of I_n as defined above, but expressed in cycles, and C is the cycle time. Let a program execute N instructions as above and take M “cycles” or units of time to do it. Looked at from either point of view, it is critically important to reduce the execution latency as much as possible.
The average latency L can be conventionally defined as L = (1/N)*(L_1 + L_2 + L_3 + . . . + L_N) = (C/N)*(A_1 + A_2 + A_3 + . . . + A_N). Let f_j be the number of instructions that are in flight during cycle j. We can then define the parallelism P as the average number of instructions in flight for the program, or P = (1/M)*(f_1 + f_2 + f_3 + . . . + f_M).
Notice that f_1 + f_2 + f_3 + . . . + f_M = A_1 + A_2 + A_3 + . . . + A_N. Both sides of this equation are ways of counting up the number of cycles in which instructions are in flight, wherein if x instructions are in flight in a given cycle, that cycle counts as x cycles.
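The counting identity can be illustrated with a short sketch. The schedule below is a hypothetical example: each instruction is described by the cycle in which it enters flight and its latency in cycles.

```python
def in_flight_counts(schedule, total_cycles):
    """Return the number of instructions in flight during each cycle.

    schedule is a list of (start_cycle, latency_in_cycles) pairs.
    """
    counts = [0] * total_cycles
    for start, latency in schedule:
        for cycle in range(start, start + latency):
            counts[cycle] += 1
    return counts

# Hypothetical schedule of N = 4 instructions over M = 6 cycles.
schedule = [(0, 2), (1, 3), (2, 1), (4, 2)]
M = 6

f = in_flight_counts(schedule, M)
A = [latency for _, latency in schedule]

print(f)               # per-cycle in-flight counts
print(sum(f), sum(A))  # both sums count the same instruction-cycles
```

Summing by cycle and summing by instruction are two traversals of the same set of (instruction, cycle) pairs, which is why the totals agree.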
Now define the “average bandwidth” B as the total number of instructions executed, N, divided by the time used, M*C, or in other words, B = N/(M*C).
We may then easily see that P = L*B. In this formula, L is the average latency for a program, B is its average bandwidth, and P is its average parallelism. Note that B tells how fast we execute the program: it is measured in instructions per second. If the program has N instructions, it takes N/B seconds to execute it. The goal of a faster processor is exactly the goal of getting B higher.
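For completeness, the relation follows from the definitions of P, L, and B given above together with the counting identity between the in-flight counts and the cycle latencies. One way to write the steps out:

```latex
\begin{aligned}
P &= \frac{1}{M}\sum_{j=1}^{M} f_j
   = \frac{1}{M}\sum_{n=1}^{N} A_n \\
  &= \left(\frac{C}{N}\sum_{n=1}^{N} A_n\right)\cdot\frac{N}{M\,C} \\
  &= L \cdot B
\end{aligned}
```

The first factor in the second line is the average latency L, and the second factor is the average bandwidth B.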
We now note that increasing B requires either increasing the parallelism P or decreasing the average latency L. It is well known that the parallelism P that can be readily exploited for a program is limited. While it is true that certain classes of programs have large exploitable parallelism, a large class of important programs has P restricted to quite small numbers.
One drawback which the prior art processors have is that their entire execution core is constrained to run at the same clock speed. This limits some components within the core in a “weakest link” or “slowest path” manner.
In the 1960s and 1970s, there existed central processing units in which a multiplier or divider co-processor was clocked at a frequency higher than other circuitry in the central processing unit. These central processing units were constructed of discrete components rather than as integrated circuits or monolithic microprocessors. Due to their construction as co-processors, and/or the fact that they were not integrated with the main processor, these units should not be considered as “sub-cores”.
Another feature of some prior art processors is the ability to perform “speculative execution”. This is also known as “control speculation”, because the processor guesses which way control (branching) instructions will go. Some processors perform speculative fetch, and others, such as the Intel Pentium Pro processor, also perform speculative execution. Control speculating processors include mechanisms for recovering from mispredicted branches, to maintain program and data integrity as though no speculation were taking place.
FIG. 2 illustrates a conventional data hierarchy. A mass storage device, such as a hard drive, stores the programs and data (collectively “data”) which the computer system (not shown) has at its disposal. A subset of that data is loaded into memory such as DRAM for faster access. A subset of the DRAM contents may be held in a cache memory. The cache memory may itself be hierarchical, and may include a level two (L2) cache, and then a level one (L1) cache which holds a subset of the data from the L2. Finally, the physical registers of the processor contain the smallest subset of the data. As is well known, various algorithms may be used to determine what data is stored in what levels of this overall hierarchy. In general, it may be said that the more recently a datum has been used, or the more likely it is to be needed soon, the closer it will be held to the processor.
The presence or absence of valid data at various points in the hierarchical storage structure has implications for another drawback of the prior art processors, including control speculating processors. The various components within their execution cores are designed such that they cannot perform “data speculation”, in which a processor guesses what values data will have (or, more precisely, the processor assumes that presently-available data values are correct and identical to the values that will ultimately result, and uses those values as inputs for one or more operations), rather than which way branches will go. Data speculation may involve speculating that data presently available from a cache are identical to the true values that those data should have, or that data presently available at the output of some execution unit are identical to the true values that will result when the execution unit completes its operation, or the like.
Like control speculating processors' recovery mechanisms, data speculating processors must have some mechanism for recovering from having incorrectly assumed that data values are correct, to maintain program and data integrity as though no data speculation were taking place. Data speculation is made more difficult by the hierarchical storage system, especially when it is coupled with a microarchitecture which uses different clock frequencies for various portions of the execution environment.
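The general idea of recovering from incorrect data speculation can be illustrated in software. The sketch below is a simplified model only and does not represent any particular hardware mechanism. It assumes that each destination name is written once, that operations are listed in program order, and that the true input values become known after the speculative pass. All names (`execute`, `speculate_and_replay`, the operand names) are hypothetical.

```python
import operator

def execute(ops, values):
    """Run ops in program order; each op is (dest, func, source_names)."""
    for dest, func, sources in ops:
        values[dest] = func(*(values[s] for s in sources))
    return values

def speculate_and_replay(ops, guessed, true_inputs):
    """Execute with guessed inputs, then re-execute only the affected ops."""
    values = execute(ops, dict(guessed))      # speculative pass
    bad = {k for k in guessed if guessed[k] != true_inputs[k]}
    values.update(true_inputs)                # true data become available
    replayed = []
    for dest, func, sources in ops:
        if bad & set(sources):                # consumed an incorrect value
            values[dest] = func(*(values[s] for s in sources))
            bad.add(dest)                     # its consumers must replay too
            replayed.append(dest)
    return values, replayed

ops = [
    ("t1", operator.add, ("a", "b")),
    ("t2", operator.mul, ("t1", "c")),
    ("t3", operator.add, ("c", "c")),
]
guessed = {"a": 1, "b": 2, "c": 3}       # "b" was guessed incorrectly
true_inputs = {"a": 1, "b": 5, "c": 3}

values, replayed = speculate_and_replay(ops, guessed, true_inputs)
print(replayed)   # operations that had to be re-executed
print(values)
```

In this model only the operations that directly or transitively consumed the incorrect value are re-executed, and the final state matches what a non-speculative execution with the true inputs would produce.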
It is well-known that every processor is adapted to execute instructions of its particular “architecture”. In other words, every processor executes a particular instruction set, which is encoded in a particular machine language. Some processors, such as the Pentium Pro processor, decode those “macro-instructions” down into “micro-instructions” or “uops”, which may be thought of as the machine language of the micro-architecture and which are directly executed by the processor's execution units. It is also well-known that other processors, such as those of the RISC variety, may directly execute their macro-instructions without breaking them down into micro-instructions. For purposes of the present invention, the term “instruction” should be considered to cover any or all of these cases.
The invention provides a microprocessor having two or more levels of execution sub-core, each clocked at a different frequency. The processor may also have an I/O ring, which may be clocked at yet another frequency. Clock division or multiplication may be used between the various levels, to derive the various clocks from a common clock, such as the I/O clock, which may be provided from off-chip. Having the different clock domains enables the designer to make trade-offs in the design of various components of the chip, such as individual execution units, instruction fetch and decode units, register files, caches, and the like. Thus, selected components can be designed to operate at a very high frequency, without requiring the entire chip to be designed to operate at this frequency. Less latency-critical units, or those whose required throughput can be obtained by twice as many units running at half the clock speed, can be relegated to the slower sections of the chip, easing their design considerably.
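The trade-off between throughput and latency mentioned above can be made concrete with a small sketch. The clock frequencies, unit counts, and function names are hypothetical, and the model assumes that each unit completes one operation per cycle.

```python
def throughput_ops_per_ns(units, clock_ghz, ops_per_cycle_per_unit=1):
    """Aggregate throughput of identical units (1 GHz = 1 cycle per ns)."""
    return units * clock_ghz * ops_per_cycle_per_unit

def latency_ns(cycles, clock_ghz):
    """Time for an operation that takes the given number of cycles."""
    return cycles / clock_ghz

# One fast unit versus two units at half the clock (hypothetical figures).
fast = {"units": 1, "clock_ghz": 2.0}
slow = {"units": 2, "clock_ghz": 1.0}

for name, cfg in (("fast", fast), ("slow", slow)):
    tput = throughput_ops_per_ns(cfg["units"], cfg["clock_ghz"])
    lat = latency_ns(1, cfg["clock_ghz"])
    print(f"{name}: throughput {tput} ops/ns, single-cycle latency {lat} ns")
```

Under these assumptions the two configurations deliver equal throughput, but the single-operation latency doubles at half the clock, which is why only the less latency-critical units are candidates for the slower domain.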