1. Field of the Invention
The present invention relates to the field of microprocessor micro-architecture design and more specifically to a method and apparatus for manipulating in a pipelined processor a data register file having a stack organization.
2. Description of the Related Art
Many microprocessors today operate on data contained in a register file which has a stack organization. Microprocessors such as Intel's 8087, 80287, 80387, and i486 microprocessors have floating point units which store and access data contained in a register file which has a stack organization. Details of the x87 architecture may be found in Microprocessors published by Intel Corp. In the 80x87 architecture, which has a stack organization for its registers, there are eight registers numbered 000 through 111. These registers are never addressed directly by the instruction opcode, but rather are addressed as a stack. At any time one register is referred to as the top of stack register or TOS register. The corresponding register number (address between 000 and 111) is stored in a 3-bit field called TOS (top of stack) in a 16-bit register called the Status Word register. An instruction always has its first source operand implicitly in the stack top. The address of the second operand is specified as a 3-bit value or index off the stack top, so that the actual register number has to be obtained by adding this 3-bit index to the register number at the stack top.
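The address computation described above can be illustrated with a minimal sketch. The function name and the constant are hypothetical, and the sketch assumes that the sum wraps modulo 8, as a 3-bit result would; the passage itself states only that the index is added to the register number at the stack top.

```python
NUM_REGS = 8  # eight stack registers, numbered 000 through 111


def physical_register(tos: int, index: int) -> int:
    """Return the register number for stack position ST(index).

    Assumption: the 3-bit sum wraps around, so only the low
    three bits of (tos + index) are kept.
    """
    return (tos + index) % NUM_REGS


# With TOS = 110 (register 6), ST0 is register 6 and ST3 wraps to register 1.
st0_reg = physical_register(0b110, 0)
st3_reg = physical_register(0b110, 3)
```

Under this assumption the stack behaves as a circular window over the eight registers, with TOS selecting where the window starts.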
For example, a stack of data may look like the following:
______________________________________
Stack 1
        Data    Stack Position
______________________________________
TOS ->  A       ST0
        B       ST1
        C       ST2
______________________________________
In this example, data A is located at the top of stack. Data A is stored in a data register which is currently considered the top of stack register, or the ST0 register. The present ST0 register is specified by the TOS component of a Status Word register, which contains the address of, or "points to," the data register which is the current top of stack register or ST0 register. Data B is stored in a data register which currently has its position one from the top of stack, or the ST1 register. Data C is located in a register which currently has its position two from the top of stack, or the ST2 register.
When new data is added to the stack, it is "pushed" onto the stack. For example, if X were "pushed" onto stack 1, the new stack would look like the following:
______________________________________
Stack 2
        Data    Stack Position
______________________________________
TOS ->  X       ST0
        A       ST1
        B       ST2
        C       ST3
______________________________________
Now data X is at the top of stack. Data X is stored into a data register which now becomes the top of stack register or the ST0 register. The TOS component of the Status Word register is updated with a new address to reflect the new top of stack register. Data A, B and C each remain in their previously assigned data registers, but each of these data registers is now one stack position lower. Data A, B, and C are now said to be stored, respectively, in the ST1, ST2 and ST3 registers.
When data is removed from the stack register file to main memory, it is said to be "popped" off the stack. The data contained in the data register which is currently at the top of stack position is removed from the stack, and each of the data currently stored in lower stack positions moves up one stack position. For example, if data A were "popped" from stack 1, the stack would look like the following:
______________________________________
Stack 3
        Data    Stack Position
______________________________________
TOS ->  B       ST0
        C       ST1
______________________________________
The register which stores data B is now considered the top of stack register, and the TOS component of the Status Word register is updated with its address to reflect the change. The register which contains data C is now one from the top of stack, or is the ST1 register. It is important to note that all "pushes" and "pops" utilize the top of stack register.
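The push and pop behavior of stacks 1 through 3 can be sketched as a small model in which the data never move between registers and only the TOS pointer changes. The class and method names are hypothetical. The direction in which TOS moves (decrement on push, increment on pop) follows the x87 convention and is an assumption here, since the passage says only that TOS is updated.

```python
class StackRegisterFile:
    """Toy model of an eight-register file addressed as a stack."""

    def __init__(self, tos: int = 0):
        self.regs = [None] * 8   # physical data registers
        self.tos = tos           # 3-bit top of stack pointer

    def st(self, i: int):
        """Read stack position ST(i) relative to the current TOS."""
        return self.regs[(self.tos + i) % 8]

    def push(self, value):
        # Assumption: a push decrements TOS, then writes the new ST0.
        self.tos = (self.tos - 1) % 8
        self.regs[self.tos] = value

    def pop(self):
        # Read ST0, mark it empty, then move TOS to the next register.
        value = self.regs[self.tos]
        self.regs[self.tos] = None
        self.tos = (self.tos + 1) % 8
        return value


# Build stack 1 (A on top of B on top of C).
stack = StackRegisterFile()
for item in ("C", "B", "A"):
    stack.push(item)

# Pushing X gives stack 2; A, B and C stay in their physical registers.
stack.push("X")
stack2 = [stack.st(i) for i in range(4)]

# Undo the push, then pop A to obtain stack 3.
stack.pop()
popped = stack.pop()
stack3 = [stack.st(i) for i in range(2)]
```

In this model a push or pop touches a single register and the pointer, which is why every push and pop involves the top of stack register.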
Instructions operate on operands which are addressed by their relative position on the stack. Instructions address the stack either implicitly or explicitly. An instruction whose opcode provides an explicit register address is said to address that register explicitly. All instructions that have two operands address the stack top implicitly for one operand and explicitly address a register for the second operand.
For example, a simple add instruction may look like:
FADD ST0, ST2
This instruction implicitly addresses the stack top and explicitly addresses the register second from the top of the stack. This instruction adds the contents of the register which is considered to be at the top of the stack, ST0, to the contents of the register which is second from the top of stack, or the ST2 register. The result of the addition by definition is written into the ST0 register. Similarly, a store looks like:
FST
This instruction implicitly stores the data in the ST0 position into main memory. Instructions which provide no operand implicitly utilize the top of stack register.
Prior art microprocessors which access and store data in a register file having a stack organization have stack manipulation units as shown generally in FIG. 1. Provided is a data register file 12 consisting of eight registers (R0-R7) which are accessed by stack addressing. A 16-bit processor Tag Word 14 is provided and indicates if the associated register is full or empty, and if full, whether the operand is special (NaN, infinity, etc.) or not. Working registers R8-RN 13 are also provided. A TOS component 16 of a Status Word register provides a 3-bit address of one of the eight data registers which is currently the top of stack register.
The typical instruction processing algorithm in the prior art entails first moving operands from the stack registers 12 into working registers R8-RN 13, then operating entirely within the space of the execution hardware and working registers 13, and finally transferring results back to the destination stack register.
Instructions are decoded into a stream of microcontrol vectors (or μvectors). Each μvector is divided into a plurality of microcontrol fields, each of which provides control directives for manipulating hardware of the microprocessor. As shown in FIG. 2a, in the prior art there are several fields, namely STKADRS, WRADRS, and STKOP, in each μvector which control the manipulation of the stack.
The STKADRS field 20 supplies an offset (000-111) to address generating logic 24. The offset is added to the TOS component 16 of the Status Word register to generate the physical register address (R0-R7) of an instruction's operand or the physical register address where the instruction's result (destination) is to be written. In the prior art, only a single register (either the source or the destination) can be addressed by each μvector. That is, the STKADRS field in each μvector only provides directives for addressing one register.
The STKOP field 26 provides directives in each μvector for setting and checking the TAGs of the 16-bit processor TAG word 14 which are associated with the register addressed by the STKADRS field of the present μvector. In the prior art, the STKOP field can only check or set the TAGs of a single register, the one presently addressed by the μvector. Additionally, the STKOP field 26 provides directives to the TOS update logic 30 for controlling the updating of the TOS component 16 of the Status Word register with the new address of the physical register which is currently the top of stack register.
The WRADRS field 22 supplies the address of the working register to which the data of the addressed stack register is written. The working registers are used for the manipulation of data during the execution of an instruction. That is, instruction operands are written into specified working registers and operated on, and the result is later transferred to the destination register in the stack register file.
There are two distinct disadvantages with the prior art method and apparatus for stack manipulation. First, in the prior art, instruction latency (the time required to execute an instruction) is too large for modern processors because limited hardware has to be re-used, and because instructions have to be broken up into a plurality of μvectors in order to implement all of the required stack manipulation operations. Second, a "top of stack bottleneck" develops if the prior art stack manipulation is implemented in a microprocessor which has a pipelined micro-architecture.
In the prior art stack manipulation apparatus, each instruction is broken up into several μvectors in order to implement all of the stack manipulation procedures. In the prior art there is only a single read/write port to the stack register file 12. The control circuitry and microcode organization are designed around this single port to the register file. That is, since only a single register can be read or written at a time, each μvector addresses only one register. As such, in the prior art there is only one field, the STKADRS field 20, in each μvector for addressing a stack register. Similarly, the STKOP field 26 only controls the checking or setting of the TAGs of one register at a time, the stack register addressed by the STKADRS field.
For example, as shown in FIG. 2b, a simple instruction such as FADD ST0, ST2 in the prior art requires at least three μvectors to implement all of the required stack manipulation procedures. The first μvector, μV1, supplies in the STKADRS field 20 an offset of zero (000) to the address generating logic 24 to generate the stack physical address of the first operand, ST0 (i.e. zero is added to the TOS address to generate the ST0 address). The STKOP field 26 in the same μvector, μV1, supplies control directives to TAG check/set logic 28 to check the TAG of the first operand. A second μvector, μV2, supplies in the STKADRS field 20 an offset of two (010), which is added to the value in the TOS component 16 of the Status Word register in the address generating logic 24 to generate the stack register address of the second operand, ST2. The second μvector, μV2, supplies in the STKOP field 26 control directives for checking the TAG of this addressed register to see whether or not the register contains valid data. And finally a third μvector, μV3, provides in the STKADRS field an offset of zero (000), which is used to generate the destination register address for the result of the add. The third μvector, μV3, supplies in the single STKOP field directives for setting the TAG of the addressed register to full, and for updating the TOS component of the Status Word register.
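The three-μvector sequence described above can be sketched as data. The record layout, the directive names in the `stkop` field, and the helper function are all hypothetical and chosen only for illustration; only the STKADRS offsets (000, 010, 000) and the one-register-per-μvector restriction come from the description. The modulo-8 wrap is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class MicroVector:
    stkadrs: int  # 3-bit offset added to TOS
    stkop: str    # hypothetical name for the tag/TOS directive


# Prior art decomposition of FADD ST0, ST2: one register per μvector.
FADD_ST0_ST2 = [
    MicroVector(0b000, "check_tag"),            # μV1: first operand, ST0
    MicroVector(0b010, "check_tag"),            # μV2: second operand, ST2
    MicroVector(0b000, "set_full_update_tos"),  # μV3: destination, ST0
]


def addressed_registers(tos: int, vectors: list) -> list:
    """Physical register addressed by each μvector, in order."""
    return [(tos + v.stkadrs) % 8 for v in vectors]


# With TOS pointing at register 5, the μvectors address R5, R7, then R5.
sequence = addressed_registers(5, FADD_ST0_ST2)
```

The sketch makes the cost visible: a two-operand instruction with one destination needs three sequential μvectors because each one can name only a single stack register.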
Thus, in the prior art, the STKOP field can only check or set the TAGs of one register at a time, the register currently addressed by the STKADRS field. For every instruction in the prior art, at least one μvector is required for each operand/destination in order to implement the required stack manipulation procedures. Such large instruction latency is unacceptable in microprocessors where high execution speed is desired. Microprocessors which require high speed instruction execution cannot use the prior art method of manipulating the stack. Thus, what is desired is a method and apparatus for manipulating the stack of a stack organized microprocessor with a single μvector, so that instruction latency can be dramatically decreased.
Another disadvantage with the prior art method of stack manipulation is the "top of stack bottleneck" which develops when the prior art scheme is utilized in a pipelined processing unit. In processors having a stack organization, the top of stack register (ST0) is the most heavily used register. This is because all single operand instructions operate upon the ST0 register and replace it with a result. Two operand instructions always use the ST0 register for one of the two operands, while the second operand is accessed via an index added to the top of stack address. The results of two operand instructions are written back to either the ST0 register or the other register. All loads from memory load into the ST0 register. Likewise, all stores to memory read the operand from the ST0 register. Thus, the ST0 register sees the most traffic in a register file having a stack organization.
Thus, before any "single operand" operation can be performed on an operand, the operand must be brought to the stack top. Additionally, for any two operand instructions, one operand must be brought to the stack top. Likewise, before an operand in the register may be stored to memory it must also be brought to the stack top. A programmer, therefore, in the x87 architecture judiciously uses the FXCH instruction to deploy his operands where desired. The FXCH instruction exchanges data between the top of stack register and a second register.
For example, assume a programmer wishes to execute an ADD of the top two numbers in the stack, and after the ADD, wishes to store to memory the operand which is presently contained in the third level of the stack. A programmer would normally execute an FXCH ST0, ST2 instruction between the two instructions to free up the top of stack register. This would generate the following instruction stream:
FADD ST0, ST1
FXCH ST0, ST2
FST
The first instruction adds the contents of the ST0 register to the ST1 register and stores the result in the ST0 register. The second instruction exchanges the contents of the ST0 register with the contents of the ST2 register. In this way, the former content of the ST2 register is now available in the ST0 register. The third instruction, FST, can then store the operand originally in the third stack position to external memory.
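The effect of this three-instruction stream can be traced with a minimal sketch that models an FXCH which exchanges the actual data. The function names, the fixed TOS value, and the operand values are hypothetical and serve only to make the data movement visible.

```python
def fadd(regs, tos, i):
    """ST0 <- ST0 + ST(i)."""
    regs[tos] = regs[tos] + regs[(tos + i) % 8]


def fxch(regs, tos, i):
    """Exchange the data held in ST0 and ST(i)."""
    j = (tos + i) % 8
    regs[tos], regs[j] = regs[j], regs[tos]


def fst(regs, tos):
    """Return the value that would be stored to memory from ST0."""
    return regs[tos]


regs = [0.0] * 8
tos = 0
regs[0], regs[1], regs[2] = 1.5, 2.5, 10.0  # ST0, ST1, ST2 (example values)

fadd(regs, tos, 1)       # ST0 = 1.5 + 2.5
fxch(regs, tos, 2)       # sum moves to ST2, old ST2 moves to ST0
stored = fst(regs, tos)  # stores the operand originally in ST2
```

After the stream runs, the value originally in the third stack position has been stored and the sum sits in ST2, which is the state the FXCH was inserted to produce.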
In order to increase instruction throughput, modern processors have pipelined execution units. In a pipelined processor, as shown in FIG. 3, the execution of an instruction is distributed into a plurality of stages. Each stage takes only a fraction of the execution time necessary to execute a complete instruction. Instructions enter the pipeline at the first stage and proceed to a subsequent stage during each clock cycle until they reach the last stage. Ideally, in pipelined processors, a new instruction enters the execution pipeline every clock. In this way, multiple instructions are overlapped in execution at one time. Although instructions still require several clocks to execute, instruction throughput (the rate at which instructions complete execution) is essentially one instruction per clock cycle. Thus ideally, in a pipelined processor, one instruction completes execution every clock cycle. It is important to note, however, that it is not until the last stage of the pipeline that the results of a given instruction are available.
FIG. 3 shows what happens when the prior art stack manipulation technique is implemented in a pipelined processor that exchanges actual data on an FXCH instruction. The instruction stream:
FADD ST0, ST1
FXCH ST0, ST2
FST
cannot be overlapped in execution and a top of stack bottleneck develops. The FADD instruction begins its execution during clock 1 and finishes execution and writes its result to the ST0 register during clock 5. It is not until clock 5 that the result of the FADD instruction is available in the ST0 register. Since the ST0 register does not have the result until clock 5, the processor cannot begin to execute the FXCH instruction until the ST0 register has a value which it can exchange with the ST2 register. As such, the FXCH instruction cannot begin execution at clock 2, as is intended in pipelined processors, but must stall until clock 6, which is when the ST0 register has the result of the FADD. Stalling the FXCH instruction decreases instruction throughput dramatically. In fact, the advantages of pipelining the execution unit are lost when the prior art stack manipulation technique is utilized. This is because instruction throughput is limited by an artificial dependency created by the multiple use of the stack top register, rather than by a true data dependency.
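The stall can be quantified with a rough timing sketch. It assumes a five-clock pipeline (FADD enters at clock 1 and writes ST0 in clock 5, as stated above) and further assumes that every instruction in the stream takes the same five clocks and waits on the ST0 value written by its predecessor. The function name is hypothetical, and the FST start clock it yields is a consequence of these assumptions rather than a figure given in the description.

```python
PIPELINE_DEPTH = 5  # assumption: result is written in the fifth clock


def start_clocks(n_instructions: int, depends_on_previous: bool) -> list:
    """Clock in which each instruction enters the pipeline."""
    starts = []
    for k in range(n_instructions):
        if k == 0:
            starts.append(1)
        elif depends_on_previous:
            # Wait until the previous instruction has written ST0.
            starts.append(starts[-1] + PIPELINE_DEPTH)
        else:
            # Ideal pipelining: one new instruction per clock.
            starts.append(starts[-1] + 1)
    return starts


ideal = start_clocks(3, depends_on_previous=False)    # FADD, FXCH, FST overlap
stalled = start_clocks(3, depends_on_previous=True)   # serialized through ST0
```

Under these assumptions the ideal stream issues in clocks 1, 2 and 3, while the serialized stream issues in clocks 1, 6 and 11, so the pipeline delivers no overlap at all for this sequence.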
Thus, what is needed is a method and apparatus for stack manipulation in a processor having a pipelined execution unit wherein instruction latency due to stack manipulation is decreased and where FXCH instructions can be overlapped in execution with other instructions.