1. Field of the Invention
The present invention relates to a microprocessor and, more particularly, to a parallel executing apparatus for load instructions. The apparatus can execute the load instruction, as well as the exchange instruction, in parallel in a dual-pipelined processor. It can also be applied to designs that need to execute a floating point arithmetic instruction and a floating point load instruction simultaneously.
2. Description of Prior Art
In a processor whose data storage has a stack structure, every instruction must use the value at the stack-top as an operand. Accordingly, before a new instruction in the pipeline can execute, its new operand must be brought to the stack-top. There are two ways to do this: the load instruction and the exchange instruction.
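The two ways of bringing an operand to the stack-top can be illustrated with a toy model. This is a minimal sketch, assuming a list whose index 0 is the stack-top; the class and method names (`RegisterStack`, `load`, `exchange`) are hypothetical. A load pushes a new value from memory onto the top, while an exchange only swaps the top with a value already inside the stack.

```python
class RegisterStack:
    """Toy stack-structured register file (hypothetical); regs[0] is the stack-top ST0."""

    def __init__(self, values):
        self.regs = list(values)  # regs[i] holds the value at ST(i)

    def load(self, value):
        # Load: push a new operand from memory onto the stack-top.
        self.regs.insert(0, value)

    def exchange(self, i):
        # Exchange: swap the stack-top with ST(i), bringing an existing operand to the top.
        self.regs[0], self.regs[i] = self.regs[i], self.regs[0]

    def top(self):
        return self.regs[0]


s = RegisterStack(["R0", "R1", "R2", "R3"])
s.exchange(2)
print(s.regs)  # ['R2', 'R1', 'R0', 'R3']
s.load("M")
print(s.regs)  # ['M', 'R2', 'R1', 'R0', 'R3']
```

Note that `exchange` can only reorder values that are already in the stack; only `load` introduces a value from outside it.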
Of these two ways, the conventional technique executes only the exchange instruction in parallel. However, that alone cannot completely remove the artificial dependency at the stack-top.
Let us first examine how the stack structure is used when an instruction executes in a processor with a stack-structured register file, and what effect the stack has on instruction processing.
To execute an instruction, the processor needs one or two source operands to which the operation is applied and one destination operand in which the result of the operation is stored. Some particular instructions take no source operand or no destination operand, but most instructions need both.
In a stack-structured register file, when the CPU assigns registers to the source and destination operands, it names either the stack-top register or a register at a relative offset from the top. Therefore, whenever the stack-top changes, the offsets that name registers below the stack-top must also change.
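The effect of relative addressing can be sketched as follows. This sketch assumes an 8-entry physical register file and a top-of-stack pointer that moves down by one on each push, in the style of x87; these details and all names are assumptions, not taken from the text. A relative name ST(i) is translated to physical register (top + i) mod 8, so when the top moves, the offset that names a given physical register changes.

```python
NUM_REGS = 8  # assumption: an 8-entry register stack


class StackRegisterFile:
    """Hypothetical stack register file addressed by offsets relative to the stack-top."""

    def __init__(self):
        self.phys = [None] * NUM_REGS
        self.top = 0  # physical index of the current stack-top

    def physical(self, i):
        """Translate the relative name ST(i) into a physical register index."""
        return (self.top + i) % NUM_REGS

    def offset_of(self, phys_index):
        """Return the ST(i) offset that currently names a physical register."""
        return (phys_index - self.top) % NUM_REGS

    def push(self, value):
        # Moving the top changes the ST(i) offset of every operand already on the stack.
        self.top = (self.top - 1) % NUM_REGS
        self.phys[self.top] = value

    def pop(self):
        value = self.phys[self.top]
        self.phys[self.top] = None
        self.top = (self.top + 1) % NUM_REGS
        return value


rf = StackRegisterFile()
rf.push("A")
a_phys = rf.physical(0)
print(rf.offset_of(a_phys))  # 0: "A" is named ST0
rf.push("B")
print(rf.offset_of(a_phys))  # 1: same physical register, now named ST1
```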
An especially important point is that an instruction needing a source operand must use the stack-top as a source operand. That is, an instruction with one source operand must use the stack-top register as that operand, and an instruction with two source operands must use the stack-top register as at least one of them. Moreover, most instructions also use the stack-top register as the destination operand in which the result of the operation is stored. Consequently, an instruction can execute only after its required source operand has been moved to the stack-top. As a result, the stack-top becomes a bottleneck in a processor with a stack-structured register file, and the problem to be solved is how to relieve this bottleneck efficiently.
As mentioned above, when an instruction is executed, the data it requires must be at the stack-top. If that data is the result of a previous instruction (i.e., the data has a true dependency), execution must wait until the previous instruction finishes.
If, however, the data is unrelated to the result of the previous instruction, it should be brought to the stack-top as quickly as possible. Even so, execution must wait until the previous operation finishes, because most previous instructions store their results in the stack-top. Such a dependency does not arise from the actual data values; it is an artificial dependency that occurs only because the stack-top must be used. Therefore, to remove the stack-top bottleneck, the processor can execute as if there were no dependency whenever such an artificial dependency occurs. The conventional technology uses a method (U.S. Pat. No. 5,367,650) that efficiently executes the exchange instruction, which exchanges the content of the stack-top with the content of another register in the stack. This method is explained with reference to FIG. 1A to FIG. 1C.
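The distinction between a true and an artificial dependency can be expressed as a simple test. This is a minimal sketch under stated assumptions: each instruction is described only by the data values it consumes and the value it writes to the stack-top, both instructions are assumed to use the stack-top, and the example values assume the FIG. 1B operation where Fmul computes R0 × R1. The names `Instr` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    needs: tuple   # data values the instruction consumes
    produces: str  # data value it writes to the stack-top


def classify(producer: Instr, consumer: Instr) -> str:
    """Classify the stack-top conflict between two consecutive instructions.

    Assumes both instructions use the stack-top: the producer writes its result
    to ST0 and the consumer must read an operand from ST0.
    """
    if producer.produces in consumer.needs:
        return "true"        # the consumer genuinely needs the producer's result
    return "artificial"      # only the shared stack-top slot links them


fmul = Instr("Fmul", ("R0", "R1"), "R0*R1")
fsub = Instr("Fsub", ("R2", "R3"), "R2-R3")
fadd = Instr("Fadd", ("R0*R1", "R2-R3"), "sum")
print(classify(fmul, fsub))  # artificial
print(classify(fsub, fadd))  # true
```

Only the artificial case is a candidate for executing "as if there is no dependency"; a true dependency must still wait for the producer.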
FIG. 1A shows the stack structure and its contents before execution. R0-R3 denote the actual data values, and ST0-ST3 denote the relative offsets that express the relation between the top and the lower registers of the stack. FIG. 1B shows the operation to be executed, which is the sum of [R0×R1] and [R2-R3]. As shown on the right side of FIG. 1B, an operation compiler compiles it into floating point instructions: a multiplication instruction Fmul, an exchange instruction Fxch, a subtraction instruction Fsub, an addition instruction Fadd and a store instruction Fst. Neither the Fmul instruction at "1" nor the Fsub instruction at "3" depends on a prior calculation for its source operand data. However, because the stack structure is used, an artificial dependency exists between the two instructions at the stack-top. Therefore, before the Fsub instruction can execute, data R2 held in ST2 must first be moved to the stack-top by executing the Fxch instruction. When the Fxch instruction is executed sequentially, as shown in FIG. 1C, it stalls in the pipeline until the Fmul instruction completes, and only then are data R2 and the result of the Fmul instruction exchanged. Efficiency therefore drops sharply, because the Fsub instruction must in turn wait until the Fxch instruction finishes.
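The FIG. 1B sequence can be traced on a small stack model. This sketch assumes x87-style operand forms (Fmul ST0, ST1; Fxch ST2; Fsub ST0, ST3; Fadd ST0, ST2), which the text does not spell out, and uses hypothetical latencies purely to show how sequential execution serializes Fmul, Fxch and Fsub through the stack-top.

```python
# Hypothetical latencies in cycles, chosen only for illustration.
LATENCY = {"Fmul": 3, "Fxch": 1, "Fsub": 3}


def run_fig1b(r0, r1, r2, r3):
    """Compute [R0*R1] + [R2-R3] with the assumed FIG. 1B instruction sequence."""
    st = [r0, r1, r2, r3]          # st[i] is ST(i); FIG. 1A contents
    st[0] = st[0] * st[1]          # 1: Fmul ST0, ST1 -> ST0 = R0*R1
    st[0], st[2] = st[2], st[0]    # 2: Fxch ST2      -> R2 moves to the stack-top
    st[0] = st[0] - st[3]          # 3: Fsub ST0, ST3 -> ST0 = R2-R3
    st[0] = st[0] + st[2]          # 4: Fadd ST0, ST2 -> ST0 = (R2-R3) + R0*R1
    return st[0]                   # 5: Fst           -> store the stack-top


def serial_finish_times():
    """Sequential case: each instruction waits for the previous one at the stack-top."""
    t = 0
    finish = {}
    for op in ("Fmul", "Fxch", "Fsub"):
        t += LATENCY[op]
        finish[op] = t
    return finish


print(run_fig1b(2, 3, 10, 4))   # 12, i.e. 2*3 + (10-4)
print(serial_finish_times())    # {'Fmul': 3, 'Fxch': 4, 'Fsub': 7}
```

Under these assumed latencies, the Fsub cannot finish before cycle 7 even though it never uses the Fmul result.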
The prior invention therefore provides a method that executes the Fxch instruction in parallel with the Fmul instruction. However, the result of this method is merely to obtain data R2 of ST2, instead of data R0 of ST0, at the stack-top in FIG. 1C.
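One general way to let an exchange proceed in parallel with an in-flight arithmetic instruction is to map each ST(i) name to a physical register through a table, so that Fxch swaps table entries instead of moving data. The sketch below illustrates that idea under this assumption; it is not claimed to be the exact mechanism of U.S. Pat. No. 5,367,650, and the class `RenamedStack` and its methods are hypothetical.

```python
class RenamedStack:
    """Hypothetical stack whose ST(i) names map to physical registers via a table."""

    def __init__(self, values):
        self.phys = list(values)              # physical register contents
        self.map = list(range(len(values)))   # map[i] = physical register holding ST(i)
        self.pending = set()                  # registers awaiting an in-flight result

    def issue_write_top(self):
        # An arithmetic instruction (e.g. Fmul) will write ST0; mark that register busy.
        p = self.map[0]
        self.pending.add(p)
        return p

    def complete(self, p, value):
        self.phys[p] = value
        self.pending.discard(p)

    def fxch(self, i):
        # Only the mapping changes, so Fxch need not wait for the pending write.
        self.map[0], self.map[i] = self.map[i], self.map[0]

    def read(self, i):
        p = self.map[i]
        if p in self.pending:
            raise RuntimeError(f"ST{i} is not ready yet")
        return self.phys[p]


s = RenamedStack([2.0, 3.0, 10.0, 4.0])   # R0..R3
p_mul = s.issue_write_top()                # Fmul issued; its result is in flight
s.fxch(2)                                  # Fxch executes in parallel with Fmul
diff = s.read(0) - s.read(3)               # Fsub reads R2 and R3 without waiting
s.complete(p_mul, 2.0 * 3.0)               # Fmul finishes later, into the register now named ST2
print(diff, s.read(2))                     # 6.0 6.0
```

The swap works only because R2 is already held in a physical register of the stack; the table can redirect names, but it cannot supply a value that is not yet in the stack.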
The problem with this conventional technique is that it applies only when the source operand of the next operation (data R2 in the example of FIG. 1) already resides inside the stack. When the new operand must instead be brought to the stack-top by a load instruction, the method does not help. Therefore, the prior art method cannot entirely remove the artificial dependency at the stack-top.