1. Field of the Invention
The present invention relates to an apparatus and a method for executing instructions, which instructions are provided to executing units incorporated in an information processor conforming to superscalar or out-of-order instruction execution.
2. Description of the Related Art
FIG. 6 of the accompanying drawings is a block diagram schematically showing a conventional apparatus for executing instructions, which apparatus is incorporated in a general information processor. The apparatus for executing instructions of FIG. 6 comprises instruction cache memory 1, instruction buffer (hereinafter called I-BUFFER where I represents “instruction”) 2, instruction controller 3, decoding unit 4, reservation stations 5, executing units 6, result registers 7 and operand registers 8.
Instruction cache memory 1 copies and retains a part of instructions stored in the main memory.
I-BUFFER 2 temporarily stores a group of instructions (up to a maximum of 48 instructions), each fetched from instruction cache memory 1 in accordance with an instruction from an instruction address creating section (see symbol 20 in accompanying drawing FIG. 1). I-BUFFER 2 further issues the stored instructions to downstream decoding unit 4 under the control of instruction controller 3, which is described later.
Instruction controller 3 controls issuing of instructions from I-BUFFER 2 to decoding unit 4 by instructing I-BUFFER 2 to issue an instruction stored in I-BUFFER 2 to decoding unit 4. Since decoding unit 4 of the apparatus of FIG. 6 is divided into four decoders, instruction controller 3 instructs I-BUFFER 2 to simultaneously issue to decoding unit 4 four instructions, which is the maximum that decoding unit 4 can receive at one time. In order to take the fullest advantage of superscalar instruction execution, instruction controller 3 normally instructs I-BUFFER 2 such that I-BUFFER 2 issues instructions of the maximum number (four instructions in the apparatus of FIG. 6) that decoding unit 4 can receive at the same time while I-BUFFER 2 retains an instruction.
Decoding unit 4 is divided into four decoders D0, D1, D2 and D3, which receive and decode, in parallel, instructions simultaneously issued from I-BUFFER 2. Upon receipt of an instruction from I-BUFFER 2, each decoder extracts from the received instruction an op-code, which is a code that makes an information processor (an apparatus for executing instructions) recognize the contents of the instruction, and an operand, which indicates the object of the instruction.
Two reservation stations 5 are named RSEA (reservation station for execution A) and RSEB (reservation station for execution B). Each of RSEA and RSEB stores up to a maximum of eight instructions that decoding unit 4 has decoded. Each decoded instruction is stored in RSEA or RSEB until the instruction is to be executed in executing unit 6. Upon completion of an arithmetic operation in executing unit 6 based on an instruction, the immediate subsequent instruction is sent to the same executing unit 6 for execution.
Two executing units 6 are named EXA (execution unit A) and EXB (execution unit B). EXA and EXB sequentially perform arithmetic operations on the basis of instructions stored in RSEA and RSEB, respectively. Results of such arithmetic operations are sent to result registers 7 downstream.
A result of an arithmetic operation by executing unit 6 (EXA or EXB) is written in result register 7 (indicated by “RR” in accompanying drawings) and one result register 7 is installed downstream of each executing unit 6.
When a result of an arithmetic operation by executing unit 6 is input to the same executing unit 6 for a future arithmetic operation (that is, the result of an arithmetic operation performed by EXA is input to EXA, or the result of an arithmetic operation performed by EXB is input to EXB), the cross bypasses which connect the two executing units 6 are not used. In this case, the result obtained by executing unit 6 is written in the associated result register 7 and is immediately input to the same executing unit 6 through route 10 of FIG. 6 in order to be used for another arithmetic operation.
On the other hand, when a cross bypass is used so that a result of an arithmetic operation performed by one executing unit 6 is input to the other executing unit 6 (that is, the result of an arithmetic operation performed by EXB is input to EXA, or the result of an arithmetic operation performed by EXA is input to EXB), the result obtained by the first-named executing unit 6 is written in the associated result register 7 and further in associated operand register 8 (indicated by OPR in accompanying drawings) through route 9 of FIG. 6 and, after that, is input to the other executing unit 6 in order to be used for another arithmetic operation.
Namely, when a cross bypass is used, a result of arithmetic operation by one executing unit 6 is temporarily written in operand register 8 through route 9 of FIG. 6 after being written in the associated result register 7 and then is sent to the other executing unit 6 to be used for another arithmetic operation.
Cross bypasses represent routes, through each of which a result of arithmetic operation obtained in one executing unit 6 is input to another executing unit 6 because the latter executing unit 6 requires the result in order to execute a future arithmetic operation. Cross bypasses are therefore routes 9 in FIG. 6.
Basically in the apparatus for executing instructions of FIG. 6, instructions decoded in decoder D0 or D2 are dispatched to and temporarily stored in RSEA and then output to EXA while instructions decoded in decoder D1 or D3 are dispatched to and temporarily stored in RSEB and then output to EXB. An instruction for a destination of branching is always decoded in decoder D0. Instruction controller 3 controls I-BUFFER 2 such that I-BUFFER 2 issues the maximum four instructions to decoders D0 through D3 (decoding unit 4) from I-BUFFER 2 as long as I-BUFFER 2 retains instructions.
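The fixed dispatch rule described above can be illustrated with a minimal Python sketch. The names DECODER_TO_STATION, STATION_TO_UNIT and dispatch are hypothetical and are introduced only for illustration; they are not part of the apparatus of FIG. 6.

```python
# Illustrative model of the fixed dispatch rule of FIG. 6.
# All identifiers here are hypothetical names chosen for this sketch.
DECODER_TO_STATION = {"D0": "RSEA", "D2": "RSEA", "D1": "RSEB", "D3": "RSEB"}
STATION_TO_UNIT = {"RSEA": "EXA", "RSEB": "EXB"}


def dispatch(decoder):
    """Return (reservation station, executing unit) for an instruction
    decoded by the given decoder."""
    station = DECODER_TO_STATION[decoder]
    return station, STATION_TO_UNIT[station]


for d in ("D0", "D1", "D2", "D3"):
    print(d, "->", dispatch(d))
```

Because the mapping depends only on the decoder, two consecutive instructions issued together are sent to different executing units, which is what later makes cross bypasses likely.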
Those skilled in the art conceive that I-BUFFER 2 should issue a maximum number of instructions that can be simultaneously issued in order to take the fullest advantage of superscalar instruction execution when a conventional method for executing instructions is performed. For that reason, the apparatus of FIG. 6, as described above, issues to decoding unit 4 a maximum number of instructions (four instructions in this apparatus) that can be issued at one time as long as I-BUFFER 2 retains instructions.
Further, the apparatus of FIG. 6 adopts split queuing, which stores groups of instructions, each group being dedicated to one of executing units 6, before each of the instructions is executed by the dedicated executing unit 6. At this time, if one executing unit 6 uses a result of an arithmetic operation performed by the other executing unit 6 for a future arithmetic operation (i.e., a cross bypass between executing units 6 is used), the result is input to the executing unit 6 that uses it through result register 7 and operand register 8. As a consequence, completing the instruction execution with a cross bypass takes one control time period (hereinafter called 1 τ) longer, for transmitting the result through operand register 8, than when one executing unit 6 uses the result obtained by the same executing unit 6 (i.e., when a cross bypass is not used for a future arithmetic operation).
Such usage of cross bypasses tends to occur when a maximum number of instructions that can be issued at the same time are issued and two executing units 6 execute instructions in parallel. In particular, the usage of cross bypasses during repeated execution of a short loop containing over ten instructions increases the time period required to complete the repeated execution because of the extra time needed to transmit a result obtained by one executing unit 6 to the other executing unit 6.
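The timing penalty of a cross bypass can be expressed as a small Python sketch. The function name forwarding_delay is hypothetical, and the sketch assumes that the same-unit route 10 is the baseline (zero additional control time periods) and that the route through operand register 8 adds exactly 1 τ, which is how the description above is read here.

```python
def forwarding_delay(producer_unit, consumer_unit):
    """Additional control time periods (in units of tau) before the consumer
    can use the producer's result.

    Assumption for this sketch: forwarding within one executing unit
    (route 10) costs nothing extra, while a cross bypass through the
    operand register (route 9) costs one extra control time period.
    """
    return 0 if producer_unit == consumer_unit else 1


print(forwarding_delay("EXA", "EXA"))  # same unit, no cross bypass
print(forwarding_delay("EXA", "EXB"))  # cross bypass through OPR
```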
Detailed operations in which the cross bypasses between executing units 6 are used will now be described with reference to FIGS. 5 and 7.
The group of instructions (1) through (12) listed below is an example of a short loop which causes usage of cross bypasses.
Each field of the left side of table FIG. 5 represents one of decoders D0 through D3 (decoding unit 4), which decodes the instruction of the short loop indicated at the left side, and one of EXA and EXB (executing units 6), which executes the same instruction, in relation to the apparatus of FIG. 6. An instruction with a symbol “-” in FIG. 5 is processed without execution performed in executing units 6 (i.e., such an instruction is a load or branch instruction).
FIG. 7 shows the relationship between time (control time period τ) and each process of executing the instructions in the short loop (instructions (1) through (12)) when the apparatus of FIG. 6 repeats the execution of the short loop three times. In FIG. 7, numbers in brackets on the left edge indicate the numbers of the instructions (1) to (12), and fields of the left-side column represent the contents of the instructions, each corresponding to the instruction indicated by the number at its left. EXA or EXB (executing unit 6) in a bracket at each field of the left-side column executes the instruction in the same field. The numbers 1 through 40 indicate the passage of time (control time periods τ).
The letters “p”, “b”, “a”, “t”, “m”, “r” and “x” represent operations (stages) in executing an instruction: “p” represents priority; “b”, buffer; “a”, address; “t”, TLB/TAG; “m”, match; “r”, result; and “x”, execute. Execution of an instruction undergoes one or more stages.
(1) lduh [% g2+% l4], % g2
(2) subcc % g2, % l0, % g0
(3) bleu,pn % icc, (pc+0x14)
(4) or % g0, % g2, % g5
(5) subcc % o3, % 0x1, % o3
(6) bne,pt % icc, (pc+0xfffffe8c)
(7) add % g5, % l2, % o0
(8) ldub [% o0+% o2], % g2
(9) subcc % g2, % o7, % g0
(10) bne,pt % icc, (pc+0x154)
(11) and % g5, % l1, % g2
(12) sll % g2, 0x1, % g2
First of all, when execution of lduh (load) instruction (1) is started, the instruction is decoded in decoder D1 and waits for a value to be written in address % g2. Upon writing the value, data is loaded from address % g2+% l4 using the value in address % g2 and, after that, the loaded data is written in address % g2. The next subcc (subtraction) instruction (2) is decoded by decoder D2 and waits until the value is written in address % g2. Upon writing the value in address % g2, the subcc instruction (2) is executed by EXA, using the written value.
The third bleu (branch) instruction (3) is decoded by decoder D3, and the subsequent OR (logical sum) instruction (4), which is a delay slot instruction of bleu instruction (3), is decoded in decoder D0 thereby being executed in EXA after the execution of the fifth subcc (subtraction) instruction (5). Since the bleu instruction (3) assigns pn and does not branch, the subcc instruction (5) subsequent to the OR instruction (4) is decoded by decoder D1 and thereby executed by EXB.
bne (branch) instruction (6) is decoded by decoder D2 and add instruction (7), which is a delay slot instruction of the bne instruction (6), is decoded in decoder D3 and thereby executed in EXB. At that time, since the add instruction (7) utilizes the result of execution of the OR instruction (4), the cross bypass from EXA to EXB is used at control time period 8 τ, as shown in FIG. 7. The usage of the cross bypass occurs at control time period 20 τ during the second cycle of the execution of the short loop and at control time period 32 τ during the third cycle for the same reason.
lduh (load) instruction (8), which is an instruction for the destination of branching of the bne instruction (6), is executed, using the result of an arithmetic operation of the add instruction (7). Since the lduh instruction (8) is an instruction for the destination of branching of the bne instruction (6), the lduh instruction (8) is decoded by decoder D0. Data loaded from address % o0+% o2 during execution of the lduh instruction (8) is written in address % g2 at “r” stage. subcc (subtraction) instruction (9) subsequent to the lduh instruction (8) is decoded by decoder D1 and waits until the loaded data is written in address % g2 at the “r” stage of the lduh instruction (8) and, after that, is executed in EXB, using the written data.
While the subcc instruction (9) waits for the writing of the loaded data in address % g2, AND instruction (11), which is a delay slot instruction of bne (branch) instruction (10), and the subsequent sll instruction (12) are executed.
After the AND (logical product) instruction (11) is decoded by decoder D3 and is executed by EXB, the sll instruction (12), which is the destination of the bne instruction (10), is decoded by decoder D0 and thereby executed by EXA. Since EXB executes the AND instruction (11) and EXA executes the sll instruction (12), which uses the result written in address % g2 by the AND instruction (11), the cross bypass from EXB to EXA is used at control time period 11 τ. The usage of the cross bypass occurs for the same reason at control time period 23 τ during the second cycle of the execution of the short loop and at control time period 35 τ during the third cycle.
Upon execution of the sll instruction (12), the lduh instruction (1) again loads data from address % g2+% l4 using the result of the arithmetic operation of the sll instruction (12). The loaded data is written in address % g2, and then execution of subsequent instructions is repeated.
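The decoder assignments given in the above walk-through can be checked against the dispatch rule of FIG. 6 with a minimal Python sketch. All identifiers are hypothetical. The producer and consumer pair (4), (7) is stated in the walk-through; the pair (11), (12) is an assumption inferred from the fact that both instructions operate on % g2.

```python
# Decoders named in the walk-through for instructions that use an executing unit.
DECODED_BY = {2: "D2", 4: "D0", 5: "D1", 7: "D3", 9: "D1", 11: "D3", 12: "D0"}

# Dispatch rule of FIG. 6: D0/D2 feed EXA (via RSEA), D1/D3 feed EXB (via RSEB).
UNIT_OF_DECODER = {"D0": "EXA", "D2": "EXA", "D1": "EXB", "D3": "EXB"}

# (producer, consumer) pairs associated with a cross bypass.
# (4, 7) is stated in the text; (11, 12) is inferred from the shared register.
FORWARDED = [(4, 7), (11, 12)]


def unit(instr):
    """Executing unit that runs the given instruction number."""
    return UNIT_OF_DECODER[DECODED_BY[instr]]


def uses_cross_bypass(producer, consumer):
    """True when the result must travel between the two executing units."""
    return unit(producer) != unit(consumer)


for p, c in FORWARDED:
    print(f"({p}) on {unit(p)}, ({c}) on {unit(c)}: "
          f"cross bypass = {uses_cross_bypass(p, c)}")
```

In both pairs the producer and the consumer land on different executing units, which is consistent with the two cross-bypass usages per loop iteration shown in FIG. 7.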
As shown in FIG. 7, the apparatus for executing instructions of FIG. 6 requires 12 control time periods (12 τ) to complete one execution of the entire short loop having instructions (1) through (12). During one such execution, a cross bypass is used twice. The usage of the cross bypasses prevents the result obtained in one executing unit 6 from being immediately input to the other executing unit 6, thereby requiring a large amount of extra time to complete instruction execution.
In order to avoid such usage of the cross bypasses, only one of the two executing units 6 may execute instructions. This definitely avoids the usage of the cross bypasses; however, the instruction execution then takes a longer time than when the usage of the cross bypasses is allowed, because instructions are not executed in parallel. Parallel instruction execution which can prevent the cross bypass from being used has therefore been demanded in order to shorten the time required for the instruction execution.
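The control time periods at which the cross bypasses occur follow directly from the 12 τ loop period and can be reproduced with a short Python sketch. The constant and function names are hypothetical; the numeric values are those given above for FIG. 7.

```python
LOOP_PERIOD = 12              # control time periods per execution of the short loop
FIRST_CROSS_BYPASS = (8, 11)  # periods of the two cross bypasses in the first cycle


def cross_bypass_periods(iterations):
    """Control time periods at which a cross bypass is used over the given
    number of loop iterations, assuming each iteration repeats the timing
    of the first one shifted by the loop period."""
    return [t + LOOP_PERIOD * k
            for k in range(iterations)
            for t in FIRST_CROSS_BYPASS]


print(cross_bypass_periods(3))  # [8, 11, 20, 23, 32, 35]
```

The printed values match the periods 8 τ, 20 τ and 32 τ given for instruction (7) and 11 τ, 23 τ and 35 τ given for instruction (12).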