1. Field of the Invention
The present invention relates to parallel processing microprocessors, and more particularly, to a layout of a parallel processing microprocessor for carrying out in parallel a plurality of processes in response to a plurality of corresponding instructions.
2. Description of the Background Art
FIG. 9 shows a layout of an example of a conventional microprocessor. Referring to FIG. 9, the microprocessor includes one functional unit 1 for processing data of 32 bits and a 32-bit register file 5. Register file 5 can store data of 32 bits. Functional unit 1 processes the 32-bit data stored in register file 5 and writes back resultant data of 32 bits to register file 5.
Functional unit 1 is formed of 32 processing circuits 100-131. Register file 5 is formed of 32 storage elements 500-531. Each storage element can store data of 1 bit. Each processing circuit processes 1-bit data stored in the corresponding storage element and writes back the 1-bit resultant data to the corresponding storage element. For example, a 0th bit portion 100 of functional unit 1 processes 1-bit data stored in a 0th bit portion 500 of register file 5 and writes back the 1-bit resultant data to 0th bit portion 500 of register file 5. A first bit portion 101 of functional unit 1 processes 1-bit data stored in a first bit portion 501 of register file 5 and writes back the 1-bit resultant data to first bit portion 501 of register file 5. Second bit portions 102, 502 through 31st bit portions 131, 531 perform similar operations as bit portions 100, 500 and 101, 501.
Functional unit 1 is pipelined, and thus, before completing the process of one instruction, it can start processing the next instruction. In functional unit 1, each bit portion is divided into three pipeline stages 10, 20 and 30.
For example, second bit portion 102 consists of an execution stage 10, a memory stage 20 and a writeback stage 30. Execution stage 10 includes a register 11 which can store 1-bit data provided from second bit portion 502 of register file 5 through a tristate buffer 41, a register 12 which can store 1-bit data provided from second bit portion 502 of register file 5 through a tristate buffer 42, and a logic circuit 13 formed by an ALU (Arithmetic and Logic Unit) for performing a logic operation of the data stored in registers 11 and 12. Memory stage 20 includes a register 21 which can store resultant data provided from logic circuit 13 in execution stage 10, and a logic circuit 22 for performing a logic operation of the data stored in register 21. Writeback stage 30 includes a register 31 which can store resultant data provided from logic circuit 22 in memory stage 20. Although only the pipeline stages of second bit portion 102 are shown in FIG. 9, other bit portions are pipelined similarly to second bit portion 102.
FIG. 10 is a timing chart showing pipeline operations of the microprocessor shown in FIG. 9. In this microprocessor, one instruction is sequentially processed in five pipeline stages.
Referring to FIGS. 9 and 10, in a first instruction fetch stage IF, one instruction is fetched from a memory (not shown) to an instruction decoder (not shown). In a second instruction decode stage ID, the fetched instruction is decoded by the instruction decoder, and in response to the decoded instruction, data is read out from register file 5 through tristate buffers 41 and 42 to registers 11 and 12. In a third execution stage EXC(10), logic circuit 13 executes the instruction and performs a logic operation of the data stored in registers 11 and 12. The resultant data of the logic operation is stored in register 21. In a fourth memory stage MEM(20), logic circuit 22 performs a logic operation of the data stored in register 21. For some instructions, data is read out from a memory. The resultant data of logic circuit 22 is stored in register 31. In a fifth writeback stage WB(30), data stored in register 31 is written back to register file 5 through a result bus 45.
Referring to FIG. 10, an instruction i1, for example, is fetched in the nth cycle. Subsequently, in the (n+1) cycle, the fetched instruction i1 is decoded, and at the same time, the next instruction i2 is fetched. Thereafter, in the (n+2) cycle, instruction i1 is executed and instruction i2 is decoded simultaneously. Also at the same time, the next instruction i3 is fetched. In the (n+3) cycle, responsive to instruction i1, the corresponding data is read out from a memory. At the same time, instruction i2 is executed and instruction i3 is decoded. Simultaneously, the next instruction i4 is fetched.
In the (n+4) cycle, the resultant data of instruction i1 is written back to register file 5. Simultaneously with this writeback operation, in response to instruction i2, the corresponding data is read out from a memory, instruction i3 is executed, and instruction i4 is decoded. In the (n+5) cycle, the resultant data of instruction i2 is written back to register file 5. At the same time, responsive to instruction i3, the corresponding data is read out from a memory and instruction i4 is executed. In the (n+6) cycle, the resultant data of instruction i3 is written back to register file 5, and simultaneously, the corresponding data is read out from the memory in response to instruction i4. In the (n+7) cycle, the resultant data of instruction i4 is written back to register file 5.
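The cycle-by-cycle overlap described above can be reproduced with a short sketch. It assumes an ideal pipeline with no halts, in which instruction i1 is fetched in the nth cycle and each later instruction is fetched one cycle after its predecessor. The names `STAGES` and `pipeline_chart` are illustrative and do not appear in FIG. 10.

```python
# Five pipeline stages in the order an instruction passes through them.
STAGES = ["IF", "ID", "EXC", "MEM", "WB"]

def pipeline_chart(num_instructions):
    """Return {cycle_offset: {instruction: stage}} for a stall-free pipeline.

    cycle_offset 0 corresponds to the nth cycle, 1 to the (n+1) cycle, etc.
    """
    chart = {}
    for k in range(num_instructions):        # instruction i(k+1) is fetched at offset k
        for s, stage in enumerate(STAGES):   # and advances one stage per cycle
            chart.setdefault(k + s, {})[f"i{k + 1}"] = stage
    return chart

if __name__ == "__main__":
    for offset, row in sorted(pipeline_chart(4).items()):
        label = "n" if offset == 0 else f"n+{offset}"
        stages = ", ".join(f"{name}={stage}" for name, stage in row.items())
        print(f"{label:>4}: {stages}")
```

Running the sketch for four instructions lists one row per cycle from n through n+7, which can be compared against the description of FIG. 10.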
Here, in writeback stage WB, the resultant data is written back to register file 5 in a first half of one cycle. In instruction decode stage ID, data is read out from register file 5 in the second half of one cycle. For example, the resultant data obtained by executing instruction i1 is written back to register file 5 in the first half of the (n+4) cycle. When instruction i4 is to use this resultant data, the resultant data is read out in the second half of the (n+4) cycle from register file 5 in accordance with instruction i4. In this example, instruction i4 is executed in the (n+5) cycle without a temporary halt.
However, when instruction i2 is to use the executed result of instruction i1, the resultant data of instruction i1 has not yet been written into register file 5 in the (n+2) cycle, in which instruction i2 reads out data from register file 5. Therefore, if data is read out from register file 5 at this point, the read-out data is the old value held before the resultant data is written. As a result, proper operation is not carried out.
Similarly, when instruction i3 is to use the executed result of instruction i1, the resultant data of instruction i1 has not yet been written into register file 5 in the (n+3) cycle, in which instruction i3 reads out data from register file 5. As a result, if data is read out from register file 5, it is the old value held before the resultant data is written, and proper operation is not carried out.
Therefore, in order to perform proper processing, execution of instructions i2 and i3 must be halted until the executed result of instruction i1 is written into register file 5, which lowers the processing speed of instructions. In this case, the processing of instruction i2 must be halted for two cycles, and that of instruction i3 must be halted for one cycle.
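The halt counts given above follow from simple cycle arithmetic, sketched below. The sketch assumes that no bypass circuit is present, that writeback occurs in the first half of a cycle and the register read in the second half of the same cycle, and that each dependent instruction is considered on its own (knock-on delays to later instructions are not modeled). The names `stall_cycles`, `WB_OFFSET` and `ID_OFFSET` are illustrative.

```python
WB_OFFSET = 4  # writeback is 4 cycles after fetch (IF, ID, EXC, MEM, WB)
ID_OFFSET = 1  # decode (register read) is 1 cycle after fetch

def stall_cycles(producer_fetch, consumer_fetch):
    """Cycles the consumer must be halted so that it reads the producer's result.

    Because the write happens in the first half of the cycle and the read in
    the second half, a read in the same cycle as the writeback is already valid.
    """
    writeback_cycle = producer_fetch + WB_OFFSET
    read_cycle = consumer_fetch + ID_OFFSET
    return max(0, writeback_cycle - read_cycle)

if __name__ == "__main__":
    # Instruction i1 is fetched at offset 0; i2, i3 and i4 at offsets 1, 2 and 3.
    for name, fetch in (("i2", 1), ("i3", 2), ("i4", 3)):
        print(name, "halted for", stall_cycles(0, fetch), "cycle(s)")
```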
In order to prevent such a decline in processing speed, a bypass circuit is provided for extracting an executed result of an instruction during its passage along the pipeline in the pipelined microprocessor in FIG. 9. The bypass circuit is formed by tristate buffers 14, 15, 23 and 24 and supply buses 43 and 44. Supply bus 43 provides a first source operand to execution stage 10. Supply bus 44 provides a second source operand to execution stage 10.
Now, operation of the bypass circuit will be described. In a general operation in which instruction i4 uses the executed result of instruction i1, what is necessary is to simply read the data already written in register file 5. Thus, tristate buffer 41 is rendered conductive and tristate buffers 14 and 23 are rendered non-conductive. As a result, data in register file 5 is read out through tristate buffer 41 to supply bus 43 and stored in register 11 of execution stage 10.
When instruction i2 is to use the executed result of instruction i1, data which is already in the execution stage may be used. Therefore, tristate buffer 14 is rendered conductive and tristate buffers 41 and 23 are rendered non-conductive. As a result, the resultant data of logic circuit 13 is read out through tristate buffer 14 to supply bus 43 and stored in register 11 of execution stage 10. The bypass circuit can thus transfer the result executed in execution stage 10 of instruction i1 to execution stage 10 of instruction i2.
When instruction i3 is to use the executed result of instruction i1, the data which is already in memory stage 20 may be used. Therefore, tristate buffer 23 is rendered conductive and tristate buffers 41 and 14 are rendered non-conductive. Consequently, the resultant data of logic circuit 22 is read out through tristate buffer 23 to supply bus 43 and stored in register 11 of execution stage 10. The bypass circuit can thus transfer data in memory stage 20 of instruction i1 to execution stage 10 of instruction i3.
Supply bus 44 operates in the same manner as supply bus 43. It transfers data extracted from execution stage 10 through tristate buffer 15 to register 12 in execution stage 10, and likewise transfers data extracted from memory stage 20 through tristate buffer 24 to register 12 in execution stage 10.
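The selection among the register file, the execution stage and the memory stage described above may be summarized by the following sketch. It assumes that the choice depends only on how many instructions separate the consumer from the producer (1 for i2 using i1, 2 for i3 using i1, 3 or more for i4 using i1). The names `BUFFERS`, `select_source` and `buffer_states` are illustrative; the buffer numerals are those of FIG. 9.

```python
# Tristate buffers that can drive each supply bus, keyed by data source.
BUFFERS = {
    43: {"register_file": 41, "execution": 14, "memory": 23},
    44: {"register_file": 42, "execution": 15, "memory": 24},
}

def select_source(distance):
    """Pick the data source from the producer-to-consumer instruction distance."""
    if distance < 1:
        raise ValueError("the consumer must follow the producer")
    if distance == 1:
        return "execution"      # result is still at the output of logic circuit 13
    if distance == 2:
        return "memory"         # result is at the output of logic circuit 22
    return "register_file"      # result has already been written back

def buffer_states(supply_bus, distance):
    """Return {buffer_number: is_conductive} for the given supply bus (43 or 44)."""
    source = select_source(distance)
    return {buf: name == source for name, buf in BUFFERS[supply_bus].items()}

if __name__ == "__main__":
    for distance in (1, 2, 3):
        print(distance, buffer_states(43, distance))
```

Exactly one buffer per supply bus is conductive in every case, so the bus is never driven by two sources at once.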
The above description relates to a microprocessor including one functional unit 1. Now, a microprocessor including a plurality of functional units will be described in detail.
One example of a microprocessor including a plurality of functional units is a VLIW (Very Long Instruction Word) machine, which is a kind of superscalar processor. A microprocessor including a plurality of functional units can process a plurality of instructions at the same time. FIGS. 11 and 12 show possible layouts of interconnections coupling one register file with a plurality of functional units in such a microprocessor. The layouts of FIGS. 11 and 12 are shown only to clarify the objects of the present invention, and they are not admitted prior art.
FIG. 11 shows a possible layout of interconnections coupling one register file with a plurality of unpipelined functional units.
Referring to FIG. 11, the microprocessor includes one register file 5 and four functional units 1-4. Each functional unit consists of 32 processing circuits. Functional unit 1 consists of 0th-31st bit portions 100-131. Functional unit 2 consists of 0th-31st bit portions 200-231. Functional unit 3 consists of 0th-31st bit portions 300-331. Functional unit 4 consists of 0th-31st bit portions 400-431.
In order to process the corresponding 1-bit data in register file 5, each bit portion is coupled to the bit portion of register file 5 in which the data is stored. 0th bit portion 500 of register file 5 is connected to 0th bit portions 100, 200, 300 and 400 of the four functional units 1-4 through supply buses 43 and 44. First bit portion 501 of register file 5 is connected to first bit portions 101, 201, 301 and 401 of the four functional units 1-4 through supply buses 43 and 44. The 31st bit portion 531 of register file 5 is connected to 31st bit portions 131, 231, 331, and 431 of the four functional units 1-4 through supply buses 43 and 44. The second through 30th bit portions (not shown) of register file 5 are connected in a similar manner. Only the 2-bit supply buses 43 and 44 are shown in FIG. 11, and the bus corresponding to result bus 45 in FIG. 9 is not shown here. As is apparent from FIG. 11, such an interconnecting method would require an enormous number of interconnections and a complicated interconnection layout.
FIG. 12 shows a possible layout of interconnections for connecting one register file with a plurality of functional units in a pipelined microprocessor.
Referring to FIG. 12, the microprocessor includes one register file 5; four functional units 1-4; and bypass circuits 50a, b-53a, b; 54-57; 58a, b-61a, b; 62-65; and 66a, b-81a, b for connecting register file 5 with functional units 1-4 and functional units 1-4 with one another.
The 0th bit portion 500 of register file 5 and 0th bit portions 100, 200, 300 and 400 of functional units 1-4 are connected through supply buses 50a, b-53a, b. Functional units 1-4 are connected with one another through extraction buses 58a, b-61a, b, transfer buses 54-57, and supply buses 50b-53b.
For example, when data is to be transferred from register file 5 to functional units 1-4, tristate buffers 62-65 are rendered conductive. As a result, data in 0th bit portion 500 of register file 5 is transferred to 0th bit portion 100 of functional unit 1 through supply bus 50a, tristate buffer 62, and supply bus 50b. Data in 0th bit portion 500 is transferred to 0th bit portion 200 of functional unit 2 through supply bus 51a, tristate buffer 63, and supply bus 51b. Data is transferred to functional units 3 and 4 in a similar manner.
As another example, if data is to be transferred from the execution stage of functional unit 1 to functional unit 2, only tristate buffer 67a is rendered conductive, whereby data in the execution stage of 0th bit portion 100 is transferred to 0th bit portion 200 of functional unit 2 through extraction bus 58a, tristate buffer 67a, transfer bus 55 and supply bus 51b.
If data is to be transferred from the execution stage of 0th bit portion 200 in functional unit 2 to 0th bit portion 100 of functional unit 1, only tristate buffer 70a is rendered conductive, and data is transferred from the execution stage of 0th bit portion 200 in functional unit 2 to 0th bit portion 100 of functional unit 1 through extraction bus 59a, tristate buffer 70a, transfer bus 54 and supply bus 50b.
When data is to be transferred from the memory stage in 0th bit portion 100 of functional unit 1 to 0th bit portion 300 of functional unit 3, only tristate buffer 68b is rendered conductive, and data is transferred from the memory stage in 0th bit portion 100 of functional unit 1 to 0th bit portion 300 of functional unit 3 through extraction bus 58b, tristate buffer 68b, transfer bus 56 and supply bus 52b.
When data is to be transferred from the execution stage in 0th bit portion 100 of functional unit 1 to 0th bit portion 100 itself, only tristate buffer 66a is rendered conductive, and data is transferred from execution stage in 0th bit portion 100 through extraction bus 58a, tristate buffer 66a, transfer bus 54 and supply bus 50b to 0th bit portion 100.
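The transfer examples above follow a regular numbering pattern, which the following sketch makes explicit for the 0th bit portions. The pattern is inferred from the examples given and from the numeral ranges 50-53, 54-57, 58-61, 62-65 and 66-81; it is an assumption rather than a statement of how FIG. 12 labels every element. The function names `register_file_path` and `bypass_path` are illustrative.

```python
def _check_unit(unit):
    if unit not in (1, 2, 3, 4):
        raise ValueError("functional unit must be 1, 2, 3 or 4")

def register_file_path(dst_unit):
    """Path from 0th bit portion 500 of register file 5 to a functional unit."""
    _check_unit(dst_unit)
    d = dst_unit - 1
    return [f"supply bus {50 + d}a", f"tristate buffer {62 + d}", f"supply bus {50 + d}b"]

def bypass_path(src_unit, src_stage, dst_unit):
    """Path from a pipeline stage of src_unit to the 0th bit portion of dst_unit.

    src_stage is "execution" (suffix a) or "memory" (suffix b).
    """
    _check_unit(src_unit)
    _check_unit(dst_unit)
    suffix = {"execution": "a", "memory": "b"}[src_stage]
    s, d = src_unit - 1, dst_unit - 1
    return [
        f"extraction bus {58 + s}{suffix}",
        # Assumed numbering: four buffers per source unit, one per destination.
        f"tristate buffer {66 + 4 * s + d}{suffix}",
        f"transfer bus {54 + d}",
        f"supply bus {50 + d}b",
    ]

if __name__ == "__main__":
    print(register_file_path(1))
    print(bypass_path(1, "execution", 2))
    print(bypass_path(1, "memory", 3))
```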
Interconnections for transferring data from first bit portion 501 of register file 5 to first bit portions 101, 201, 301 and 401 of functional units 1-4 are not shown in FIG. 12. Interconnections for transferring data among first bit portions 101, 201, 301 and 401 of functional units 1-4 are not shown. Also, interconnections for transferring data from other bit portions of register file 5 to other bit portions of functional units 1-4, and interconnections for transferring data among other bit portions of functional units 1-4 are not shown. Only interconnections for supplying a first source operand to each bit portion are shown, and interconnections for supplying a second source operand are not shown. Although the widths of bit portions 100-131, 200-231, 300-331 and 400-431 appear partially unequal in FIG. 12, all widths are actually equal.
Here, a silicon area S required for a bypass circuit can be expressed by the following equation (1):

$$S = (\text{number of functional units} \times \text{number of source operands} \times \text{number of bits}) \times \text{occupied area per interconnection} \tag{1}$$
The value obtained by dividing silicon area S by an occupied area per interconnection is defined hereinafter as "interconnection cost". Since this microprocessor has four functional units with two source operands and 32 bits, interconnection cost thereof will be 256.
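The interconnection cost stated above can be checked with a short calculation. The sketch below simply evaluates equation (1); the function names and the example area value passed to `silicon_area` are illustrative assumptions and are not taken from the description.

```python
def interconnection_cost(num_functional_units, num_source_operands, num_bits):
    """Silicon area S divided by the occupied area per interconnection."""
    return num_functional_units * num_source_operands * num_bits

def silicon_area(num_functional_units, num_source_operands, num_bits,
                 area_per_interconnection):
    """Equation (1): interconnection cost times occupied area per interconnection."""
    cost = interconnection_cost(num_functional_units, num_source_operands, num_bits)
    return cost * area_per_interconnection

if __name__ == "__main__":
    # Four functional units, two source operands, 32 bits.
    print(interconnection_cost(4, 2, 32))
    # Hypothetical area per interconnection of 1.5 (arbitrary units).
    print(silicon_area(4, 2, 32, 1.5))
```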
Meanwhile, in a microprocessor including one functional unit, supply buses 43 and 44 are required for coupling register file 5 and functional unit 1. However, these supply buses 43 and 44 can be formed on other circuits if multi-layered interconnection technology for LSIs is used, because widths thereof are sufficiently smaller than that of one bit portion, so that interconnection cost will be zero.