As integration density increases, the amount of hardware, such as operation units, that can be mounted in a processor also increases. In a processor that can execute several operations in parallel, such as a superscalar processor or a VLIW (very long instruction word) processor, several operation units are driven in parallel to enhance processing performance. However, to maintain parallel processing performance in such processors, a multi-port register file is required that can simultaneously supply operands to, and simultaneously accept operation results from, as many operation units as are driven at the same time.
For example, the R10000, a superscalar processor made by MIPS corp., employs a register file for integer operations that has 10 ports (7 read ports and 3 write ports) to enable the parallel execution of four instructions (two integer-operation instructions, one load/store instruction and one branch instruction).
When several superscalar processor elements can be mounted owing to a further increase in integration density, a mechanism that enables high-speed access to data shared between the processor elements is necessary to maintain parallel processing performance. In this regard, an effective system is one in which common data are kept in a register file, rather than being stored in a cache or main storage, so that several processor elements can access them. Such a system can be realized by increasing the number of ports of the register file, as in the case of the superscalar processor.
FIG. 1 shows an example of a processor with four superscalar processor elements, each of which can execute two operation instructions in parallel. Referring to FIG. 1, when all processor elements 601 to 604 share data stored in a register file 605, the register file 605 needs up to 24 ports (16 for reading and 8 for writing), because each of the two operation units in each processor element uses two read ports and one write port of the register file 605.
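The port count for the shared register file of FIG. 1 can be checked with simple arithmetic. The following Python sketch is only an illustration: the function and parameter names are assumptions introduced here, and it assumes, as stated above, that every operation unit uses two read ports and one write port.

```python
def shared_register_file_ports(processor_elements, units_per_element,
                               reads_per_unit=2, writes_per_unit=1):
    """Return (read_ports, write_ports, total_ports) for one register file
    shared by all operation units (hypothetical helper for illustration)."""
    units = processor_elements * units_per_element
    read_ports = units * reads_per_unit
    write_ports = units * writes_per_unit
    return read_ports, write_ports, read_ports + write_ports


# FIG. 1: four processor elements with two operation units each.
print(shared_register_file_ports(4, 2))  # (16, 8, 24)
```

With 16 read ports and 8 write ports, the worst-case total for the shared register file is 24 ports.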
In contrast, by restricting the registers accessible from each of the instructions to be executed in parallel, the number of ports of a register file can be decreased while maintaining the number of instructions executed in parallel.
FIG. 2 shows an example of a VLIW machine. Referring to FIG. 2, an instruction group 701 of four instructions executable in parallel is divided into two instruction groups 702, 703, each consisting of two instructions, and register files 704, 705 are assigned separately to processor elements 712, 713 that process these instruction groups. The instruction group 702 executes its operations by using operation units 706, 707 and accesses the register file 704. Similarly, the instruction group 703 executes its operations by using operation units 709, 710 and accesses the register file 705.
When the processor element 713 uses data stored in the register file 704, the data are transferred from the register file 704 through a selector 711 to the register file 705. The selector 711 is controlled to select the output of the operation unit 710 for an ordinary operation instruction, and to select the output of the register file 704 when an inter-register transfer instruction is executed. In like manner, a selector 708 is controlled by an inter-register transfer instruction that transfers data from the register file 705 to the register file 704.
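The role of the selector can be illustrated with a small behavioral model. The Python sketch below is an assumption-laden illustration, not the circuit of FIG. 2: the class `RegisterFile`, the function `select_write_data`, the register count and the register indices are all hypothetical names and values chosen for this example.

```python
class RegisterFile:
    """Toy register file (hypothetical model; the size is an arbitrary assumption)."""

    def __init__(self, size=8):
        self.regs = [0] * size

    def read(self, index):
        return self.regs[index]

    def write(self, index, value):
        self.regs[index] = value


def select_write_data(is_transfer, unit_result, remote_value):
    """Model of selector 711: pass the operation unit result for an ordinary
    instruction, or the other register file's output for a transfer."""
    return remote_value if is_transfer else unit_result


rf704 = RegisterFile()
rf705 = RegisterFile()
rf704.write(3, 42)  # data produced on the processor element 712 side

# Ordinary operation instruction on processor element 713 (unit 710 adds 5).
unit_result = rf705.read(0) + 5
rf705.write(1, select_write_data(False, unit_result, rf704.read(3)))

# Inter-register transfer instruction: copy register 3 of file 704 into file 705.
rf705.write(2, select_write_data(True, unit_result, rf704.read(3)))

print(rf705.read(1), rf705.read(2))  # 5 42
```

In this model the transfer occupies the write path that an ordinary operation would otherwise use, which is why the transfer has to be issued as a separate instruction.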
In such a configuration, each instruction group (each processor element) needs only a register file with 6 ports (4 read ports and 2 write ports). That is, each register file needs only half of the 12 ports (8 read ports and 4 write ports) that would be required if all four instructions shared one register file.
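The saving obtained by dividing the instructions into groups can be sketched as follows. The helper name and its parameters are assumptions for illustration, and the sketch assumes two read ports and one write port per instruction, which matches the figures given above.

```python
def ports_per_register_file(instructions, groups,
                            reads_per_instruction=2, writes_per_instruction=1):
    """Return (read_ports, write_ports) needed by each register file when the
    parallel instructions are split evenly over `groups` register files
    (hypothetical helper for illustration)."""
    if groups <= 0 or instructions % groups != 0:
        raise ValueError("instructions must divide evenly into groups")
    per_group = instructions // groups
    return (per_group * reads_per_instruction,
            per_group * writes_per_instruction)


print(ports_per_register_file(4, 1))  # (8, 4): one shared file, 12 ports
print(ports_per_register_file(4, 2))  # (4, 2): two files, 6 ports each
```

The number of instructions executed in parallel stays at four in both cases; only the number of ports on each register file changes.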
For example, Japanese patent application laid-open No. 5-233281 (1993) discloses a high-performance calculator that, by using such a technique, improves the separation between processor elements and facilitates chip layout.
In the configuration shown in FIG. 1, a processor element can easily share data with the other processor elements and rapidly access data produced by another processor element. However, this configuration has the problem that performance does not scale, because the number of ports of the register file, and hence its delay and area, increases with the number of operation units mounted on the processor elements. Also, for a program, such as an image processing program, that has high instruction independence and data locality and uses little data shared between processor elements, ports are wasted because there are more of them than needed.
On the other hand, in the configuration shown in FIG. 2, the number of ports of each register file can be reduced, but data must be transferred between register files when the data to be used reside in a register file assigned to another processor element. This transfer is performed by an inter-register transfer instruction, which causes overhead and thereby impairs high-speed access.