The present invention relates to a high-performance multi-processor system which executes a plurality of instructions simultaneously, and more particularly to a multi-thread execution method for the multi-processor system.
A technique for increasing the processing speed of processors has been put to practical use, in which instructions are issued simultaneously to a plurality of arithmetic units installed in the processor by exploiting instruction-level parallelism. With this technique, it is ideally possible to execute a plurality of instructions every clock cycle.
However, there are dependencies between instructions. If an antecedent instruction has not completed, subsequent instructions sometimes cannot be executed. Therefore, the number of instructions that can be executed simultaneously is limited. Moreover, a conditional branch instruction may prevent instructions from being supplied smoothly to the arithmetic units. Under these circumstances, even if there were an infinite number of arithmetic units, the increase in system performance would actually be limited to only three or four times.
This limitation on the increase in system performance is described in the following paper (Monica Lam and Robert Wilson: "Limits of Control Flow on Parallelism", The 19th International Symposium on Computer Architecture, IEEE Computer Society Press, pp.46-57, 1992).
In view of this limitation, the following technologies have been proposed as means for further increasing system performance.
(1) In order to further exploit instruction-level parallelism, an out-of-order execution mechanism and a register renaming mechanism are introduced, thereby lessening the dependence relationships between instructions.
(2) A program is divided into a plurality of threads, and parallel processing is executed at the thread level.
The out-of-order execution mechanism of item (1) executes whichever instruction first becomes executable, regardless of the execution order of the program. To achieve this, it is necessary to resolve the anti-dependencies introduced by a shortage of registers at the time of register allocation, even where the data dependencies have been resolved. To resolve the anti-dependencies, a register renaming mechanism dynamically changes the name of a register designated by software to another name.
For example, assume that instructions are issued according to the following program:
0x10: add r1 ← r2 + r3
0x14: sub r4 ← r1 - r5
0x18: add r5 ← r6 + r7
0x1c: sub r8 ← r4 - r5
Here, since the sub instruction at address "0x14" uses the result of the add instruction at address "0x10" (register "r1"), a data dependence exists on register "r1". Similarly, the sub instruction at address "0x1c" has a data dependence on the add instruction at address "0x18". Although there is no data dependence between the sub instruction at address "0x14" and the add instruction at address "0x18", there is an anti-dependence on register "r5": the sub instruction reads "r5" and the add instruction then overwrites it. Therefore, the add instruction at address "0x18" cannot be executed before the sub instruction at address "0x14" has completed. In such a case, the register renaming mechanism assigns different names to the register "r5" used at address "0x14" and the register "r5" used at address "0x18", thereby making out-of-order execution possible.
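The renaming step described above can be sketched in a few lines of Python. This is only an illustrative model, not the mechanism of any particular processor: the physical register names "p0", "p1", ... and the helper functions `rename` and `dependences` are hypothetical, and the dependence check is a simple pairwise comparison that is adequate only because no register in this program is written twice.

```python
from itertools import count

# The four-instruction program from the text: (address, op, dest, src1, src2).
PROGRAM = [
    (0x10, "add", "r1", "r2", "r3"),
    (0x14, "sub", "r4", "r1", "r5"),
    (0x18, "add", "r5", "r6", "r7"),
    (0x1C, "sub", "r8", "r4", "r5"),
]

def rename(program):
    """Give every destination a fresh physical register (hypothetical names p0, p1, ...)."""
    fresh = (f"p{i}" for i in count())
    mapping = {}  # architectural register -> most recent physical register
    renamed = []
    for addr, op, dest, s1, s2 in program:
        # Sources read the current mapping; registers never written before
        # keep their architectural name (they hold initial values).
        ps1 = mapping.get(s1, s1)
        ps2 = mapping.get(s2, s2)
        pd = next(fresh)
        mapping[dest] = pd
        renamed.append((addr, op, pd, ps1, ps2))
    return renamed

def dependences(program):
    """List (kind, earlier_addr, later_addr, register) for data (RAW) and anti (WAR) pairs."""
    found = []
    for i, (a_addr, _, a_dest, a_s1, a_s2) in enumerate(program):
        for b_addr, _, b_dest, b_s1, b_s2 in program[i + 1:]:
            if a_dest in (b_s1, b_s2):      # later instruction reads the earlier result
                found.append(("RAW", a_addr, b_addr, a_dest))
            if b_dest in (a_s1, a_s2):      # later instruction overwrites an earlier source
                found.append(("WAR", a_addr, b_addr, b_dest))
    return found

before = dependences(PROGRAM)
after = dependences(rename(PROGRAM))
```

In this sketch, `before` contains the anti-dependence on "r5" between addresses 0x14 and 0x18, while `after` contains only the three data dependences, so the add at 0x18 is no longer ordered behind the sub at 0x14.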
However, even when such a register renaming mechanism is used, out-of-order execution covers only 16 to 32 instructions at most. It often happens that few instructions within this range can actually be executed out of order. Moreover, even when this range is enlarged, many instructions cannot be executed because of the foregoing dependencies. Therefore, an increase in performance commensurate with the increase in the quantity of hardware cannot be expected.
On the other hand, thread-level parallel processing is a method in which parallelism is exploited not instruction by instruction but by executing the instructions of a plurality of threads in parallel, whereby the arithmetic units are utilized more effectively, resulting in an increase in processing speed. In general, little dependence is present between threads, so that an increase in performance can be achieved more easily than with the foregoing instruction-level parallel processing.
When thread-level parallel processing is used to increase the processing speed of a single task, efficient generation of threads and delivery of data between threads are essential. An example of a parallel processing processor for fine-grained threads is described in the following paper (Gurindar Sohi, Scott Breach and T. Vijaykumar: "Multiscalar Processors", The 22nd International Symposium on Computer Architecture, IEEE Computer Society Press, pp.414-425, 1995).
In the Multiscalar Processor, a single program is divided into "tasks", each of which is an aggregation of basic blocks, and the tasks are processed by a processor capable of parallel execution. The delivery of register contents between tasks is designated by a task descriptor, which is generated by a task generation compiler. The task descriptor explicitly designates the registers that the task may produce. This designation is called a create mask. Moreover, forward bits are added to each instruction that performs the final update of a register designated in the create mask. As described above, the Multiscalar Processor performs parallel execution according to code that depends on the analysis ability of the compiler.
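As a rough illustration of the create mask and forward bits described above, the following Python sketch derives both for a single straight-line task. It is an assumption-laden simplification: the instruction tuples, the sample task `TASK`, and the functions `create_mask` and `forward_bits` are hypothetical, and a real task spanning several basic blocks would require the compiler to account for every control-flow path.

```python
# A hypothetical straight-line task: (dest, op, src1, src2).
TASK = [
    ("r1", "add", "r2", "r3"),
    ("r4", "sub", "r1", "r5"),
    ("r1", "add", "r4", "r6"),   # final update of r1 within the task
]

def create_mask(task):
    """Registers the task produces, i.e. those later tasks may need to receive."""
    return {dest for dest, _op, _s1, _s2 in task}

def forward_bits(task):
    """One flag per instruction: True if it performs the last write to its destination."""
    last_write = {}
    for index, (dest, _op, _s1, _s2) in enumerate(task):
        last_write[dest] = index          # later writes overwrite earlier entries
    return [last_write[dest] == index
            for index, (dest, _op, _s1, _s2) in enumerate(task)]

mask = create_mask(TASK)
bits = forward_bits(TASK)
```

Here the first write to "r1" is not marked for forwarding because a later instruction in the same task overwrites it; only the final values of "r1" and "r4" would be passed on.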
However, when conventional code is converted to thread-level parallel processing, or when code is difficult to subject to dependency analysis, the Multiscalar Processor cannot enhance its performance. Moreover, the task descriptors increase the code size. Since the Multiscalar Processor does not support out-of-order execution, it cannot benefit from existing instruction-level parallel processing, so that its performance enhancement is limited compared to the conventional technology.
As described above, in the conventional thread generation technology for thread-level parallel processing, the contents of the registers are inherited either explicitly or through memory. It is therefore necessary either to describe the register dependencies or to pass data to newly generated threads using load/store instructions. Consequently, in order to generate threads, instructions for data inheritance must be inserted into both the thread-generating threads and the generated threads.
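The overhead of inheriting data through memory can be pictured with the following Python sketch. It is only an analogy under stated assumptions: the dictionary `memory` stands in for shared memory, the local variables stand in for registers, and the function `parent` is hypothetical. The point is that both the generating thread and the generated thread contain extra operations solely for passing values.

```python
import threading

def parent():
    memory = {}               # stands in for shared memory
    r1, r2 = 10, 32           # values held in the parent's "registers"

    # Inserted into the thread-generating thread: store instructions.
    memory["r1"] = r1
    memory["r2"] = r2

    result = []

    def child():
        # Inserted into the generated thread: load instructions.
        c1 = memory["r1"]
        c2 = memory["r2"]
        result.append(c1 + c2)   # the child's actual work

    worker = threading.Thread(target=child)
    worker.start()
    worker.join()
    return result[0]

total = parent()
```

The two stores and two loads do no useful computation; they exist only so that the child can see the parent's register contents.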
Moreover, an out-of-order execution type processor executes synchronization instructions in order so as to preserve correctness, and in this case the decrease in performance is significant. Threads are therefore generated by "fork" instructions. In the case of fine-grained threads, increasing the processing speed requires that the instructions before and after the fork instruction be executed out of order.