1. Field of the Invention
The present invention relates to a program parallel execution method in a parallel processor system and, more particularly, to a multi-thread execution method of executing a single program divided into a plurality of threads in parallel by a plurality of processors, and to a parallel processor system therefor.
2. Description of the Related Art
Among methods of processing a single program in parallel by a parallel processor system is a multi-thread execution method which executes a program divided into instruction streams called threads in parallel by a plurality of processors. Such a method is recited, for example, in Japanese Patent Laying-Open (Kokai) No. Heisei 10-27108 (hereinafter referred to as Literature 1), “Proposal of On Chip Multiprocessor Oriented Control Parallel Architecture MUSCAT” (Parallel Processing Symposium JSPP97 Articles, Information Processing Society of Japan, pp. 229-236, May 1997) (hereinafter referred to as Literature 2), and Japanese Patent Laying-Open (Kokai) No. Heisei 10-78880 (hereinafter referred to as Literature 3). In the following, the conventional multi-thread execution methods recited in these literatures will be described.
It is common practice in a multi-thread execution method to refer to the generation of a new thread on another processor as forking a thread, to the thread which conducts the forking operation as a master thread, to the newly generated thread as a slave thread, to the point where a thread is forked as a fork point, and to the top portion of a slave thread as a fork destination address or the start point of the slave thread. In the Literatures 1 to 3, a fork instruction is inserted at a fork point in order to instruct thread forking. The fork instruction designates a fork destination address, so that execution of the fork instruction generates, on another processor, a slave thread starting at the fork destination address and starts execution of the slave thread. In addition, an instruction called a term instruction which terminates processing of a thread is prepared, so that each processor ends processing of a thread by executing a term instruction.
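The fork/term semantics described above can be illustrated by a small software sketch. This is only an analogy (the literatures concern hardware threads on processor elements, not OS threads), and all names below are illustrative:

```python
# Illustrative sketch only: modeling a fork instruction (generate a
# slave thread starting at a fork destination) and a term instruction
# (end the current thread) using OS threads as stand-ins for processors.
import threading

results = []

def thread_b():
    # Slave thread: begins execution at its fork destination address.
    results.append("B executed")

def thread_a():
    # Master thread reaches its fork point and issues the fork:
    slave = threading.Thread(target=thread_b)
    slave.start()                  # slave starts on another "processor"
    results.append("A executed")   # master continues its own instructions
    slave.join()
    # Returning here plays the role of the term instruction.

thread_a()
```

The point of the analogy is that the master does not stop at the fork point: both instruction streams run concurrently after the fork.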
FIG. 24 shows the outline of processing by a conventional multi-thread execution method. FIG. 24(a) shows a single program divided into three threads A, B and C. When the program is processed by a single processor, one processor PE sequentially processes the threads A, B and C as shown in FIG. 24(b). On the other hand, by the multi-thread execution methods recited in the Literatures 1 to 3, as shown in FIG. 24(c), one processor PE1 executes the thread A and, while the processor PE1 executes the thread A, the thread B is generated on another processor PE2 by a fork instruction embedded in the thread A, so that the processor PE2 executes the thread B. The processor PE2 likewise generates the thread C on a processor PE3 according to a fork instruction embedded in the thread B. The processors PE1 and PE2 end the processing of their threads according to term instructions embedded immediately before the start points of the threads B and C, respectively, and after executing the last instruction of the thread C, the processor PE3 executes the subsequent instruction (a system call instruction in general). By thus simultaneously executing threads in parallel on a plurality of processors, higher performance can be obtained than that of sequential processing.
As another conventional multi-thread execution method, there exists a method of generating the thread B on the processor PE2 and the thread C on the processor PE3 by conducting forking a plurality of times from the processor PE1 which executes the thread A, as shown in FIG. 24(d). In contrast to the model shown in FIG. 24(d), a multi-thread execution method which, as shown in FIG. 24(c), imposes the constraint that a thread is allowed to generate a valid slave thread only once during its existence is referred to as a one fork model. The one fork model enables drastic simplification of thread management and realizes a thread controller as hardware of a practical scale. The present invention is premised on such a one fork model.
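As a hedged illustration (not part of the cited literatures), the one fork model's constraint that each thread may fork at most once might be modeled as follows, with all class and variable names hypothetical:

```python
# Illustrative model of the one-fork constraint: a thread may issue
# at most one valid fork during its existence, forming a chain of
# threads rather than a star.
class OneForkThread:
    def __init__(self, start_address):
        self.start_address = start_address
        self.has_forked = False

    def fork(self, fork_destination):
        if self.has_forked:
            raise RuntimeError("one fork model: thread has already forked")
        self.has_forked = True
        # The slave thread itself may fork once, so threads chain.
        return OneForkThread(fork_destination)

a = OneForkThread("thread_A")
b = a.fork("thread_B")   # valid: first fork of A
c = b.fork("thread_C")   # valid: first fork of B
```

A second call to `a.fork(...)` would raise an error, which corresponds to permitting the chain-shaped execution of FIG. 24(c) but not the multiple forks of FIG. 24(d).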
Here, when no free processor which allows generation of a slave thread exists at the time of a fork instruction, either of the following two methods is conventionally adopted.
(1) A processor executing a master thread waits until a free processor which allows generation of a slave thread appears.
(2) A processor executing a master thread continues the subsequent processing of the master thread while preserving the contents of the register file (a fork destination address and register contents) at the fork point in a physical register on the backside. The contents of the register file preserved in the backside physical register are referred to in order to generate a slave thread at the time point when a free processor which allows generation of a slave thread appears.
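Method (2) can be sketched in software as follows; the backside physical register is modeled as a saved snapshot, and all names are hypothetical:

```python
# Hedged sketch of method (2): when no processor is free at the fork
# point, the fork destination address and register contents are
# preserved (modeling the backside physical register) and the fork is
# issued later, while the master continues its own processing.
class MasterProcessor:
    def __init__(self):
        self.regs = [0] * 32       # frontside register file
        self.backside = None       # preserved (address, register copy)

    def try_fork(self, dest, free_processor):
        if free_processor:
            return (dest, list(self.regs))       # fork immediately
        self.backside = (dest, list(self.regs))  # preserve and continue
        return None

    def fork_when_free(self):
        # Called when a free processor appears: the slave is generated
        # from the register contents as they were at the fork point.
        dest, saved_regs = self.backside
        self.backside = None
        return (dest, saved_regs)

pe1 = MasterProcessor()
pe1.regs[3] = 10
deferred = pe1.try_fork("thread_B", free_processor=False)  # None: deferred
pe1.regs[3] = 99           # master continues and modifies its registers
dest, regs = pe1.fork_when_free()
```

The snapshot is what makes the deferred fork correct: the slave must observe the register values of the fork point, not the values the master produced afterwards.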
For a master thread to generate a slave thread and cause the slave thread to conduct predetermined processing, it is necessary to hand over, from the master thread to the slave thread, the value of at least each register needed by the slave thread among the registers in the register file at the fork point of the master thread. In order to reduce the cost of this data handover between threads, the methods recited in the Literatures 2 and 3 provide a hardware mechanism for handing over register values at the time of thread generation. This mechanism copies the entire contents of the register file of the master thread to the slave thread at the time of thread generation. After the slave thread is generated, register values of the master thread and the slave thread change independently of each other, so that no data is handed over between the threads through the registers. Proposed as another conventional technique related to data handover between threads is a parallel processor system having a mechanism of individually transferring register values on a register basis according to an instruction.
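The whole-register-file copy described for the Literatures 2 and 3 amounts to the following behavior, shown as a software sketch rather than the actual hardware mechanism:

```python
# Sketch of register handover by copying the entire register file of
# the master thread to the slave thread at thread generation; after
# the fork, the two register files change independently.
def generate_slave(master_regs):
    return list(master_regs)   # copy all register contents at fork time

master = [0] * 32
master[5] = 123
slave = generate_slave(master)
master[5] = 456   # later master updates do not reach the slave
```

By contrast, the per-register transfer mechanism mentioned at the end of the paragraph would move individual register values under explicit instruction control instead of copying the whole file at once.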
Other than those mentioned above, in the MUSCAT recited in the Literature 2, a number of dedicated instructions, such as an inter-thread synchronization instruction, are prepared for flexibly controlling the parallel operation of threads.
As described above, in the conventional multi-thread execution methods, when no free processor which allows generation of a slave thread exists, execution of the fork instruction of the master thread is kept waiting until a free processor appears; depending on circumstances, execution may be kept waiting so long as to extremely reduce processing efficiency.
In order to mitigate the reduction in processing efficiency caused by this stop in processing, the method which enables processing of a master thread to be continued by preserving the contents of the register file at the fork point in a physical register on the backside requires at least two register files, one on the front side and one on the back side, for each processor. Assuming, for example, that one register file accommodates 32 32-bit registers, an additional 32×32 bits of memory is required for each processor, resulting in a non-negligible increase in the amount of hardware in an on-chip parallel processor in which a number n of processors are integrated on one chip. Moreover, beyond the simple increase in the amount of hardware, the backside physical register must be saved and restored together with the frontside physical register at the time of a process switch by the operating system (OS), which increases the volume of processing at the time of process switching and invites degradation of performance due to the increased switching overhead.
As described above, in the conventional multi-thread execution methods, a term instruction must without fail be described immediately before a slave thread start point in order to end a thread. Because one term instruction is required for each thread, the smaller the number of instructions a thread includes, the higher the proportion of term instructions in the total number of instructions becomes. Since a term instruction, like other instructions, is stored in the instruction memory and becomes an object of fetching, problems arise of an increase in the amount of hardware of the instruction memory and of degradation of processing performance due to an increase in the number of instruction fetches.