1. Field of the Invention
The present invention relates to a method of executing a program in parallel in a parallel processing system, and more particularly to a multi-thread executing method and a parallel processing system that divide a single program into a plurality of threads and execute the program in parallel on a plurality of processors.
2. Description of the Related Art
One method of processing a single program in parallel in a parallel processing system is a multi-thread executing method, in which the program is divided into instruction flows called threads and executed in parallel by a plurality of processors. This method is described in Japanese Patent Publication Laid-Open No. 10-27108 (hereinafter referred to as article 1), “Suggestion of On-Chip Multiprocessor-Oriented Multi-Stream Control Architecture MUSCAT” (pp. 229–236, papers of Joint Symposium on Parallel Processing JSPP97, Information Processing Society of Japan, May 1997) (hereinafter referred to as article 2), and Japanese Patent Publication Laid-Open No. 10-78880 (hereinafter referred to as article 3). The conventional techniques described in these articles are explained below.
In a general multi-thread executing method, generating a new thread on another processor is referred to as “forking a thread”. The thread that performs the fork operation is called a parent thread, the newly generated thread is called a child thread, the position at which a thread is forked is called a fork point, and the head position of a child thread is called a fork destination address or a starting point of the child thread. In articles 1 to 3, a fork instruction is inserted at a fork point in order to fork a thread. The fork instruction specifies a fork destination address; executing the fork instruction generates, on another processor, a child thread starting from the fork destination address, and execution of the child thread begins. In addition, an instruction called a term instruction is provided for finishing the processing of a thread, and each processor finishes the processing of a thread by executing the term instruction.
FIG. 37 shows an outline of the processing of the conventional multi-thread executing method. FIG. 37(a) shows a single program divided into three threads A, B, and C. When a single processor processes the program, one processor PE sequentially processes the threads A, B, and C, as illustrated in FIG. 37(b). In contrast, in the multi-thread executing method of articles 1 to 3, as illustrated in FIG. 37(c), one processor PE1 executes the thread A and, while the processor PE1 is executing the thread A, the thread B is generated on another processor PE2 according to the fork instruction embedded in the thread A, and the processor PE2 then executes the thread B. Likewise, the processor PE2 generates the thread C on the processor PE3 according to the fork instruction embedded in the thread B. The processors PE1 and PE2 finish the processing of their threads according to the term instructions embedded just before the starting points of the threads B and C, respectively, and when the processor PE3 has executed the last instruction of the thread C, it executes the next instruction (generally, a system call instruction). By executing threads simultaneously on a plurality of processors in this way, performance can be improved compared with serial processing.
As another conventional multi-thread executing method, as illustrated in FIG. 37(d), there is a method in which the processor PE1 executing the thread A forks a plurality of times, generating the thread B on the processor PE2 and the thread C on the processor PE3. In contrast to the model of FIG. 37(d), the multi-thread executing method illustrated in FIG. 37(c), in which a thread is restricted to generating an effective child thread at most once during its existence, is called a Fork-Once Parallel Execution model. The Fork-Once Parallel Execution model greatly simplifies thread management, so that a thread controller can be implemented in hardware on a realistic hardware scale. Moreover, since each processor creates a child thread on only one other processor, multi-thread execution is possible in a parallel processing system in which adjacent processors are connected in a single direction in a ring. The present invention assumes this Fork-Once Parallel Execution model.
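The Fork-Once restriction on a unidirectional ring can be illustrated by the following minimal Python sketch. It is a software model for explanation only and is not the hardware of articles 1 to 3; the names `Processor`, `Ring`, `start`, `fork`, and `term` are hypothetical, and processors are indexed from 0, so that index 0 corresponds to PE1 in FIG. 37(c).

```python
class Processor:
    """One processing element (PE) in a unidirectional ring (hypothetical model)."""

    def __init__(self, index):
        self.index = index
        self.thread = None        # name of the running thread, or None if vacant
        self.has_forked = False   # True once the running thread has an effective child


class Ring:
    """PEs connected in one direction: PE i may fork only onto PE (i + 1) mod n."""

    def __init__(self, count):
        self.pes = [Processor(i) for i in range(count)]

    def start(self, pe_index, thread):
        pe = self.pes[pe_index]
        pe.thread = thread
        pe.has_forked = False

    def fork(self, pe_index, child):
        """Fork `child` from the thread on PE `pe_index` onto the adjacent PE."""
        parent = self.pes[pe_index]
        if parent.has_forked:
            raise RuntimeError("Fork-Once violated: thread already has a child")
        neighbor = self.pes[(pe_index + 1) % len(self.pes)]
        if neighbor.thread is not None:
            return False          # neighbor is busy; the caller must wait or defer
        self.start(neighbor.index, child)
        parent.has_forked = True
        return True

    def term(self, pe_index):
        """Model of the term instruction: the PE finishes its thread."""
        self.pes[pe_index].thread = None
        self.pes[pe_index].has_forked = False


ring = Ring(3)
ring.start(0, "A")
ring.fork(0, "B")   # thread A forks thread B onto the next PE
ring.fork(1, "C")   # thread B forks thread C onto the next PE
placement = [pe.thread for pe in ring.pes]
```

In this model a second fork from the thread A raises an error, which corresponds to the case that the Fork-Once Parallel Execution model excludes.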
When no vacant processor on which a child thread can be generated exists at the time a fork instruction is executed, one of the following two methods has conventionally been adopted.
(1) The processor executing the parent thread waits to execute the fork instruction until a vacant processor on which the child thread can be generated appears.
(2) The processor executing the parent thread stores the fork destination address and the content of the register file at the fork point in a back-side (backup) physical register and continues the subsequent processing of the parent thread. When a vacant processor capable of creating the child thread appears, the fork destination address and the register file content stored in the back-side physical register are referred to in order to create the child thread.
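The difference between methods (1) and (2) can be sketched in Python as follows. This is an illustrative model under stated assumptions: time is counted in abstract cycles, the register file is modeled as a dictionary, and the names `fork_with_wait`, `DeferredFork`, and `on_vacant` are hypothetical.

```python
def fork_with_wait(vacant_at, fork_time):
    """Method (1): the parent stalls at the fork point until a PE is vacant.

    Returns (time at which the child starts, cycles the parent was stalled).
    """
    start = max(fork_time, vacant_at)
    return start, start - fork_time


class DeferredFork:
    """Method (2): save the fork destination and a register snapshot, keep running."""

    def __init__(self):
        self.saved = None   # (fork destination address, register snapshot)

    def fork(self, dest, registers, vacant):
        if vacant:
            return (dest, dict(registers))      # child is created immediately
        self.saved = (dest, dict(registers))    # snapshot taken at the fork point
        return None                             # parent continues without stalling

    def on_vacant(self):
        """Called when a PE becomes vacant; returns the pending child, if any."""
        child, self.saved = self.saved, None
        return child


# Method (1): fork at cycle 5, a PE becomes vacant at cycle 12.
child_start, stall = fork_with_wait(vacant_at=12, fork_time=5)

# Method (2): the parent changes r1 after the fork point, but the child
# still receives the value that r1 had at the fork point.
regs = {"r1": 10}
deferred = DeferredFork()
immediate = deferred.fork(0x400, regs, vacant=False)
regs["r1"] = 99
child = deferred.on_vacant()
```

Method (1) costs the parent the stall time, whereas method (2) costs storage for the saved address and register content.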
In order for a parent thread to generate a child thread and have the child thread perform predetermined processing, at least the values of those registers in the register file at the fork point of the parent thread that the child thread requires must be passed from the parent thread to the child thread. In order to reduce the cost of transferring data between threads, articles 2 and 3 provide a hardware mechanism for inheriting register values at the time of thread generation. With this mechanism, the entire content of the register file of the parent thread is copied to the child thread. After the child thread is created, the register values of the parent thread and the child thread change independently, and no data is transferred between the threads by way of registers. As another conventional technique for data transfer between threads, a parallel processing system has been proposed that has a mechanism for transferring register values individually, register by register, by means of instructions.
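The two register transfer schemes can be contrasted in a short Python sketch. The register file is modeled as a list, and the function names `inherit_registers` and `send_register` are hypothetical; the cited articles do not give the instruction formats, so the per-register transfer shown here is an assumption made for illustration.

```python
def inherit_registers(parent_regs):
    """Whole-file inheritance: copy every parent register at the fork point."""
    return list(parent_regs)


def send_register(parent_regs, child_regs, index):
    """Per-register transfer: one instruction moves one register value."""
    child_regs[index] = parent_regs[index]


parent = [0, 7, 42, 3]
child = inherit_registers(parent)

# After the fork the two register files change independently.
parent[1] = 100
child[2] = -1

# With individual transfer, only the registers that are sent become defined.
sparse_child = [None] * len(parent)
send_register(parent, sparse_child, 1)
```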
In the multi-thread executing method, threads whose execution has been determined are, as a rule, executed in parallel ahead of time; in actual programs, however, there are many cases in which a sufficient number of executable threads cannot be obtained. Moreover, the desired performance may not be obtained because the degree of parallelism is reduced by dynamically determined dependencies and by the limits of compiler analysis. Therefore, article 1 introduces control speculation and supports the speculative execution of threads in hardware. In control speculation, a thread that is highly likely to be executed is executed speculatively before its execution is determined. A speculative thread performs temporary execution within the range in which the execution can be canceled by hardware. The state in which a child thread performs temporary execution is called a temporary executing state, and when a child thread is in the temporary executing state, the parent thread is said to be in a thread temporarily creating state. A child thread in the temporary executing state is inhibited from writing to the shared memory and the cache memory, and writes instead to a separately provided temporary buffer. When the speculation is determined to be correct, a speculation success notice is issued from the parent thread to the child thread, and the child thread reflects the content of the temporary buffer in the shared memory and the cache memory and enters a normal state in which the temporary buffer is not used. The parent thread changes from the thread temporarily creating state to a thread creating state. On the other hand, when the speculation is determined to have failed, a thread abort instruction (abort) is executed in the parent thread, and the execution of the child thread and of any threads following it is canceled. The parent thread changes from the thread temporarily creating state to a thread non-generated state and can thus generate a child thread again.
That is, although the Fork-Once Parallel Execution model limits a thread to generating at most one child thread, when control speculation is performed and the speculation fails, a fork becomes possible again. In this case as well, there is at most one effective child thread.
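The parent thread states and the temporary buffer described above can be modeled by the following Python sketch. It is a simplified software illustration, not the hardware of article 1: the shared memory is modeled as a dictionary, the cache memory is omitted, and the class and method names (`ParentState`, `SpeculativeChild`, `Parent`, `speculative_fork`, `speculation_succeeded`, `abort`) are hypothetical.

```python
from enum import Enum


class ParentState(Enum):
    NOT_FORKED = "thread non-generated state"
    TEMP_FORKED = "thread temporarily creating state"
    FORKED = "thread creating state"


class SpeculativeChild:
    def __init__(self, memory):
        self.memory = memory      # shared memory, modeled as a dict
        self.buffer = {}          # temporary buffer for speculative writes
        self.speculative = True   # True while in the temporary executing state

    def write(self, addr, value):
        if self.speculative:
            self.buffer[addr] = value    # writes to shared memory are inhibited
        else:
            self.memory[addr] = value

    def commit(self):
        """On a speculation success notice, reflect the buffer in memory."""
        self.memory.update(self.buffer)
        self.buffer.clear()
        self.speculative = False


class Parent:
    def __init__(self, memory):
        self.memory = memory
        self.state = ParentState.NOT_FORKED
        self.child = None

    def speculative_fork(self):
        if self.state is not ParentState.NOT_FORKED:
            raise RuntimeError("Fork-Once: an effective child already exists")
        self.child = SpeculativeChild(self.memory)
        self.state = ParentState.TEMP_FORKED
        return self.child

    def speculation_succeeded(self):
        self.child.commit()
        self.state = ParentState.FORKED

    def abort(self):
        """Speculation failed: discard the child and allow a new fork."""
        self.child = None
        self.state = ParentState.NOT_FORKED


memory = {"x": 1}
parent = Parent(memory)

child = parent.speculative_fork()
child.write("x", 2)
value_during_speculation = memory["x"]   # shared memory is not yet changed
parent.abort()                           # failed speculation, buffer discarded

child = parent.speculative_fork()        # a fork is possible again
child.write("x", 3)
parent.speculation_succeeded()           # buffer is reflected in shared memory
```

At every point in this sequence the parent has at most one effective child thread, which is the property that the Fork-Once Parallel Execution model requires.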
MUSCAT, described in article 2, provides many dedicated instructions for flexibly controlling the parallel operation of threads, such as instructions for synchronization between threads.
In order to realize multi-thread execution under the Fork-Once Parallel Execution model, in which a thread generates at most one effective child thread during its existence, article 2 restricts every thread, at the compile stage in which a parallel program is created from a serial processing program, to instruction code that executes a valid fork only once. In short, the Fork-Once limitation is statically guaranteed in the parallel program.
However, it is difficult for a compiler to maintain the Fork-Once limitation because of problems such as separate compilation and function calls. In the conventional multi-thread executing method and parallel processing system, a parallel program that fails to maintain the limitation cannot run properly. For example, in a program including a main function and a func function as illustrated in FIG. 38 and FIG. 39, if a fork instruction is inserted in both the main function and the func function as illustrated in FIG. 38, the Fork-Once limitation is maintained when the control flow branching from a block a to a block b is executed. However, when the control flow branching from the block a to a block c is executed, a fork is performed twice from the same thread; the Fork-Once limitation is therefore not guaranteed, and the program does not run normally. It has therefore been necessary to guarantee the Fork-Once limitation at the compile stage by inserting a fork instruction into only one of the main function and the func function. FIG. 39 shows an example of a parallel program in which the fork instruction is inserted only in the func function, and the early execution of a block d from the point of the block a in the main function is abandoned.
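The problem with FIG. 38 can be illustrated by counting the fork instructions that one thread executes along each control flow. The Python sketch below is based on assumptions, since the figures are not reproduced here: it assumes that the fork instruction of the main function is in the block a and that the block c calls the func function. The names `forks_issued` and `keeps_fork_once` are hypothetical.

```python
def forks_issued(path, fork_in_main=True, fork_in_func=True):
    """Count the forks one thread executes along `path` ("a->b" or "a->c").

    Assumption: the block a of main holds main's fork instruction, and the
    block c calls func, which may hold a second fork instruction.
    """
    forks = 0
    if fork_in_main:
        forks += 1            # fork instruction in the block a of main
    if path == "a->c" and fork_in_func:
        forks += 1            # fork instruction inside func, same thread
    return forks


def keeps_fork_once(fork_in_main, fork_in_func):
    """True if no control flow makes the thread fork more than once."""
    return all(
        forks_issued(path, fork_in_main, fork_in_func) <= 1
        for path in ("a->b", "a->c")
    )


fig38_ok = keeps_fork_once(fork_in_main=True, fork_in_func=True)
fig39_ok = keeps_fork_once(fork_in_main=False, fork_in_func=True)
```

Under these assumptions the placement of FIG. 38 violates the limitation only on the path through the block c, while the placement of FIG. 39 satisfies it on both paths at the cost of issuing no fork on the path through the block b.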