1. Field of the Invention
The present invention relates to a parallel processor system in which a plurality of processors execute, in parallel with each other, a plurality of threads obtained by dividing a single program, and more particularly, to a method of, and a device for, handing a register updated in a master thread after forking over to a slave thread.
2. Description of the Related Art
One method of processing a single program in parallel on a parallel processor system is a multi-thread execution method, in which instruction streams called threads, obtained by dividing the program, are executed in parallel with each other by a plurality of processors. Literature describing this method includes, for example, Japanese Patent Laying-Open (Kokai) No. Heisei 10-27108 (hereinafter referred to as Literature 1), “Control Parallel On Chip Multiprocessor: MUSCAT” (Parallel Processing Symposium JSPP97 Articles, Japanese Society of Information Processing Engineers of Japan, pp. 229-236, May 1997) (hereinafter referred to as Literature 2), Japanese Patent Laying-Open (Kokai) No. Heisei 10-78880 (hereinafter referred to as Literature 3), “SKY: A Processor Architecture that Exploits Instruction-level Parallelism in Non-numeric Applications” (Parallel Processing Symposium JSPP98 Articles, Japanese Society of Information Processing Engineers of Japan, pp. 87-94, June 1998) (hereinafter referred to as Literature 4), and “Multiscalar Processor” (G. S. Sohi, S. E. Breach and T. N. Vijaykumar, the 22nd International Symposium on Computer Architecture, IEEE Computer Society Press, 1995, pp. 414-425) (hereinafter referred to as Literature 5). In the following, the conventional multi-thread execution methods recited in these Literatures will be described.
In general, in a multi-thread execution method, generating a new thread on another processor is called “forking a thread”. The thread that performs the forking operation is called a master thread, the newly generated thread is called a slave thread, the point at which a thread is forked is called a fork point, and the head of a slave thread is called a fork destination address or a start point of the slave thread. In Literatures 1 to 4, a fork instruction is inserted at a fork point to direct thread forking. The fork instruction designates a fork destination address, so that executing the fork instruction generates, on another processor, a slave thread starting at the fork destination address and starts its execution. In addition, an instruction called a term instruction, which terminates processing of a thread, is provided, and each processor ends processing of a thread by executing the term instruction.
FIG. 12 outlines the processing of a multi-thread execution method. FIG. 12(a) shows a single program divided into three threads A, B and C. When the program is processed by a single processor, one processor PE sequentially processes the threads A, B and C as shown in FIG. 12(b). By contrast, as shown in FIG. 12(c), in the multi-thread execution methods recited in Literatures 1 to 5, one processor PE1 executes the thread A and, while it does so, the thread B is generated on another processor PE2 by a fork instruction embedded in the thread A, and the processor PE2 executes the thread B. The processor PE2 in turn generates the thread C on a processor PE3 according to a fork instruction embedded in the thread B. The processors PE1 and PE2 end processing of their threads according to term instructions embedded immediately before the start points of the threads B and C, respectively, and after executing the last instruction of the thread C, the processor PE3 executes the subsequent instruction (in general, a system call instruction). By thus executing the threads simultaneously and in parallel on a plurality of processors, higher performance can be obtained than with sequential processing.
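The benefit outlined for FIG. 12(c) can be illustrated with a minimal timing sketch in Python. The function names, thread lengths and fork offsets below are hypothetical and chosen only for illustration; they are not taken from the Literatures. Fork overhead and inter-thread dependences are ignored.

```python
def sequential_time(lengths):
    """Cycles needed when one processor runs all threads back to back."""
    return sum(lengths)


def chained_fork_time(lengths, fork_offsets):
    """Cycles needed when each thread forks its successor on the next processor.

    lengths[i]      : assumed execution time of thread i, in cycles
    fork_offsets[i] : assumed cycle, counted from the start of thread i,
                      at which thread i forks thread i + 1
    """
    start = 0    # cycle at which the current thread begins
    finish = 0   # latest completion time seen so far
    for i, length in enumerate(lengths):
        finish = max(finish, start + length)
        if i < len(fork_offsets):
            start += fork_offsets[i]  # the successor begins at the fork point
    return finish
```

With three threads of 10 cycles each, and each fork assumed to occur 2 cycles into its master thread, sequential_time([10, 10, 10]) gives 30 cycles while chained_fork_time([10, 10, 10], [2, 2]) gives 14 cycles.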
As another conventional multi-thread execution method, there is a method in which the processor PE1 executing the thread A forks a plurality of times, generating the thread B on the processor PE2 and the thread C on the processor PE3, as shown in FIG. 12(d). In contrast to the model shown in FIG. 12(d), a multi-thread execution method subject to the constraint shown in FIG. 12(c), namely that a thread is allowed to generate a valid slave thread only once during its existence, is referred to as a one fork model. The one fork model drastically simplifies thread management and allows a thread controller to be realized in hardware on a practical scale. Moreover, since each processor has exactly one other processor on which it generates a slave thread, multi-thread execution is possible on a parallel processor system in which adjacent processors are connected in a unidirectional ring. The present invention is premised on such a one fork model.
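The one fork constraint and the unidirectional ring can be sketched as follows. The class and method names are hypothetical, and raising an error when the adjacent processor is busy is an assumption made for the sketch; the Literatures may handle that case differently.

```python
class Processor:
    """One processing element in a unidirectional ring (hypothetical model)."""

    def __init__(self, index):
        self.index = index
        self.thread = None       # name of the thread being executed, if any
        self.has_forked = False  # True once the thread has a valid slave


class Ring:
    def __init__(self, size):
        self.pes = [Processor(i) for i in range(size)]

    def start(self, pe_index, thread):
        pe = self.pes[pe_index]
        pe.thread = thread
        pe.has_forked = False

    def fork(self, pe_index, slave_thread):
        """Model a fork instruction executed by the thread on pe_index."""
        master = self.pes[pe_index]
        if master.thread is None:
            raise RuntimeError("no thread is running on this processor")
        if master.has_forked:
            raise RuntimeError("one fork model: a valid slave already exists")
        # Each processor forks only onto its single adjacent processor.
        neighbor = self.pes[(pe_index + 1) % len(self.pes)]
        if neighbor.thread is not None:
            # Assumption of this sketch: a busy neighbor is reported as an error.
            raise RuntimeError("adjacent processor is busy")
        master.has_forked = True
        self.start(neighbor.index, slave_thread)
        return neighbor.index

    def term(self, pe_index):
        """Model a term instruction: the thread on pe_index ends."""
        self.pes[pe_index].thread = None
```

In this sketch, starting thread A on processor 0 of a three-processor ring and forking B and then C places them on processors 1 and 2, whereas a second fork from thread A is rejected.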
When a slave thread is forked, registers must be taken over from the master thread to the slave thread. This register takeover is generally conducted in one of two manners. One, adopted in the parallel processor systems recited in Literatures 1 to 3, takes over only the contents of the register file of the master thread at the time of forking and not registers updated after forking; it will be referred to hereinafter as the register at forking transfer system. The other, adopted in the parallel processor systems recited in Literatures 4 and 5, takes over registers updated after forking as well; it will be referred to as the post-forking register transfer system.
Consider, for example, the sequential execution program shown in FIG. 13(a), in which the following are described in this order: an instruction 1 to increment the value of a register r20 by one, an instruction 2 to call a function func, an instruction 3 to increment the value of the register r20 by one, an instruction 4 to call the function func, and an instruction 5 to place in a register r13 the value obtained by incrementing the value of the register r20 by one. When the instruction stream from the instruction 5 onward is executed as a slave thread, the register at forking transfer system requires the fork instruction to be inserted at a point where the value of the register r20, to which the slave thread refers, has been settled, as shown in FIG. 13(b).
In the post-forking register transfer system, on the other hand, the settled value of the register r20 is transferred to the slave thread after forking, so the slave thread can be forked earlier, without waiting for the value of the register r20 to be settled. It is accordingly possible, for example, to insert the fork instruction immediately before the instruction 1 as shown in FIG. 13(c). This, however, exposes the slave thread side to a RAW (Read After Write) offense. In Literatures 4 and 5, therefore, the point at which a register needed by the slave thread and its value are settled is detected by static dependence analysis conducted by a compiler, and either a register transfer instruction is inserted immediately after the register to be transferred is defined (Literature 4) or a register transfer bit is set in the instruction code (Literature 5), while the receiving side suspends execution of an instruction until it receives the settled register value.
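The difference between the two transfer systems in the FIG. 13 example can be sketched as follows. This is an illustrative model only: the names are hypothetical, the function func is assumed not to modify r20, and the slave thread is assumed to begin at the instruction 5. The sketch models only which value of r20 reaches the slave thread, not the waiting on the receiving side.

```python
# Instructions 1 to 4 of the FIG. 13(a) example, executed by the master thread.
# Assumption: the function func does not modify r20.
MASTER_PROGRAM = [
    ("inc", "r20"),    # instruction 1
    ("call", "func"),  # instruction 2
    ("inc", "r20"),    # instruction 3
    ("call", "func"),  # instruction 4
]


def slave_r13(fork_before, transfer_after=None, initial_r20=0):
    """Return the value the slave thread computes for r13 (instruction 5).

    fork_before    : 0-based index of the master instruction before which the
                     fork instruction is placed
    transfer_after : 0-based index of the master instruction after which r20
                     is sent to the slave, or None for no post-forking transfer
    """
    master = {"r20": initial_r20}
    slave = None
    for i, (op, arg) in enumerate(MASTER_PROGRAM):
        if i == fork_before:
            slave = dict(master)          # register file copied at the fork
        if op == "inc":
            master[arg] += 1
        if slave is not None and i == transfer_after:
            slave["r20"] = master["r20"]  # settled value sent after forking
    if slave is None:                     # fork placed after instruction 4
        slave = dict(master)
    slave["r13"] = slave["r20"] + 1       # instruction 5, run by the slave
    return slave["r13"]
```

Starting from r20 = 0, forking after the instruction 3 (slave_r13(3)) and forking before the instruction 1 with a transfer after the instruction 3 (slave_r13(0, transfer_after=2)) both yield r13 = 3. Forking before the instruction 1 without any transfer (slave_r13(0)) yields r13 = 1, which is the stale read that constitutes the RAW offense.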
Although a multi-thread execution method is basically premised on executing in parallel preceding threads whose execution is settled, actual programs in many cases do not yield enough threads whose execution is settled. In addition, desired performance may not be obtained because the parallelization rate is kept low by dynamically determined dependences, limits on compiler analysis capability and the like. Literature 1 and others therefore introduce control speculation, in which hardware supports speculative execution of a thread. In control speculation, a thread whose execution is highly probable is executed speculatively before its execution is settled. A thread in a speculative state is executed temporarily, within the range in which the hardware can cancel the execution. The state in which a slave thread is temporarily executed is referred to as a temporary execution state, and while a slave thread is in the temporary execution state, its master thread is regarded as being in a thread temporary generation state. In a slave thread in the temporary execution state, writes to a shared memory are suppressed and are made instead to a separately provided temporary buffer. When the speculation is determined to be correct, a speculation success notification is issued from the master thread to the slave thread, whereupon the slave thread reflects the contents of the temporary buffer in the shared memory and enters a normal state in which no temporary buffer is used. In addition, the master thread moves from the thread temporary generation state to a thread generation state. When the speculation is determined to have failed, on the other hand, the master thread executes a thread abort instruction, and execution of the slave thread and the threads following it is cancelled. In addition, the master thread moves from the thread temporary generation state to a thread yet-to-be generated state, so that generation of a slave thread is allowed again. In other words, although the one fork model limits thread generation to at most one, forking is allowed again when control speculation is conducted and fails. In this case as well, at most one valid slave thread can be generated.
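The state transitions of control speculation described above can be sketched as follows. The class and method names are hypothetical, the "cancelled" state name is introduced only for the sketch, and cancellation of the threads following the slave thread is omitted.

```python
class SlaveThread:
    def __init__(self, shared_memory):
        self.shared = shared_memory
        self.temp_buffer = {}               # separately provided temporary buffer
        self.state = "temporary execution"

    def write(self, address, value):
        if self.state == "temporary execution":
            self.temp_buffer[address] = value  # write to shared memory suppressed
        elif self.state == "normal":
            self.shared[address] = value
        else:
            raise RuntimeError("thread was cancelled")

    def speculation_success(self):
        self.shared.update(self.temp_buffer)   # reflect buffer in shared memory
        self.temp_buffer.clear()
        self.state = "normal"

    def cancel(self):
        self.temp_buffer.clear()               # speculative writes are discarded
        self.state = "cancelled"               # state name assumed for the sketch


class MasterThread:
    def __init__(self, shared_memory):
        self.shared = shared_memory
        self.state = "yet-to-be generated"
        self.slave = None

    def fork_speculatively(self):
        if self.state != "yet-to-be generated":
            raise RuntimeError("one fork model: a valid slave already exists")
        self.slave = SlaveThread(self.shared)
        self.state = "temporary generation"
        return self.slave

    def notify_success(self):
        self.slave.speculation_success()
        self.state = "generation"

    def abort(self):
        self.slave.cancel()
        self.slave = None
        self.state = "yet-to-be generated"     # forking is allowed again
```

In this sketch a write made by a speculative slave thread never reaches the shared memory if the master thread aborts, and the master thread may then fork once more; after a success notification, a further fork is rejected.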
In addition to the above, the MUSCAT recited in Literature 2 provides numerous dedicated instructions for flexibly controlling the parallel operation of threads, such as inter-thread synchronization instructions.
As described above, the post-forking register transfer system enables a slave thread to be forked before the value of a register necessary for the slave thread is settled, and thus greatly improves the degree of parallelization of instruction execution compared with the register at forking transfer system. However, since a register updated in the master thread after forking is taken over by the slave thread, control is required to prevent a RAW offense from occurring on the slave thread side. When this control is realized by the above-described methods recited in Literatures 4 and 5, unnecessary synchronization occurs and degrades performance in some cases. The reason is that these methods seek to eliminate RAW offenses statically, by dependence analysis at compile time, and to synchronize the master thread and the slave thread with respect to each register to be taken over by the slave thread. In the following, the problem will be described using a specific example.
Assume, as shown in FIG. 14(a), a sequential processing program having a block a including an instruction to update a register r10, a branch instruction b, a block c including an instruction to update the register r10, and a block d including an instruction to refer to the register r10, and consider the case where the block d is forked as a slave thread immediately before the block a. In this case, since the register r10 is referred to in the block d, the value of the register r10 must be taken over from the master thread to the slave thread. After the fork point, the register r10 is updated in the block a and in the block c, but the block c is executed only when the branch of the branch instruction b is taken. Accordingly, the value to be taken over to the slave thread is the value of the register r10 updated in the block c when the branch is taken, and the value updated in the block a when the branch is not taken. In such a case, according to the conventional methods recited in Literature 5 and the like, an instruction to transfer the settled value of the register r10 to the slave thread must be inserted at a point where it has been settled whether or not the branch is taken, as shown in FIG. 14(b). As a result, in actual program execution, the instruction of the slave thread that refers to the register r10 is kept waiting for a long time irrespective of the outcome of the branch instruction b. When the branch is taken, this waiting is unavoidable because the value of the register r10 updated in the block c is referred to; when the branch is not taken, however, the value of the register r10 updated in the block a could be used as it is, so the waiting is unnecessary waiting for synchronization.
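The unnecessary waiting in the FIG. 14 example can be made concrete with a simplified timing sketch. The function name and the cycle counts are hypothetical, and the placement of the transfer instruction is modeled under the assumption that it executes once the branch outcome is settled, and after the block c when the branch is taken.

```python
def slave_wait_cycles(branch_taken, len_a, len_b, len_c):
    """Cycles after the fork until r10 becomes available to the slave thread.

    len_a, len_b, len_c : assumed execution times of the block a, the branch
                          instruction b and the block c, in cycles
    Returns (static, ideal):
      static : wait when the transfer instruction is placed statically where
               the branch outcome is settled
      ideal  : wait if the value could be used as soon as it is actually final
    """
    # Static placement: the transfer is issued only after the branch outcome is
    # settled, and after the block c when the branch is taken.
    static = len_a + len_b + (len_c if branch_taken else 0)
    # When the branch is not taken, the value written in the block a is
    # already the final one.
    ideal = static if branch_taken else len_a
    return static, ideal
```

With the assumed lengths of 5, 2 and 4 cycles, slave_wait_cycles(True, 5, 2, 4) gives (11, 11), so the waiting is unavoidable, whereas slave_wait_cycles(False, 5, 2, 4) gives (7, 5), so 2 of the 7 cycles are unnecessary waiting for synchronization.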