1. Field of the Invention
The present invention relates to a parallel processor system in which a plurality of processors execute, in parallel with each other, a plurality of threads obtained by dividing a single program, and more particularly, to a parallel processor system that enables forking to a processor whose thread has terminated but is yet to be settled.
2. Description of the Related Art
One method of processing a single program in parallel on a parallel processor system is a multi-thread execution method, in which instruction streams called threads, obtained by dividing the program, are executed in parallel with each other by a plurality of processors. Literatures reciting this method include, for example, Japanese Patent Laying-Open No. 10-27108 (hereinafter referred to as Literature 1), “Proposal of On Chip Multiprocessor Oriented Control Parallel Architecture MUSCAT” (Proceedings of the Parallel Processing Symposium JSPP97, Japanese Society of Information Processing Engineers of Japan, pp. 229–236, May 1997) (hereinafter referred to as Literature 2), Japanese Patent Laying-Open No. 10-78880 (hereinafter referred to as Literature 3), “Processor Architecture SKY Using Interthread Instruction Level Parallelism of Non-numeric Calculation Program” (Proceedings of the Parallel Processing Symposium JSPP98, Japanese Society of Information Processing Engineers of Japan, pp. 87–94, June 1998) (hereinafter referred to as Literature 4), and “Multiscalar Processor” (G. S. Sohi, S. E. Breach and T. N. Vijaykumar, the 22nd International Symposium on Computer Architecture, IEEE Computer Society Press, 1995, pp. 414–425) (hereinafter referred to as Literature 5). In the following, the conventional techniques recited in these Literatures will be described.
In general, in a multi-thread execution method, generating a new thread on another processor is called “forking a thread”. The thread that conducts the forking operation is called a master thread, the newly generated thread is called a slave thread, the point where a thread is forked is called a fork point, and the head of a slave thread is called a fork destination address or a start point of the slave thread. In Literatures 1 to 4, a fork instruction is inserted at a fork point to instruct thread forking. The fork instruction designates a fork destination address, so that executing it generates, on another processor, a slave thread starting at the fork destination address and starts execution of that slave thread. In addition, an instruction called a term instruction is provided for terminating processing of a thread, and each processor terminates processing of its thread by executing the term instruction.
FIG. 15 outlines processing by a multi-thread execution method. FIG. 15(a) shows a single program divided into three threads A, B and C. When the program is processed by a single processor, one processor PE sequentially processes the threads A, B and C as shown in FIG. 15(b). By contrast, as shown in FIG. 15(c), in the multi-thread execution methods recited in Literatures 1 to 5, one processor PE1 executes the thread A and, while it does so, the thread B is generated in another processor PE2 by a fork instruction embedded in the thread A and is executed in the processor PE2. The processor PE2 likewise generates the thread C in a processor PE3 according to a fork instruction embedded in the thread B. The processors PE1 and PE2 terminate processing of their threads according to term instructions embedded immediately before the start points of the threads B and C, respectively, and after executing the last instruction of the thread C, the processor PE3 executes the subsequent instruction (in general, a system call instruction). By thus executing the threads simultaneously in parallel on a plurality of processors, higher performance is obtained than with sequential processing.
As another conventional multi-thread execution method, there is a method in which the processor PE1 executing the thread A conducts forking a plurality of times, generating the thread B in the processor PE2 and the thread C in the processor PE3, as shown in FIG. 15(d). In contrast to the model shown in FIG. 15(d), a multi-thread execution method constrained, as shown in FIG. 15(c), so that a thread is allowed to generate a valid slave thread only once during its existence is called a Fork-Once Parallel Execution model. The present invention is premised on this Fork-Once Parallel Execution model. The Fork-Once Parallel Execution model drastically simplifies thread management and allows a thread management unit to be realized in hardware on a practical scale. Moreover, each individual processor generates a slave thread in exactly one other processor. In Literatures 1 to 4, therefore, multi-thread execution is realized using a parallel processor system in which adjacent processors are connected in a ring in a single direction.
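The Fork-Once constraint and the unidirectional ring can be illustrated with a minimal sketch. The class name `ForkOnceThread`, its fields, and the zero-based processor indices are hypothetical names chosen for illustration; they do not appear in the Literatures.

```python
class ForkOnceThread:
    """Hypothetical model of a thread that may fork at most one valid slave."""

    def __init__(self, name, pe, num_pes):
        self.name = name
        self.pe = pe                # index of the processor running this thread
        self.num_pes = num_pes      # number of processors in the ring
        self.has_forked = False

    def fork(self, slave_name):
        # Fork-Once constraint: a second valid fork from the same thread is rejected.
        if self.has_forked:
            raise RuntimeError(f"{self.name} has already forked a slave thread")
        self.has_forked = True
        # Unidirectional ring: the slave always lands on the adjacent processor.
        next_pe = (self.pe + 1) % self.num_pes
        return ForkOnceThread(slave_name, next_pe, self.num_pes)


# Threads A, B and C as in FIG. 15(c), with processors numbered from 0.
a = ForkOnceThread("A", pe=0, num_pes=3)
b = a.fork("B")
c = b.fork("C")
placement = [(t.name, t.pe) for t in (a, b, c)]
print(placement)
```

Because each thread has at most one slave and the slave's processor is fixed by the ring, the sketch needs no bookkeeping beyond one flag per thread, which reflects why thread management becomes simple under this model.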
In a conventional parallel processor system, each processor is managed in one of two states, a free state and a busy state. The free state is a state in which processor resources are released and the processor is ready to start execution of a new thread at any time. In a parallel processor system in which processors are connected in a ring in a single direction, when a certain processor makes a fork request, a slave thread is forked only if the adjacent processor is in the free state. When a processor in the free state starts execution of a thread, it transits to the busy state, and when execution of the thread is completed and a stop permission is obtained from the thread management unit, it returns to the free state. Obtaining a stop permission from the thread management unit is made a condition of returning to the free state because, when a plurality of threads having a sequential execution order relationship are executed in parallel, the constraint is imposed that a slave thread is not allowed to stop unless its master thread has stopped, and the thread management unit, which controls thread generation and stop, enforces this constraint.
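The two-state management and the stop-permission rule can be sketched as follows. This is a simplified model under stated assumptions, not the thread management unit of any Literature: the names `State`, `ThreadManagementUnit`, `start_thread` and `finish_thread` are hypothetical, and threads are assumed to be started in program order.

```python
from collections import deque
from enum import Enum


class State(Enum):
    FREE = "free"
    BUSY = "busy"


class ThreadManagementUnit:
    """Hypothetical model: stop permission is granted in program order only."""

    def __init__(self, num_pes):
        self.state = [State.FREE] * num_pes
        self.order = deque()    # processors of live threads, oldest master first
        self.finished = set()   # processors whose thread awaits stop permission

    def start_thread(self, pe):
        if self.state[pe] is not State.FREE:
            return False        # fork request rejected: the target is busy
        self.state[pe] = State.BUSY
        self.order.append(pe)
        return True

    def finish_thread(self, pe):
        self.finished.add(pe)
        # A slave thread may stop only after its master thread has stopped,
        # so permissions are handed out from the oldest thread onward.
        while self.order and self.order[0] in self.finished:
            head = self.order.popleft()
            self.finished.discard(head)
            self.state[head] = State.FREE


tmu = ThreadManagementUnit(num_pes=3)
tmu.start_thread(0)                 # master thread
tmu.start_thread(1)                 # slave thread
tmu.finish_thread(1)                # the slave completes first
state_after_slave = tmu.state[1]    # still busy: the master has not stopped
tmu.finish_thread(0)                # the master stops, then the slave may stop
print(state_after_slave, tmu.state)
```

In this model a processor whose thread has already completed remains in the busy state until every preceding thread has stopped.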
When a slave thread is forked, registers must be taken over from the master thread by the slave thread. Register takeover is generally conducted in one of two manners. One, adopted in the parallel processor systems recited in Literatures 1 to 3, takes over only the contents of the register file of the master thread as of the forking, and not the contents of registers updated after the forking. The other, adopted in the parallel processor systems recited in Literatures 4 and 5, takes over registers updated after the forking as well. The former is called a fork-time register-values transfer method, and the latter is called an after-fork register-values transfer method.
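The difference between the two transfer methods can be shown with a small sketch. The function names and the register names `r1` and `r2` are hypothetical, and the sketch ignores timing: real hardware implementing the after-fork method would also need a mechanism for delivering later updates to the slave, which is not modeled here.

```python
def fork_time_transfer(regs_at_fork, updates_after_fork):
    """Slave sees only the master's register file as of the forking."""
    return dict(regs_at_fork)          # later updates are deliberately ignored


def after_fork_transfer(regs_at_fork, updates_after_fork):
    """Slave also takes over registers the master updates after the forking."""
    slave_regs = dict(regs_at_fork)
    slave_regs.update(updates_after_fork)
    return slave_regs


regs_at_fork = {"r1": 10, "r2": 20}    # master register file at the fork point
updates_after_fork = {"r2": 99}        # master writes r2 after forking

fork_time_view = fork_time_transfer(regs_at_fork, updates_after_fork)
after_fork_view = after_fork_transfer(regs_at_fork, updates_after_fork)
print(fork_time_view)    # {'r1': 10, 'r2': 20}
print(after_fork_view)   # {'r1': 10, 'r2': 99}
```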
Although a multi-thread execution method is premised on parallel execution of preceding threads whose execution is settled, actual programs in many cases do not yield enough threads whose execution is settled. In addition, desired performance may not be obtained because the rate of parallelization is kept low by such limitations as dynamically determined dependences and compiler analysis capability. Thread parallel processing therefore adopts thread-basis non-program order execution, which speeds up program execution by executing threads in parallel irrespective of the order relationship among them, while taking into account the memory dependence relationships derived from that order relationship so that proper program execution results are ensured.
Thread-basis non-program order execution also requires the dependence relationships among instructions included in threads to be eliminated or ensured in order to obtain proper program execution results. As with instruction-basis non-program order execution, however, there is a problem that, with respect to a normal dependence relationship regarding a memory in particular, instructions basically have to be executed in program order, and when program order execution is enforced deterministically, the performance improvement of non-program order execution cannot be fully obtained. The problem is especially serious in thread-basis non-program order execution because non-program order execution is prevented in units of threads, each composed of a plurality of instructions. As in instruction-basis non-program order execution, an effective solution to this problem is data dependent speculative execution. In other words, what is effective is thread-basis data dependent speculative execution, in which a thread is speculatively executed in non-program order, on the assumption that no normal dependence relationship exists, before it is determined whether a normal dependence relationship exists among the instructions included in the threads.
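Thread-basis data dependent speculative execution can be sketched as follows, reading a normal dependence as a read-after-write dependence through memory. The detection scheme shown (recording the addresses a speculative slave has loaded and cancelling it when the master later stores to one of them) is an assumption made for illustration, and the names `SpeculativeThread` and `master_store` are hypothetical.

```python
class SpeculativeThread:
    """Hypothetical slave thread that runs ahead, assuming no normal dependence."""

    def __init__(self, memory):
        self.memory = memory
        self.loaded = set()      # addresses read speculatively so far
        self.cancelled = False

    def load(self, addr):
        self.loaded.add(addr)
        return self.memory[addr]


def master_store(memory, addr, value, slave):
    """Master thread store; checks whether the slave's speculation was wrong."""
    memory[addr] = value
    # If the slave already read this address, it used a stale value:
    # a normal dependence existed after all, so the slave must be cancelled.
    if addr in slave.loaded:
        slave.cancelled = True


memory = {0: 1, 4: 2}
slave = SpeculativeThread(memory)
stale_value = slave.load(4)             # slave reads address 4 ahead of the master

master_store(memory, 0, 10, slave)      # different address: speculation still valid
cancelled_after_first = slave.cancelled

master_store(memory, 4, 20, slave)      # same address: speculation failed
print(cancelled_after_first, slave.cancelled, stale_value)
```

When no conflicting store occurs, the slave's work is kept and the threads have effectively run out of program order at no cost. The cost of a wrong assumption is the cancelled work.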
On the other hand, a reverse dependence relationship and an output dependence relationship regarding a memory can be eliminated, as is done in instruction-basis non-program order execution, by, for example, temporarily storing data to be written by a store instruction in a buffer or a memory inherent to a processor, which enables non-program order execution.
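The buffering idea can be sketched with a per-processor store buffer. This is a minimal illustration, not the cache control system of any Literature: the class `BufferedProcessor` and its methods are hypothetical, and commits are assumed to be performed in program order by some external control.

```python
class BufferedProcessor:
    """Hypothetical processor that holds its stores in a private buffer."""

    def __init__(self, memory):
        self.memory = memory        # shared main memory: address -> value
        self.store_buffer = {}      # private stores not yet written back

    def store(self, addr, value):
        self.store_buffer[addr] = value

    def load(self, addr):
        # A thread sees its own buffered stores first, then shared memory.
        if addr in self.store_buffer:
            return self.store_buffer[addr]
        return self.memory[addr]

    def commit(self):
        # Write the buffered stores back to shared memory.
        self.memory.update(self.store_buffer)
        self.store_buffer.clear()


memory = {0x10: 1}
earlier = BufferedProcessor(memory)   # runs the thread that is earlier in program order
later = BufferedProcessor(memory)     # runs the thread that is later in program order

later.store(0x10, 2)                  # the later thread writes first, out of order
seen_by_earlier = earlier.load(0x10)  # reverse dependence: the old value is still read
earlier.store(0x10, 5)

earlier.commit()                      # commits follow program order,
later.commit()                        # so the later thread's value is final
print(seen_by_earlier, memory[0x10])
```

Because the later thread's store stays private until commit, the earlier thread's load is unaffected (reverse dependence) and the final memory value is the one program order requires (output dependence).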
Regarding thread parallel processing conducted in a parallel processor system composed of a plurality of processors each having an inherent cache memory, Japanese Patent No. 3139392 (hereinafter referred to as Literature 6), for example, discloses a cache memory control system which eliminates a reverse dependence relationship and an output dependence relationship regarding a memory. One example of a cache memory control system coping with a normal dependence relationship in addition to a reverse dependence relationship and an output dependence relationship regarding a memory is “Speculative Versioning Cache”, by S. Gopal, T. N. Vijaykumar, J. E. Smith, G. S. Sohi et al., Proceedings of the 4th International Symposium on High-Performance Computer Architecture, February 1998 (hereinafter referred to as Literature 7).
In addition to those mentioned above, the MUSCAT recited in Literature 2 provides numerous dedicated instructions, such as synchronization instructions between threads, for flexibly controlling parallel operation of threads.
While a parallel processor system in which adjacent processors are connected in a ring in a single direction has the advantage of simplifying hardware, it has the disadvantage that processor resources cannot be used effectively when the particle size (granularity) of threads varies, which degrades the degree of parallelization of threads. FIGS. 16 and 17 show such a case. As shown in FIG. 16, when the particle size of the threads is relatively small and substantially uniform, slave thread forking is conducted sequentially in the order of a thread th0, a thread th1, a thread th2 and a thread th3, so that it is highly probable that, at the time point where the last processor PE3 conducts slave thread forking, its adjacent processor PE0 is in the free state. Forking of a slave thread th4 from the thread th3 is therefore possible. Forking of a slave thread th5 from the thread th4 to the adjacent processor PE1 is similarly possible, so that a high degree of parallelization is ensured. When the particle size of the thread th0, for example, is larger than that of the other threads, however, at the time point where the processor PE3 conducts slave thread forking, its adjacent processor PE0 is still in the busy state executing the thread th0, so that forking is impossible and the degree of parallelization of threads decreases as shown in FIG. 17.
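The effect of thread particle size on forking in a unidirectional ring can be reproduced with a rough simulation. All numbers and names are hypothetical: thread durations are arbitrary time units, each thread is assumed to attempt its fork `fork_delay` units after it starts, a processor is assumed to become free only when its thread may stop (after its master has stopped), and a thread whose fork fails is assumed to run the next thread itself once it finishes. The last assumption is one possible fallback and is not taken from the description above.

```python
def simulate_ring(durations, num_pes=4, fork_delay=1):
    """Return (finish_time, successful_forks) for threads forked around a ring."""
    pe_free_at = [0] * num_pes   # time at which each processor returns to free
    pe, start = 0, 0             # processor and start time of the current thread
    prev_stop = 0                # time at which the previous thread stopped
    forks = 0
    for i, duration in enumerate(durations):
        end = start + duration
        stop = max(end, prev_stop)        # a slave stops only after its master
        pe_free_at[pe] = stop
        prev_stop = stop
        if i + 1 == len(durations):
            break
        fork_time = start + min(fork_delay, duration)
        next_pe = (pe + 1) % num_pes      # only the adjacent processor is a target
        if pe_free_at[next_pe] <= fork_time:
            forks += 1
            pe, start = next_pe, fork_time
        else:
            start = end                   # fork failed: continue on the same processor
    return prev_stop, forks


uniform = simulate_ring([4, 4, 4, 4, 4, 4])   # th0..th5 of equal size (cf. FIG. 16)
skewed = simulate_ring([12, 4, 4, 4, 4, 4])   # th0 is much larger (cf. FIG. 17)
print(uniform)   # (9, 5)
print(skewed)    # (15, 3)
```

With uniform threads every fork succeeds. With a large th0, the forks from the last processor back to the first fail, and the remaining threads run one after another on a single processor.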
On the other hand, the parallel processor system shown in FIG. 3 of Literature 1 adopts a structure in which a plurality of processors are connected to each other through a common bus, so that the processor to which an individual processor forks a slave thread is not limited to its adjacent processor. In this system, each processor is managed as being in either the free state or the busy state, and the processor to which a slave thread is forked is selected from among the processors in the free state. Moreover, a processor such as the processor PE1 of FIG. 17, which executes the thread th1 whose master thread th0 is yet to be completed, is managed as being in the busy state. It is therefore impossible to fork a slave thread from the thread th3 of the processor PE3 to the processor PE1.
In addition, in a case where the thread th1 is a speculative thread, how to handle the processing results of the thread th1 poses a problem when the resources of the processor PE1 are released. There are two reasons. First, since the thread th1 may be cancelled by the master thread th0, its processing result cannot be written back to a main memory. Second, because a slave thread of the thread th1, such as the thread th2, needs to take over the processing result of the thread th1, that processing result cannot be discarded.