1. Field of the Invention
The present invention relates to a program parallelizing apparatus, a program parallelizing method and a program parallelizing program for creating a parallelized program for a multithreading parallel processor from a sequential processing program.
2. Description of the Prior Art
As a method of processing a single sequential processing program in parallel in a parallel processor system, there has been known a multithreading method in which a program is divided into instruction streams called threads and executed in parallel by a plurality of processors. This method is described in, for example, Japanese Patent Application laid open No. HEI10-27108 (hereinafter referred to as Reference 1), No. HEI10-78880 (Reference 2), No. 2003-029985 (Reference 3), No. 2003-029984 (Reference 4), and “Proposal for On Chip Multiprocessor-oriented Control Parallel Architecture MUSCAT”, Joint Symposium on Parallel Processing JSPP97, Information Processing Society of Japan, pp. 229-236, May 1997 (Reference 5). A parallel processor that executes multiple threads is called a multithreading parallel processor. In the following, a description will be given of conventional multithreading methods and multithreading parallel processors.
Generally, in a multithreading method and a multithreading parallel processor, creating a new thread on another processor is called “thread forking”. A thread that performs a fork is a parent thread, while a thread newly created from the parent thread is a child thread. The program location where a thread is forked will be referred to as a fork source address or a fork source point. The program location at the beginning of a child thread will be referred to as a fork destination address, a fork destination point, or a child thread start point. In the aforementioned References, a fork command is inserted at the fork source point to instruct the forking of a thread. The fork destination address is specified in the fork command. When the fork command is executed, a child thread that starts at the fork destination address is created on another processor, and then the child thread is executed. A program location where the processing of a thread is to be ended is called a terminal (term) point, at which each processor finishes processing the thread.
FIG. 1 shows an outline of the processing conducted by a multithreading parallel processor in a multithreading method. FIG. 1 (a) shows a sequential processing program divided into three threads A, B and C. When the program is processed in a single processor, one processor element sequentially processes threads A, B and C as shown in FIG. 1 (b). In contrast, according to a multithreading method in a multithreading parallel processor described in the above References, as shown in FIG. 1 (c), thread A is executed by processor PE1, and, while processor PE1 is executing thread A, thread B is generated on another processor PE2 by a fork command embedded in thread A, and thread B is executed by processor PE2. Processor PE2 generates thread C on processor PE3 by a fork command embedded in thread B. Processors PE1 and PE2 finish processing the threads at terminal points immediately before the start points of threads B and C, respectively. Having executed the last command of thread C, processor PE3 executes the next command (usually a system call command). As just described, by concurrently executing threads in a plurality of processors, performance can be improved as compared with the sequential processing.
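By way of illustration only, the timing relationship between FIG. 1 (b) and FIG. 1 (c) can be sketched as follows. The function names and cycle counts are hypothetical, and the sketch assumes that a fork has no overhead and that the threads have no data dependencies on one another; neither assumption is taken from the References.

```python
def sequential_time(lengths):
    """Cycles needed when one processor runs every thread back to back."""
    return sum(lengths)


def chained_fork_time(lengths, fork_offsets):
    """Cycles needed when thread i forks thread i+1 on the next processor.

    lengths[i]: hypothetical cycle count of thread i.
    fork_offsets[i]: cycle, counted from the start of thread i, at which
    thread i executes its fork command.
    Assumes zero fork overhead and no inter-thread dependencies.
    """
    start = 0   # cycle at which the current thread begins
    finish = 0  # latest completion time seen so far
    for i, length in enumerate(lengths):
        finish = max(finish, start + length)
        if i < len(fork_offsets):
            start += fork_offsets[i]  # the child starts when the parent forks
    return finish
```

For three threads of 10 cycles each, with each fork command executed 2 cycles into its thread, this sketch gives 30 cycles for sequential processing and 14 cycles for the chained case.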
There is another multithreading method, as shown in FIG. 1 (d), in which forks are performed several times by the processor PE1 that is executing thread A to create threads B and C on processors PE2 and PE3, respectively. In contrast to the processing model or multithreading method of FIG. 1 (d), that of FIG. 1 (c) is restricted in such a manner that a thread can create a valid child thread only once while the thread is alive. This model is called a fork-one model. The fork-one model substantially simplifies the management of threads. Consequently, a thread managing unit can be implemented by hardware of practical scale. Further, each processor can create a child thread on only one other processor, and therefore, multithreading can be achieved by a parallel processor system in which adjacent processors are connected unidirectionally in a ring form.
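The fork-one restriction and the ring connection described above can be modelled with a minimal sketch. The class `Thread`, its attributes, and the processor numbering are hypothetical names chosen for illustration; an actual thread managing unit would be implemented in hardware.

```python
class Thread:
    """A thread under the fork-one restriction (hypothetical model)."""

    def __init__(self, name, pe, num_pes):
        self.name = name
        self.pe = pe            # index of the processor running this thread
        self.num_pes = num_pes  # processors connected in a unidirectional ring
        self.has_valid_child = False

    def fork(self, child_name):
        """Create a child on the adjacent processor; allowed once per lifetime."""
        if self.has_valid_child:
            raise RuntimeError(f"thread {self.name} already has a valid child")
        self.has_valid_child = True
        next_pe = (self.pe + 1) % self.num_pes  # only the ring neighbour is reachable
        return Thread(child_name, next_pe, self.num_pes)
```

Because each thread tracks a single flag and addresses only its ring neighbour, the per-thread management state in this model stays constant regardless of the number of processors.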
There is a commonly known method that can be used in the case where no processor is available on which to create a child thread when a processor is to execute a fork command. That is, the processor waits to execute the fork command until a processor on which a child thread can be created becomes available. In addition, Reference 4 describes another method in which the processor invalidates or nullifies the fork command, continues executing the instructions subsequent to the fork command, and then executes the instructions of the child thread itself.
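The two ways of handling a fork command when no processor is free can be contrasted with a minimal sketch. The function name, the policy strings "wait" and "nullify", and the instruction labels are hypothetical; the sketch records only what the parent's processor executes.

```python
def parent_processor_trace(rest_of_parent, child, cycles_until_free, policy):
    """Return (trace, child_created_elsewhere) for the parent's processor.

    rest_of_parent: instructions of the parent after the fork command.
    child: instructions of the would-be child thread.
    cycles_until_free: cycles until another processor becomes available
    (0 means one is available immediately).
    policy: "wait" or "nullify" (hypothetical labels for the two methods).
    """
    if cycles_until_free == 0:
        # A processor is free: fork normally, the child runs elsewhere.
        return list(rest_of_parent), True
    if policy == "wait":
        # Stall until a processor becomes available, then fork.
        return ["stall"] * cycles_until_free + list(rest_of_parent), True
    if policy == "nullify":
        # Drop the fork; the same processor later runs the child's instructions.
        return list(rest_of_parent) + list(child), False
    raise ValueError(f"unknown policy: {policy}")
```

With the "wait" policy the parent loses cycles but parallelism is preserved; with the "nullify" policy no cycles are spent stalling, but the child's instructions are executed sequentially after the parent's.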
For a parent thread to create a child thread such that the child thread performs predetermined processing, the parent thread is required to pass to the child thread at least those register values, held in the register file at the fork point of the parent thread, that are necessary for the child thread. To reduce the cost of data transfer between the threads, in References 2 and 6, a register value inheritance mechanism used at thread creation is provided through hardware. With this mechanism, the contents of the register file of a parent thread are entirely copied into a child thread at thread creation. After the child thread is produced, the register values of the parent and child threads are changed or modified independently of each other, and no data is transferred therebetween through registers. As another conventional technique concerning data passing between threads, there has been proposed a parallel processor system provided with a mechanism to individually transfer a register value for each register by a command.
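The register value inheritance described above amounts to a full copy of the register file at the fork point, after which the two files evolve independently. A minimal sketch follows; the register file size and the function name are hypothetical, and a Python list stands in for the hardware register file.

```python
NUM_REGISTERS = 8  # hypothetical register file size


def fork_child_registers(parent_registers):
    """Return the child's register file: a full copy taken at the fork point."""
    return list(parent_registers)


parent = [0] * NUM_REGISTERS
parent[3] = 42
child = fork_child_registers(parent)

parent[3] = 7   # a later parent update is not visible to the child
child[4] = 99   # a later child update is not visible to the parent
```

After these statements the child still holds the inherited value in register 3, and register 4 of the parent is unchanged.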
In the multithreading method, basically, previous threads whose execution has been determined are executed in parallel. However, in actual programs, it is often the case that not enough threads whose execution has been determined can be obtained. Additionally, the parallelization ratio may be low due to dynamically determined dependencies, limitations of the analytical capabilities of the compiler, and the like, so that desired performance cannot be achieved. Accordingly, in Reference 1, control speculation is adopted to support the speculative execution of threads through hardware. In the control speculation, threads with a high possibility of execution are speculatively executed before the execution is determined. The thread in the speculative state is temporarily executed to the extent that the execution can be cancelled via hardware. The state in which a child thread performs temporary execution is referred to as the temporary execution state. When a child thread is in the temporary execution state, a parent thread is said to be in the temporary thread creation state. In the child thread in the temporary execution state, writing to a shared memory and a cache memory is restrained, and data is written to a temporary buffer additionally provided. When it is confirmed that the speculation is correct, the parent thread sends a speculation success notification to the child thread. The child thread reflects the contents of the temporary buffer in the shared memory and the cache memory, and then returns to the ordinary state in which the temporary buffer is not used. The parent thread changes from the temporary thread creation state to the thread creation state. On the other hand, when failure of the speculation is confirmed, the parent thread executes a thread abort command “abort” to cancel the execution of the child thread and subsequent threads. The parent thread changes from the temporary thread creation state to the non-thread creation state.
Thereby, the parent thread can generate a child thread again. That is, in the fork-one model, although the thread creation can be carried out only once, if control speculation is performed and the speculation fails, a fork can be performed again. Also in this case, only one valid child thread can be produced.
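The behaviour of a child thread in the temporary execution state can be illustrated with a minimal software sketch. The class `SpeculativeChild`, its method names, and the use of a dictionary for the shared memory are hypothetical; in Reference 1 the mechanism is supported through hardware, and the cache memory is omitted here for brevity.

```python
class SpeculativeChild:
    """Child thread whose stores are buffered until speculation is resolved."""

    def __init__(self, shared_memory):
        self.shared_memory = shared_memory  # dict mapping address to value
        self.temp_buffer = {}               # temporary buffer for speculative stores
        self.state = "temporary execution"

    def store(self, address, value):
        if self.state == "cancelled":
            raise RuntimeError("thread was aborted")
        if self.state == "temporary execution":
            self.temp_buffer[address] = value    # shared memory is left untouched
        else:
            self.shared_memory[address] = value  # ordinary state: write directly

    def speculation_success(self):
        """Reflect the buffered stores in shared memory and leave the speculative state."""
        self.shared_memory.update(self.temp_buffer)
        self.temp_buffer.clear()
        self.state = "ordinary"

    def abort(self):
        """Discard the buffered stores; nothing reaches shared memory."""
        self.temp_buffer.clear()
        self.state = "cancelled"
```

Because an aborted child leaves shared memory unchanged in this model, the parent can return to the non-thread creation state and fork again without any cleanup of memory contents.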
To implement the multithreading of the fork-one model, in which a thread creates a valid child thread at most once in its lifetime, for example, the technique described in Reference 5 places restrictions on the compilation for creating a parallelized program from a sequential processing program so that the instruction code of every thread performs a valid fork only once. In other words, the fork-once limit is statically guaranteed on the parallelized program. On the other hand, according to Reference 3, from a plurality of fork commands in a parent thread, one fork command to create a valid child thread is selected during the execution of the parent thread to thereby guarantee the fork-once limit at the time of program execution.
A description will now be given of the prior art to generate a parallel program for a parallel processor to implement multithreading.
As can be seen in FIG. 2, a conventional program parallelizing apparatus 10 receives a sequential processing program 13. A control/data flow analyzer 11 analyzes the control and data flow of the program 13. Based on the results of the analysis, a fork inserter 12 takes a basic block or a plurality of basic blocks as the unit of parallelization, that is, it uses the locations of the respective conditional branch instructions as candidate fork points. Referring to the analysis results of the data and control flow, the fork inserter 12 places a fork command at each fork point which leads to higher parallel execution performance. The fork inserter 12 divides the program into a plurality of threads to produce a parallelized program 14.
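The selection of candidate fork points at conditional branch instructions can be sketched as follows. The instruction representation as (address, opcode) tuples and the opcode names are hypothetical and do not correspond to any particular instruction set; the subsequent choice among the candidates based on flow analysis is not modelled.

```python
# Hypothetical opcode names standing in for conditional branch instructions.
CONDITIONAL_BRANCH_OPCODES = {"beq", "bne", "blt", "bge"}


def candidate_fork_points(program):
    """Return the addresses of conditional branches in a sequential program.

    program: list of (address, opcode) tuples in program order.
    """
    return [address for address, opcode in program
            if opcode in CONDITIONAL_BRANCH_OPCODES]


program = [(0, "add"), (4, "beq"), (8, "load"), (12, "bne"), (16, "jmp")]
candidates = candidate_fork_points(program)
```

In this example the unconditional jump at address 16 is not a candidate, so only the addresses 4 and 12 are returned.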
In conjunction with FIG. 2, a description has been given of the program parallelizing apparatus 10 which produces the parallelized program 14 from the sequential processing program 13 created by a sequential compiler. Further, as described in Japanese Patent Application laid open No. 2001-282549 (Reference 6), there is known another technique in which a program written in a high level language is processed to produce a target program for a multithreading parallel processor. In addition, due to the influence of program execution flow and memory dependencies which can be determined only at program execution time, the fork insertion method based on static analysis may not obtain desired parallel execution performance. To cope with this disadvantage, there has been employed a technique as described in Reference 6 in which fork points are determined by referring to profile information such as a conditional branch probability and a data dependence occurrence frequency at the time of sequential execution. Also in this case, the locations of conditional branch instructions are used as candidate fork points.
However, the prior art has a problem that, when fork points with better parallel execution performance are desired, the process to determine the fork points takes a longer time for the following reason. As the number of candidate fork points is increased to obtain fork points with better parallel execution performance, the time taken to determine an optimal combination of fork points becomes longer.
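The growth of the search space with the number of candidate fork points can be illustrated with a minimal sketch. It assumes, purely for illustration, that every subset of the candidates is evaluated exhaustively; the References are not stated to search in this way, and the function names are hypothetical.

```python
from itertools import combinations


def fork_point_combinations(candidates):
    """Yield every subset of the candidate fork points, including the empty one."""
    for size in range(len(candidates) + 1):
        yield from combinations(candidates, size)


def count_combinations(num_candidates):
    """Count the subsets an exhaustive search would have to evaluate."""
    return sum(1 for _ in fork_point_combinations(range(num_candidates)))
```

Under this assumption the count is 2 to the power of the number of candidates, so 10 candidates give 1024 combinations and each additional candidate doubles the number to be examined.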