1. Field of the Invention
The present invention relates to a program conversion technique for a multithreading processor in which a plurality of processors execute a plurality of threads in parallel, and, more particularly, to a program parallelization device, a program parallelization method and a program parallelization program (compiler) for rapidly generating a parallelized program having high execution efficiency.
2. Description of the Related Art
One technique for processing a single sequential processing program in parallel with a parallel processor system is a multithreading method in which a program is divided into command streams called threads and executed in parallel by a plurality of processors (see, for example, Japanese Laid-open Patent (Kokai) No. Heisei 10-27108 (hereinafter referred to as Reference 1), Heisei 10-78880 (Reference 2) and 2003-029985 (Reference 3), and “Proposal for On Chip Multiprocessor-oriented Control Parallel Architecture MUSCAT” (Collection of Papers from the Parallel Processing Symposium JSPP97, Information Processing Society of Japan, pp. 229-236, May 1997) (Reference 4)). Conventional multithreading methods will be described below.
Typically, in the multithreading method, generating a new thread on another processor is called forking a thread, the thread that performed the fork operation is the parent thread, the newly generated thread is the child thread, the program location where the thread is forked is the fork point, and the program location at the beginning of the child thread is called the fork destination address or the start point of the child thread. In References 1-4, a fork command is inserted at the fork point to instruct the forking of a thread. The fork destination address is specified in the fork command, the child thread that starts at the fork destination address thereof by the execution of the fork command is generated on another processor, and the execution of the child thread is started. A program location where the processing of the thread is to be ended is called a terminal point, where each processor finishes processing the thread.
FIGS. 21, 22, 23 and 24 show an overview of the processing in the multithreading method. FIG. 21 shows a sequential processing program divided into three threads A, B and C. When this program is processed with a single processor, one processor PE processes the threads A, B and C in turn as shown in FIG. 22. In contrast, in the multithreading methods in References 1-4, as shown in FIG. 23, a thread A is executed by one processor PE1, and, while the processor PE1 is executing the thread A, a thread B is generated on another processor PE2 by a fork command embedded in the thread A, and the thread B is executed by the processor PE2. Furthermore, the processor PE2 generates a thread C on the processor PE3 by a fork command embedded in the thread B. The processors PE1 and PE2 each finish processing the threads at the terminal points immediately before the start points of the threads B and C, and the processor PE3 executes the last command of the thread C before executing the next command (usually a system call command). Thus, by simultaneously executing the threads in parallel with a plurality of processors, performance can be improved as compared with sequential processing.
There is another multithreading method, as shown in FIG. 24, in which forks are performed several times by the processor PE1, which is executing the thread A, to generate the thread B on the processor PE2, and the thread C on the processor PE3, respectively. In contrast to the model of FIG. 24, a multithreading method constrained, as shown in FIG. 23, in such a way that a thread can generate a valid child thread at most once in its lifetime is called a one-time fork model. The one-time fork model substantially simplifies the management of threads, and allows thread management to be implemented in hardware on a practical scale. Furthermore, since each processor can generate a child thread on only one other processor, multithreading can be performed with a parallel processor system in which adjacent processors are connected unidirectionally in a ring form.
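The one-time fork constraint and the ring connection described above can be illustrated with a minimal sketch. The names `Thread`, `OneTimeForkError` and `fork` are hypothetical and chosen only for this illustration; they do not appear in References 1-4, and processors are indexed from 0 rather than from PE1.

```python
class OneTimeForkError(RuntimeError):
    """Raised when a thread attempts a second valid fork (hypothetical name)."""


class Thread:
    def __init__(self, name, processor, num_processors):
        self.name = name
        self.processor = processor            # index of the processor running this thread
        self.num_processors = num_processors
        self.child = None                     # at most one valid child thread

    def fork(self, child_name):
        # One-time fork model: a thread may generate a valid child at most once.
        if self.child is not None:
            raise OneTimeForkError(f"{self.name} has already forked")
        # Unidirectional ring: the child always runs on the adjacent processor.
        next_processor = (self.processor + 1) % self.num_processors
        self.child = Thread(child_name, next_processor, self.num_processors)
        return self.child


a = Thread("A", processor=0, num_processors=3)
b = a.fork("B")   # thread A forks thread B onto the next processor
c = b.fork("C")   # thread B forks thread C, as in FIG. 23
```

In this sketch, a second call to `a.fork(...)` raises `OneTimeForkError`, which corresponds to the model of FIG. 24 being disallowed.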
In order for a parent thread to generate a child thread and have the child thread perform a prescribed process, it is necessary that at least one register value from a register file at the fork point of the parent thread that is required by the child thread be delivered from the parent thread to the child thread. To reduce this cost of data delivery between threads, in References 2 and 4, a register value inheritance mechanism at thread generation time is provided through hardware, which copies the entire content in the register file of the parent thread to the child thread at thread generation time. After the child thread is generated, changes in the register values of the parent thread and the child thread are independent, and no passing of data using a register occurs between the threads. As another prior art for data delivery between threads, a parallel processor system has also been proposed, which has a mechanism for transferring register values separately on a register basis through a command.
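The register value inheritance described above can be sketched as follows. This is only a software illustration of the behavior, assuming a register file modeled as a dictionary; the function name `fork_with_register_inheritance` and the register names are hypothetical, and the actual mechanism in References 2 and 4 is implemented in hardware.

```python
def fork_with_register_inheritance(parent_registers):
    # Copy the entire register file of the parent at thread generation time.
    return dict(parent_registers)


parent_regs = {"r1": 10, "r2": 20}
child_regs = fork_with_register_inheritance(parent_regs)

# After the fork, register changes in the two threads are independent.
parent_regs["r1"] = 99
child_regs["r2"] = -1
```

After these updates the child still sees the value of `r1` that was in effect at the fork point, and the parent still sees its own `r2`.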
Although the basic principle of multithreading methods is to execute in parallel preceding threads whose execution has been confirmed, in actual programs there are frequently not enough such threads. Furthermore, the parallelization ratio is kept low by dynamically determined dependencies, limits on the analytical capability of the compiler and the like, so that the desired performance may not be obtained. Thus, in Reference 1, control speculation is introduced to support speculative execution of threads through hardware. Control speculation speculatively executes threads that are highly likely to be executed before their execution is confirmed. A thread in a speculative state is temporarily executed to the extent that cancellation of execution is possible via hardware. A state in which the child thread performs temporary execution is called a temporary execution state, and when the child thread is in the temporary execution state, the parent thread is said to be in a temporary thread generation state. In the child thread in the temporary execution state, writing into shared memory and cache memory is restrained, and writing into an additionally provided temporary buffer is performed. When correctness of the speculation is confirmed, a speculation success notice is issued from the parent thread to the child thread, and the child thread reflects the content of the temporary buffer to the shared memory and the cache memory so as to be in a normal state without using the temporary buffer. In addition, the parent thread shifts from the temporary thread generation state to the thread generation state. On the other hand, if failure of the speculation is confirmed, a thread abort command is executed by the parent thread, and execution of the child and downstream threads is cancelled.
In addition, the parent thread shifts from the temporary thread generation state to a non-thread generation state, such that generation of a child thread is again possible. In other words, although generation of threads is limited to at most once in the one-time fork model, if control speculation is performed and the speculation fails, it is possible to fork again. Even in this case, at most one valid child thread is generated.
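The state transitions of control speculation described above can be sketched in software as follows. This is an illustration under simplifying assumptions, not the hardware mechanism of Reference 1: shared memory is modeled as a dictionary, downstream threads are omitted, and the class and method names (`ParentThread`, `SpeculativeChild`, `speculative_fork`, `confirm`, `abort`) are hypothetical.

```python
class SpeculativeChild:
    def __init__(self, shared_memory):
        self.shared_memory = shared_memory
        self.temp_buffer = {}        # holds writes made in the temporary execution state
        self.speculative = True

    def write(self, address, value):
        # Writes to shared memory are restrained while speculative.
        if self.speculative:
            self.temp_buffer[address] = value
        else:
            self.shared_memory[address] = value

    def on_speculation_success(self):
        # Reflect the temporary buffer to shared memory and enter the normal state.
        self.shared_memory.update(self.temp_buffer)
        self.temp_buffer.clear()
        self.speculative = False

    def on_abort(self):
        # Cancel execution: buffered writes are simply discarded.
        self.temp_buffer.clear()


class ParentThread:
    def __init__(self, shared_memory):
        self.shared_memory = shared_memory
        self.state = "non-thread generation"
        self.child = None

    def speculative_fork(self):
        if self.state != "non-thread generation":
            raise RuntimeError("a child thread already exists")
        self.child = SpeculativeChild(self.shared_memory)
        self.state = "temporary thread generation"
        return self.child

    def confirm(self):
        self.child.on_speculation_success()
        self.state = "thread generation"

    def abort(self):
        self.child.on_abort()
        self.child = None
        self.state = "non-thread generation"   # forking again is now possible


memory = {"x": 1}
parent = ParentThread(memory)

child = parent.speculative_fork()
child.write("x", 2)
value_during_speculation = memory["x"]   # shared memory is not yet modified
parent.abort()                           # speculation failed

child = parent.speculative_fork()        # fork again after the failure
child.write("x", 3)
parent.confirm()                         # speculation succeeded
```

Only the second child becomes valid, so at most one valid child thread is generated even though the parent forked twice.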
To achieve the multithreading of the one-time fork model in which a thread generates a valid child thread at most once in its lifetime, for example, in Reference 4 or the like, at the compilation step where a parallelized program is generated from a sequential processing program, the thread is limited to become a command code in which every thread executes a valid fork only once. That is to say, the one-time fork limitation is statically ensured in the parallelized program.
Meanwhile, in Reference 3, the one-time fork limitation is ensured at the time of execution of the program by selecting a fork command that generates a valid child thread among a plurality of fork commands present in the parent thread while the parent thread is executing.
Next, a conventional parallelized program generation device that is executed in the above-mentioned multithreading method will be described.
Referring to FIG. 25, a sequential processing program 13 is input to a conventional program parallelization device 10. A control/data flow analysis unit 11 analyzes the control flow and the data flow of the sequential processing program 13. Then, based on the result, a fork insertion unit 12 divides the program into units of parallelization, that is, threads, each consisting of a basic block or a plurality of basic blocks, and inserts fork commands for parallelization to generate and output a parallelized program 14 divided into a plurality of threads.
The insertion point of the fork command (fork point) is determined so as to achieve higher parallel execution performance referring to the results of the analysis of the control flow and the data flow. However, sometimes the desired parallel execution performance cannot be achieved with the fork command insertion method based on static analysis due to the influence of program execution flow and memory dependencies that are only revealed at the time of program execution.
In contrast, a method is known in which a fork point is determined with reference to profile information such as conditional branching probability and frequency of data dependency occurrence at the time of sequential execution (see, for example, Japanese Laid-open Patent (Kokai) No. 2001-282549 (hereinafter referred to as Reference 5)). According to this method, a more suitable fork command can be inserted by using the profile information, which is dynamic information obtained at the time of the sequential execution, so an improvement of parallel execution performance can be expected.
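As a rough illustration of how profile information might guide the choice of fork points, the following sketch filters candidates by two profiled quantities. It is not the method of Reference 5: the `ForkCandidate` fields, the `select_fork_points` function, the threshold values and the sample data are all assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class ForkCandidate:
    fork_point: str
    reach_probability: float      # profiled probability that control reaches the fork destination
    dependency_frequency: float   # profiled frequency of data dependencies from parent to child


def select_fork_points(candidates, min_reach=0.9, max_dependency=0.1):
    # Keep candidates whose child thread is likely to be needed and
    # unlikely to depend on data produced by the parent after the fork.
    # Both thresholds are hypothetical.
    return [
        c.fork_point
        for c in candidates
        if c.reach_probability >= min_reach
        and c.dependency_frequency <= max_dependency
    ]


# Hypothetical profile data, invented for illustration only.
profile = [
    ForkCandidate("f0", reach_probability=0.98, dependency_frequency=0.02),
    ForkCandidate("f1", reach_probability=0.55, dependency_frequency=0.01),
    ForkCandidate("f2", reach_probability=0.95, dependency_frequency=0.40),
]
selected = select_fork_points(profile)
```

With this sample data only `f0` passes both thresholds: `f1` is rejected because its destination is reached too rarely, and `f2` because of frequent data dependencies.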
Incidentally, the above-mentioned prior arts have the following problems.
A first problem is that the desired parallel execution performance sometimes cannot be achieved even with conventional techniques, because considerable room for improvement remains in the criteria for determining a fork point. The reason for this is that, while the criterion for determining a fork point is preferably the contribution of the fork command to the parallel execution performance, it is difficult to predict the performance at the time of the parallel execution accurately with an analytical method even if the profile information at the time of the sequential execution is used, such that sometimes a fork command may not be inserted appropriately.
One factor that makes it difficult to predict the parallel execution performance analytically and with high accuracy is the influence of memory dependence on the parallel execution performance. Since the influence of the memory dependence on the parallel execution performance varies intricately according to the parallelization scheme, even if the information on memory dependence is obtained from the profile information at the time of sequential execution, it is difficult to accurately evaluate its influence on the parallel execution performance. Furthermore, constraints of a parallelization method such as the one-time fork model, and hardware configuration such as inter-processor connection also have an influence on the parallel execution performance. However, since this influence likewise varies intricately according to the parallelization scheme, it is difficult to accurately evaluate it from the profile information at the time of sequential execution with an analytical method.
Namely, there is the problem that, since it is difficult to accurately predict the parallel execution performance from the results of the analysis of the control flow and the data flow and the profile information obtained at the time of sequential execution, the fork command cannot be inserted appropriately, such that sometimes the desired parallel execution performance cannot be obtained.
A second problem is that determining the fork point takes longer as attempts are made to obtain a fork point with better parallel execution performance. The reason for this is that, in addition to the evaluation taking longer as the accuracy of the evaluation criteria for determining the fork point is improved, the number of fork points is usually high as compared with, e.g., the number of loops contained in a program, and thus the number of combinations thereof becomes enormous.
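The growth in the number of combinations can be made concrete with a simple count. The sketch below assumes that each fork point candidate is independently either used or not used, which gives an upper bound of 2^n combinations for n candidates; constraints such as the one-time fork model may rule out some of them. The function name is hypothetical.

```python
def count_fork_point_combinations(num_candidates):
    # Each candidate is either selected or not, so there are 2**n subsets.
    return 2 ** num_candidates


small = count_fork_point_combinations(10)   # 1024
large = count_fork_point_combinations(40)   # 1099511627776
```

Even at 40 candidates, evaluating every combination individually is impractical when each evaluation itself takes a non-trivial amount of time.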