In recent years, the performance of programs running on a single processor has been approaching its limits. In order to improve performance, the clock frequency of the processor may be increased to raise the processing volume per unit time, or instructions may be executed in parallel to increase the number of simultaneously executed processes.
An increase in clock frequency, however, gives rise to an increase in power consumption. Also, there is a physical limit as to how far the clock frequency can be increased. Further, the instruction-level parallelism of a general program is limited to about 2 to 4 (Non-Patent Document 1). Although parallelism may be increased by introducing speculative execution, such an increase is also known to have its own limits.
Against this background, attention has been focused on a method that parallelizes a program at a granularity coarser than the instruction level for execution by a plurality of processors to improve processing performance. There is no known standardized method, however, that converts a sequential program having a large number of control branches into a viable parallelized program.
Major program parallelization methods are a data-level parallelization method with a focus on loops and a speculative thread execution method with a focus on control.
Patent Document 1 discloses analyzing data dependence relations in a loop, dividing an array, and allowing loop processes to be executed by plural processors. This method is effective when there are many regular loop processes such as numerical computations.
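The general idea of loop-oriented, data-level parallelization can be illustrated with a minimal Python sketch. It does not reproduce the method of Patent Document 1. The function names (split_array, scale_chunk, parallel_scale) and the doubling operation are assumptions made for illustration, and worker threads merely stand in for the plural processors. The loop body is chosen so that no iteration depends on another, which is the property that allows the array to be divided.

```python
from concurrent.futures import ThreadPoolExecutor


def split_array(values, parts):
    """Divide a list into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(values), parts)
    chunks = []
    begin = 0
    for i in range(parts):
        # The first `extra` chunks receive one additional element.
        end = begin + size + (1 if i < extra else 0)
        chunks.append(values[begin:end])
        begin = end
    return chunks


def scale_chunk(chunk):
    """Loop body applied to one chunk; iterations are independent of each other."""
    return [2 * x for x in chunk]


def parallel_scale(values, workers=4):
    """Process each chunk on a separate worker and concatenate results in order."""
    chunks = split_array(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = list(pool.map(scale_chunk, chunks))
    return [y for part in results for y in part]
```

A loop whose iterations read values written by earlier iterations could not be divided this simply, which is why such methods depend on analyzing data dependence relations first.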
Patent Document 2 discloses focusing attention on branches appearing in a sequential program and converting these branches into speculative thread executions. Since this method parallelizes a program based on control flow, it may not be able to sufficiently extract the parallelism that potentially exists in the program. Further, multiprocessors having no mechanism for speculative thread execution may suffer a large rollback cost at the time of prediction failure. This method is thus not suitable for an application in which a rate of successful branch prediction is low.
Accordingly, it is preferable to provide a method that parallelizes a sequential program of a vast scale to generate a non-speculative multi-thread program (i.e., parallelized program) that effectively runs on multiprocessors. A parallelized program generated in such a manner may need to take into account an issue of waiting time caused by dependence relations between threads as will be described in the following.
Methods that control thread execution in a parallelized program may include a method of executing threads in parallel by calling procedures as asynchronous remote calls, a method of executing threads in parallel by transmitting messages indicative of start of execution to procedures, a method of executing threads by utilizing a shared memory between threads to exchange input/output variables, etc. In these methods, a first procedure (i.e., thread) may produce an execution result that is used by a second procedure. In such a case, an instruction to wait for the completion of the first procedure and an instruction to execute the second procedure may be arranged at proper locations in the program by taking into account the length of time required for executing other procedures and the like. If the first procedure is completed earlier than expected, a needless waiting time may occur until the start of execution of the second procedure.
FIG. 1 is a drawing for illustrating the occurrence of a needless waiting time. In FIG. 1, four processors PROCESSOR-0 through PROCESSOR-3 are used. PROCESSOR-0 executes a thread control program 1, which is a program for controlling each thread as to its execution and a wait for completion of execution. In an example illustrated in FIG. 1, PROCESSOR-0 successively requests PROCESSOR-1 through PROCESSOR-3 to execute procedures A through C (i.e., start A( ) to start C( )), respectively. PROCESSOR-0 then waits for the completion of procedure A (i.e., wait A( )) before requesting the execution of procedure D (i.e., start D( )) that is to use the result of execution of procedure A. PROCESSOR-0 then waits for the completion of procedure B (i.e., wait B( )) before requesting the execution of procedure E (i.e., start E( )) that is to use the result of execution of procedure B. PROCESSOR-0 then waits for the completion of procedure C (i.e., wait C( )) before requesting the execution of procedure F (i.e., start F( )) that is to use the result of execution of procedure C.
In this example, a needless wait occurs between the completion of procedure C and the request of execution of procedure F. This is because the wait for the completion of procedure B (i.e., wait B( )) and the request of execution of procedure E (i.e., start E( )) are situated before the wait for the completion of procedure C (i.e., wait C( )) and the request of execution of procedure F (i.e., start F( )) in the thread control program. Because of this instruction sequence, the wait for the completion of procedure C and the request of execution of procedure F are not performed until the completion of procedure B.
This instruction sequence is based on an expectation that procedure B will be completed before procedure C. If it is known in advance that procedure C will be completed before procedure B, the wait for the completion of procedure C and the request of execution of procedure F may be placed before the wait for the completion of procedure B and the request of execution of procedure E. In reality, however, the time required for procedure execution depends on the contents of processed data and the like, so that in many cases it may be impossible to accurately predict the completion time. Accordingly, the above-noted methods that utilize simplistic remote procedure calls, shared-memory-based threads, message transmissions, and the like may not be able to eliminate a waiting time as illustrated in FIG. 1.
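The waiting time described above can be reproduced with a small timing model. The following Python sketch is an illustration only and is not part of the thread control program 1. The function name, the assumption that a start instruction takes no time, and the execution times are all hypothetical, and the times are chosen so that procedure C completes before procedure B.

```python
def run_fixed_order(durations, control_sequence):
    """Model a control processor issuing start/wait instructions in program order.

    durations: procedure name -> assumed execution time
    control_sequence: list of ("start", name) or ("wait", name)
    Returns (started, finished): the request time and completion time of each procedure.
    Assumes the target processor is free whenever a start is issued.
    """
    now = 0  # current time on the control processor
    started, finished = {}, {}
    for op, name in control_sequence:
        if op == "start":
            started[name] = now
            finished[name] = now + durations[name]
        elif op == "wait":
            # The control processor blocks until the procedure has completed.
            now = max(now, finished[name])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return started, finished


# Hypothetical execution times: C (3) completes before B (6).
durations = {"A": 2, "B": 6, "C": 3, "D": 1, "E": 1, "F": 1}

# Instruction order as in FIG. 1.
fig1_order = [
    ("start", "A"), ("start", "B"), ("start", "C"),
    ("wait", "A"), ("start", "D"),
    ("wait", "B"), ("start", "E"),
    ("wait", "C"), ("start", "F"),
]

started, finished = run_fixed_order(durations, fig1_order)
needless_wait = started["F"] - finished["C"]  # 6 - 3 = 3 time units
```

With these assumed times, procedure F is requested at time 6 although procedure C completed at time 3. Placing the wait C( ) and start F( ) pair before the wait B( ) and start E( ) pair would move the request for F to time 3, but only because the execution times were known in advance.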
The applicant of the present application has developed a method that specifies dependence relations between procedures as execution conditions on a procedure-specific basis. For the control of execution of threads in a parallelized program, the procedures are entered into an execution queue, and are executed once their corresponding execution conditions are satisfied. Such a method is referred to as an asynchronous remote procedure call method with a dependence-relation-based wait.
FIG. 2 is a drawing illustrating the control of procedure execution by use of the asynchronous remote procedure call method with a dependence-relation-based wait. In FIG. 2, four processors PROCESSOR-0 through PROCESSOR-3 are used. PROCESSOR-0 executes a thread control program 2, which is a program for controlling each thread as to its execution and dependence relations. In so doing, PROCESSOR-0 executes a procedure call program 3 to control the procedures defined in the thread control program 2 by use of queues corresponding to the processors.
In the example illustrated in FIG. 2, procedure A is entered into an execution queue 4 of PROCESSOR-1 in accordance with the instruction “start A( )” in the control program 2. Further, procedure B is entered into an execution queue 5 of PROCESSOR-2 in accordance with the instruction “start B( )” in the control program 2. Moreover, procedure C is entered into an execution queue 6 of PROCESSOR-3 in accordance with the instruction “start C( )” in the control program 2.
Similarly, procedures D, E, and F are entered into the execution queues 4 through 6, respectively, in accordance with the instructions “start D( )”, “start E( )”, and “start F( )” in the control program 2. The thread control program 2 includes the instruction “dep(x, y, . . . )” that specifies dependence relations, and, in this instance, indicates that procedure x depends on procedure y and the others listed. Namely, this instruction specifies that the executions of procedure y and the others listed need to be completed before the execution of procedure x. In accordance with the instruction “dep(D, A)” in the control program 2, dependence of procedure D on procedure A is registered in the execution queue 4 of PROCESSOR-1. In accordance with the instruction “dep(E, A, B)” in the control program 2, further, dependence of procedure E on procedures A and B is registered in the execution queue 5 of PROCESSOR-2. In accordance with the instruction “dep(F, A, C)” in the control program 2, moreover, dependence of procedure F on procedures A and C is registered in the execution queue 6 of PROCESSOR-3.
In this manner, procedures entered into the execution queues provided for the respective processors are executed by corresponding processors in sequence as defined by positions in the queues. In so doing, procedures for which no dependency is registered (i.e., procedures indicated as “NULL” in FIG. 2) are unconditionally executed. Procedures for which dependency is registered are executed upon detecting the completion of execution of referenced procedures. The provision of a queue for each processor and the successive execution of procedures for which execution conditions are satisfied (i.e., executable procedures) make it possible to eliminate the waiting time as illustrated in FIG. 1, for example.
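The queue behavior described above can be approximated by the following Python timing model. It is a sketch under stated assumptions, not the procedure call program 3: the function name, the data layout, and the execution times are hypothetical, and entering a procedure into a queue is assumed to take no time. Each processor takes procedures from its own queue in order and starts one as soon as the processor is free and all procedures it depends on have completed.

```python
def run_dependency_queues(queues, deps, durations):
    """Model per-processor execution queues with registered dependence relations.

    queues: processor name -> list of procedures in queue order
    deps: procedure -> tuple of procedures that must complete first
          (a missing entry corresponds to "NULL", i.e. no dependency)
    durations: procedure -> assumed execution time
    Returns (start, finish) time dictionaries.
    """
    start, finish = {}, {}
    free_at = {proc: 0 for proc in queues}   # time each processor becomes idle
    position = {proc: 0 for proc in queues}  # index of the next queue entry
    remaining = sum(len(q) for q in queues.values())
    while remaining:
        progressed = False
        for proc, queue in queues.items():
            if position[proc] >= len(queue):
                continue
            name = queue[position[proc]]
            needed = deps.get(name, ())
            if all(d in finish for d in needed):
                # Executable once the processor is free and all referenced
                # procedures have completed.
                ready = max((finish[d] for d in needed), default=0)
                start[name] = max(free_at[proc], ready)
                finish[name] = start[name] + durations[name]
                free_at[proc] = finish[name]
                position[proc] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise RuntimeError("dependence relations can never be satisfied")
    return start, finish


# Queue contents and dependence relations as in FIG. 2; times are hypothetical.
queues = {"PROCESSOR-1": ["A", "D"], "PROCESSOR-2": ["B", "E"], "PROCESSOR-3": ["C", "F"]}
deps = {"D": ("A",), "E": ("A", "B"), "F": ("A", "C")}
durations = {"A": 2, "B": 6, "C": 3, "D": 1, "E": 1, "F": 1}

start, finish = run_dependency_queues(queues, deps, durations)
# F starts at time 3, immediately after C completes, regardless of when B completes.
```

Because the condition for procedure F refers only to procedures A and C, the late completion of procedure B no longer delays it, and no advance knowledge of execution times is needed.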
As described above, the use of the asynchronous remote procedure call method with a dependence-relation-based wait makes it possible to prevent the occurrence of a needless waiting time at the time of parallelized program execution, for example. Accordingly, when a sequential program of a vast scale is to be parallelized to generate a non-speculative parallelized program that effectively runs on multiprocessors, it is preferable to generate a parallelized program that is suitable for the asynchronous remote procedure call method with a dependence-relation-based wait as described above.
The applicant of the present application has already developed a parallelized program generation method that is applicable to the asynchronous remote procedure call method with a dependence-relation-based wait. In this parallelized program generation method, the sequence in which program instructions are executed is analyzed to produce basic blocks, each of which is composed of nodes that are sequentially executed without including branches (e.g., IF, GOTO, LOOP, and so on) or merging. Procedures having dependence relations with each other within the same basic block are executed by use of asynchronous remote procedure calls with a dependence-relation-based wait. As for dependence relations between procedures across different basic blocks, a subsequent procedure is executed after waiting for the completion of a preceding procedure. With such a configuration, the generation of control programs is made easier by implementing procedure execution based on a wait mechanism with respect to complex control dependence relations between basic blocks, and, also, a needless waiting time is eliminated by use of an asynchronous remote procedure call with a dependence-relation-based wait within the same basic block, in which the execution sequence is fixed.
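The notion of a basic block can be illustrated by the following Python sketch, which partitions a straight list of instructions at branches and merge points. The textual instruction format, the LABEL keyword used to mark a merge point, and the function name are assumptions made for illustration. The sketch follows the common compiler convention in which a branch instruction terminates the block it belongs to; the generation method described above may represent branches differently, for example as separate control nodes.

```python
BRANCH_OPS = {"IF", "GOTO", "LOOP"}  # instructions that end a block (assumed set)
MERGE_OP = "LABEL"                    # hypothetical marker for a merge point


def split_basic_blocks(instructions):
    """Split a list of non-empty instruction strings into basic blocks."""
    blocks = []
    current = []
    for instr in instructions:
        op = instr.split()[0]
        if op == MERGE_OP and current:
            # Control flow may merge here, so a new block must begin.
            blocks.append(current)
            current = []
        current.append(instr)
        if op in BRANCH_OPS:
            # Control flow may diverge after a branch, so the block ends here.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


program = [
    "a = 1",
    "b = a + 1",
    "IF b > 0 GOTO L1",
    "c = 2",
    "LABEL L1",
    "d = 3",
]
blocks = split_basic_blocks(program)
# Three blocks: the first three instructions, then "c = 2", then the label and "d = 3".
```

Within each resulting block the execution sequence is fixed, which is the property that makes a dependence-relation-based wait applicable inside the block.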
In the parallelized program generation method described above, data transfer between processors across different basic blocks may always be performed by a control processor (e.g., PROCESSOR-0 in FIG. 2) or by a data transfer unit operating under the control of the control processor. Namely, data is first transferred from a first processor performing a procedure to the control processor (or the data transfer unit), and is then transferred from the control processor (or the data transfer unit) to a second processor performing a procedure. This arrangement is used because the central control of operations by the control processor is a relatively easy way to achieve proper data transfer under conditions in which the data to be transferred may differ depending on the results of a condition check in the original sequential program, and in which the execution of procedures in sequence may have dependence relations. Such a configuration in which the control processor intervenes for each data transfer, however, makes program execution inefficient, thereby creating needless delays in the execution of processes. Accordingly, it is preferable to perform data transfer across basic blocks directly between procedure-executing processors without using an intervening control processor, thereby attaining efficiency in parallelized program execution.
[Patent Document 1] Japanese Patent No. 3028821
[Patent Document 2] Japanese Patent No. 3641997
[Non-Patent Document 1] David W. Wall, “Limits of Instruction-Level Parallelism,” Proceedings of the fourth international conference on Architectural support for programming languages, pp. 176-188, May 1991.
[Non-Patent Document 2] S. Horwitz, J. Prins, and T. Reps, “Integrating non-interfering versions of programs,” ACM Transactions on Programming Languages and Systems, vol. 11, no. 3, pp. 345-387, 1989.
[Non-Patent Document 3] Jeanne Ferrante, Karl J. Ottenstein, Joe D. Warren, “The Program Dependence Graph and Its Use in Optimization,” ACM Transactions on Programming Languages and Systems, vol. 9, no. 3, pp. 319-419, July 1987.
[Non-Patent Document 4] Susan Horwitz, Jan Prins, Thomas Reps, “On the adequacy of program dependence graphs for representing programs,” Proceedings of the 15th Annual ACM Symposium on the Principles of Programming Languages, pp. 146-157, January 1988.
[Non-Patent Document 5] Nakata Ikuo, “Configuration and Optimization of Compiler,” Asakura Shoten, 1999.