The present invention relates to a compiler, control method, and compiler program. Particularly, the present invention relates to a compiler, control method, and compiler program which optimize a program by processing a plurality of tasks in parallel.
In recent years, microprocessors including a plurality of processor cores have been developed. Each of the processor cores can operate independently of the other processor cores and can operate in parallel with them. For example, the POWER5 processor developed by the applicant of the present invention includes two processor cores, which can be operated in parallel. In addition, the CELL BE processor developed by the applicant of the present invention includes eight processor cores, which can be operated in parallel.
FIG. 1 shows a configuration of a distributed memory type multi-core microprocessor 10. The microprocessor 10 includes processor cores 100-1 to 100-N, local memories 110 and DMA engines 120, the latter two of which correspond to the respective processor cores. The respective processor cores 100-1 to 100-N are connected with each other by a common on-chip bus 140. Moreover, the processor cores 100-1 to 100-N are connected to a main memory 1020 through an off-chip bus 150.
The processor core 100-1 reads a program from the local memory 110-1 to execute the program, and accesses data in the local memory 110-1 to advance processing. A processing result is outputted to the main memory 1020 at a predetermined timing. Here, similar to a conventional cache memory, the local memory 110-1 can be accessed at an extremely high speed as compared with the main memory 1020. Moreover, the on-chip bus 140 enables extremely high-speed communications between the local memories 110-1 to 110-N as compared with communications via the off-chip bus 150.
In such a multi-core microprocessor, the performance of the entire processing differs greatly depending on which tasks are executed in what order by each of the processor cores. This is because the storage capacity of each of the local memories 110-1 to 110-N is extremely small as compared with the main memory 1020. In other words, when a certain processing result of a first task is not used in the immediately following second task, the processing result cannot be kept in the local memory and must be saved in the main memory 1020 so as to be used later.
Accordingly, for example, a second task using a processing result of a certain first task is preferably executed consecutively after the first task by the same processor core. Moreover, a third task that uses results produced by the first task while the first task is still being processed is preferably executed in parallel by another processor core during the processing of the first task. Conventionally, no technique has been proposed for determining an efficient task execution order in a system by making use of these characteristics of the above-mentioned distributed memory type multi-core microprocessor.
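The effect of the execution order on traffic to the main memory can be illustrated with a minimal sketch. It assumes, purely for illustration, that the local memory of one processor core can hold only the result of the task that has just finished; the function name `count_spills` and the task names are hypothetical and do not appear in the cited art.

```python
# Hypothetical model: the local memory holds only the result of the task
# that has just finished. A result needed by any later task must be saved
# to the main memory and loaded again.

def count_spills(order, consumers):
    """Count results that must be saved to main memory for a given order.

    order:     list of task names in execution order on one processor core.
    consumers: dict mapping a task to the set of tasks that use its result.
    """
    spills = 0
    for i, task in enumerate(order):
        next_task = order[i + 1] if i + 1 < len(order) else None
        for consumer in consumers.get(task, ()):
            # The result survives in local memory only if the consumer
            # is the very next task on the same core.
            if consumer != next_task:
                spills += 1
    return spills


# Task B uses the result of A, and task D uses the result of C.
consumers = {"A": {"B"}, "C": {"D"}}
interleaved = ["A", "C", "B", "D"]
consecutive = ["A", "B", "C", "D"]

print(count_spills(interleaved, consumers))
print(count_spills(consecutive, consumers))
```

Under this simplified model, the order in which each consumer immediately follows its producer avoids every round trip through the main memory, whereas the interleaved order forces both results to be saved.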
Additionally, as reference techniques, there have been proposed techniques for efficiently executing multiple tasks on the respective processors of a distributed memory type multiprocessor. For example, Y. Kwok and I. Ahmad compare and discuss algorithms for efficiently executing an entire program by analyzing a graph illustrating interdependency relations among the multiple tasks (refer to Y. Kwok and I. Ahmad, “Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors,” ACM Computing Surveys, Vol. 31, No. 4, December 1999).
In a system including existing general multiprocessors, each processor can access a high-capacity memory at a high speed. In contrast, the communication rate between the processors is low. Accordingly, multiple tasks that frequently communicate with each other are executed by the same processor to reduce the volume of communication traffic between the processors. As a result, in some cases frequent task switching is needed in the same processor.
On the other hand, in multi-core processors, multiple processor cores can communicate with each other at a high speed. However, in a case of executing different tasks consecutively by the same processor core, processing efficiency is reduced due to accesses to the main memory. To be more precise, since the local memory is not sufficiently large, the context of the previous task must be saved from the local memory to the main memory, and the context of the next task must be loaded from the main memory onto the local memory.
As mentioned above, a system including multiprocessors and a multi-core processor differ greatly in their characteristics, and the techniques relating to the multiprocessor cannot be directly applied to the multi-core processor.
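The difference between the two kinds of systems can be sketched with a simple cost model. The function `placement_cost` and all cost values below are hypothetical and chosen only to illustrate the trade-off described above; they are not measurements of any actual processor.

```python
def placement_cost(assignment, messages, comm_cost, switch_cost):
    """Estimate the overhead of a task placement.

    assignment:  dict mapping each task to a processor (or core) id.
    messages:    list of (source task, destination task) pairs.
    comm_cost:   assumed cost of one message between different processors.
    switch_cost: assumed cost of each additional task sharing a processor
                 (context save and load).
    """
    cross = sum(1 for src, dst in messages if assignment[src] != assignment[dst])
    tasks_per_proc = {}
    for proc in assignment.values():
        tasks_per_proc[proc] = tasks_per_proc.get(proc, 0) + 1
    switches = sum(n - 1 for n in tasks_per_proc.values())
    return cross * comm_cost + switches * switch_cost


# Two tasks that exchange one message in each direction.
messages = [("A", "B"), ("B", "A")]
colocated = {"A": 0, "B": 0}   # both tasks on the same processor
spread = {"A": 0, "B": 1}      # one task per processor

# Assumed multiprocessor: slow interconnect, cheap task switching.
multiprocessor = dict(comm_cost=10, switch_cost=1)
# Assumed multi-core processor: fast on-chip bus, costly context swap.
multicore = dict(comm_cost=1, switch_cost=10)

for name, costs in (("multiprocessor", multiprocessor), ("multi-core", multicore)):
    print(name,
          placement_cost(colocated, messages, **costs),
          placement_cost(spread, messages, **costs))
```

With these assumed costs, placing both tasks on the same processor is cheaper in the multiprocessor case, while assigning them to different cores is cheaper in the multi-core case, which is why a placement policy designed for one system does not carry over to the other.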
Moreover, as another reference technique, there has been proposed a technique for generating a series-parallel graph in which a group of tasks is combined into a cluster and an interdependency relation between the clusters is indicated (refer to A. Gonzalez Escribano, A. J. C. van Gemund, and V. Cardenoso-Payo, “Mapping Unstructured Applications into Nested Parallelism,” Proceedings of VECPAR 2002, 5th International Conference on High Performance Computing for Computational Science, LNCS 2565, 2003, pp. 407-420). However, in this technique, information such as the time required for execution of each cluster and the time required for communication between the clusters is not used in generation of the graph. Furthermore, although there has been proposed a technique for scheduling the clusters based on the graph (refer to P. Chretienne and C. Picouleau, “Scheduling with communication delays: a survey,” In P. Chretienne, E. G. Coffman Jr., J. K. Lenstra, and Z. Liu, editors, Scheduling Theory and its Applications, chapter 4, pp. 65-90, John Wiley & Sons, 1995), this method assumes that an infinite number of processors is available. In other words, these techniques do not achieve efficient scheduling on the distributed memory type multi-core processor.
Moreover, in order to apply dynamic programming efficiently, it is necessary that an optimal solution of a whole problem be composed of a sum of optimal solutions of partial problems. More specifically, the execution time of a cluster when the cluster is executed independently has to coincide with the execution time of the same cluster when it is executed in parallel or in succession with another cluster. In the multi-core processor, however, the efficiency with which a certain task is executed differs greatly depending on the other task that is executed in parallel or in succession with the certain task. Accordingly, it is difficult to apply this technique directly.
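The additivity condition described above can be checked with a short sketch. The cluster fields, the function names, and the swap penalty of 4 time units are assumptions made only for illustration: a penalty is charged when the second cluster does not consume the output of the first, standing in for the context save and load through the main memory.

```python
def time_alone(cluster):
    """Execution time of a cluster when it is executed independently."""
    return cluster["compute"]


def time_in_succession(first, second, swap_penalty):
    """Total time when `second` follows `first` on the same core.

    Assumed model: no penalty if `second` uses the output of `first`
    (the data is still in local memory); otherwise a context swap
    through the main memory costs `swap_penalty`.
    """
    penalty = 0 if second["uses"] == first["name"] else swap_penalty
    return first["compute"] + penalty + second["compute"]


a = {"name": "a", "compute": 5, "uses": None}
b = {"name": "b", "compute": 3, "uses": "a"}   # consumes the output of a
c = {"name": "c", "compute": 3, "uses": None}  # unrelated to a

# Dynamic programming needs the combined time to equal the sum of the parts.
additive_ab = time_in_succession(a, b, 4) == time_alone(a) + time_alone(b)
additive_ac = time_in_succession(a, c, 4) == time_alone(a) + time_alone(c)
print(additive_ab, additive_ac)
```

In this model the combined time equals the sum of the independent times only for the pair that shares data, so the cost of a partial schedule depends on its neighbors and the sum-of-subproblems assumption does not hold in general.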
Furthermore, MPI (Message Passing Interface) and PVM (Parallel Virtual Machine) are used as other reference techniques (refer to MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/; LAM-MPI, http://www.lam-mpi.org; PVM, http://www.csm.ornl.gov/pvm/pvm_home.html; H. Ogawa and S. Matsuoka, “OMPI: optimizing MPI programs using partial evaluation,” Proceedings of the 1996 ACM/IEEE conference on Supercomputing, November 1996). According to these techniques, an application program can be efficiently operated in parallel in a system including the distributed memory type multiprocessor. However, these techniques provide neither a function of statically analyzing the interdependency relations among the tasks nor a function of analyzing the volume of communication traffic exchanged among the respective tasks.