1. Technical Field of the Invention
The present invention relates to a method and a system for mapping threads or tasks of a parallel program to CPUs of a parallel computer.
2. Background Art
A parallel computer is either a single computer having multiple CPUs or several computers connected by a network such that at least one of them has multiple CPUs.
Parallel programs that exploit such a parallel computer either start several instances of themselves (usually called tasks) or split into subprograms, each handling a sequence of instructions in parallel (usually called threads).
A typical example of a parallel program is shown in FIG. 1. FIG. 1 shows three tasks spawning two threads each. A possible case is that the parallel program has to solve a certain problem for different initial parameters. Each thread is then responsible for solving the problem for one of those parameters. In FIG. 1 each task would solve the problem for two parameters by delegating one parameter to each thread. The threads are supposed to run on different CPUs.
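The structure described for FIG. 1 can be illustrated by the following minimal sketch. It is not taken from the invention itself: the function `solve` is a hypothetical stand-in for the problem to be solved, the parameter values are arbitrary, and the tasks are modeled as outer threads within one process purely to keep the example short, whereas a real parallel program would start them as separate program instances.

```python
import threading


def solve(parameter):
    # Hypothetical stand-in for "the problem": squares the parameter.
    return parameter * parameter


def run_task(task_id, parameters, results):
    # One task delegates one parameter to each of its threads.
    def worker(thread_id, parameter):
        results[(task_id, thread_id)] = solve(parameter)

    threads = [
        threading.Thread(target=worker, args=(thread_id, parameter))
        for thread_id, parameter in enumerate(parameters)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def run_program(parameters_per_task):
    # Assumption: tasks are modeled as outer threads for brevity.
    results = {}
    tasks = [
        threading.Thread(target=run_task, args=(task_id, params, results))
        for task_id, params in enumerate(parameters_per_task)
    ]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return results


# Three tasks, two parameters (and therefore two threads) per task.
results = run_program([[1, 2], [3, 4], [5, 6]])
```

Each result is stored under the pair (task number, thread number), so the six threads never write to the same entry.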
Initial Problem
In general, the mapping of threads or tasks to different computers in the network is done manually by the user, e.g. by writing a file containing the names of the computers. The mapping of threads to CPUs inside those computers is usually done automatically by the operating system.
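A file of computer names of the kind mentioned above might be processed as in the following sketch. The file layout (one name per line, `#` for comments), the host names, and the wrap-around assignment policy are all assumptions made for illustration; actual launchers define their own formats and policies.

```python
def read_hosts(text):
    # One computer name per line; blank lines and '#' comments are skipped.
    hosts = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            hosts.append(name)
    return hosts


def assign_tasks(hosts, task_count):
    # Assumed policy: task i goes to the i-th listed host, wrapping around.
    return {task: hosts[task % len(hosts)] for task in range(task_count)}


# Hypothetical file content written by the user.
HOSTFILE = """\
# computers available to the parallel program
node-a
node-b
node-c
"""

task_to_host = assign_tasks(read_hosts(HOSTFILE), 3)
```

Note that this file only selects the computer for each task; which CPU inside that computer runs a given thread is still left to the operating system.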
If the user wants to improve the performance of the parallel program on the current computer, or wants to simulate the performance of the program on a parallel computer with a different architecture, the automatic mapping must be replaced by a mapping of the user's own.
Replacing the automatic mapping by manual mapping of the tasks or threads to the CPUs of a computer can only be done if the operating system provides a command or a subroutine (system call) that binds a specific task or thread to a specified CPU.
If the goal is performance improvement of the parallel program, the optimal mapping is far from obvious. It is therefore desirable to try different mappings and easily switch from one to the other.
If the user's goal is to optimize performance and the execution time of the program stays within minutes, it is sufficient to run the program once for each of several mappings and compare the results. If, however, the program executes a loop of instructions many times and runs for hours, it is desirable to use the first loop iterations for calibration and to run the remaining loop iterations with the optimal customized map. This, of course, is only possible if the map can be changed at runtime.
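The calibration idea can be sketched as follows, under stated assumptions. The callable `timed_iteration` is hypothetical: it is assumed to apply the given map, execute one loop iteration, and return the elapsed time. The sketch spends exactly one iteration per candidate map, which is the simplest possible calibration and not necessarily the one a real program would use.

```python
def calibrate(candidate_maps, timed_iteration):
    # Run one loop iteration per candidate map and record its duration.
    timings = {
        name: timed_iteration(mapping)
        for name, mapping in candidate_maps.items()
    }
    # The map with the shortest measured iteration is taken as optimal.
    best = min(timings, key=timings.get)
    return best, timings


def run_loop(total_iterations, candidate_maps, timed_iteration):
    # The calibration iterations do real work, so they count toward the total.
    best, _ = calibrate(candidate_maps, timed_iteration)
    remaining = total_iterations - len(candidate_maps)
    for _ in range(remaining):
        timed_iteration(candidate_maps[best])
    return best
```

A caller would pass a dictionary of named candidate maps, for example one that packs two threads onto one CPU and one that spreads them over two CPUs, together with a timing routine built around the binding call of the operating system.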
Prior Art
If the operating system provides a command or a subroutine (system call) that binds a specific task or thread to a specified CPU, the following user actions would be needed.
If the user has the choice and decides to use the command, the user has to find all task or thread identifiers and has to run one command for each task or thread of the parallel program, possibly even logging in on the various nodes where the threads and tasks are running. By the time all this has been done, the parallel program has either already finished or progressed significantly in its run. The remapping of the tasks or threads therefore either does not take effect at all or takes effect too late. In any case, the remapping of the tasks or threads is not done concurrently, but one at a time with various delays in between.
Therefore, the only feasible solution is for each thread or task of the parallel program to issue a system call that binds it to a certain CPU of the node, with all of these calls occurring in parallel. FIG. 2 shows how three tasks spawn two threads each, and each of the threads binds itself to a CPU.
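A thread binding itself to a CPU might look like the following sketch. The text above does not name an operating system; the sketch assumes Linux, where Python exposes the binding system call as `os.sched_setaffinity` and the kernel accepts a thread identifier in place of a process identifier. On platforms without this call the sketch binds nothing. The wrap-around choice of CPU is an assumption added so that the example also runs on machines with few CPUs.

```python
import os
import threading


def bind_self(cpu):
    # Bind the calling thread to a single CPU (assumes Linux semantics,
    # where a native thread id may be passed instead of a process id).
    tid = threading.get_native_id()
    os.sched_setaffinity(tid, {cpu})
    return os.sched_getaffinity(tid)


def worker(thread_id, allowed, bound):
    # Each thread selects and binds its own CPU, so all bindings are
    # issued in parallel rather than one at a time by the user.
    cpu = allowed[thread_id % len(allowed)]
    bound[thread_id] = (cpu, bind_self(cpu))


def run(thread_count):
    bound = {}
    if not hasattr(os, "sched_setaffinity"):
        return bound  # binding call not available on this platform
    allowed = sorted(os.sched_getaffinity(0))
    threads = [
        threading.Thread(target=worker, args=(i, allowed, bound))
        for i in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return bound


bound = run(2)
```

The rule that selects the CPU is written directly into `worker`, which illustrates why any change of the mapping requires a change of the program source.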
Residual Problem
As a consequence of the above-mentioned user actions for manual binding, the user not only has to add the system call to the source code of the parallel program, but also has to add code lines to implement the mapping, i.e. the rules specifying which thread or task is assigned to which CPU.
As a result, programmers usually implement only the simplest mappings. Because every mapping is implemented in the source code, a change requires a full development cycle, which involves all of the following: (1) formulation of a new mathematical function; (2) implementation in the source code; and (3) debugging and testing. This significantly slows down the testing of different mappings and even prevents the mapping from being changed at runtime.
In most cases the implemented assignment is simply thread/task 0 to CPU 0, thread/task 1 to CPU 1, thread/task 2 to CPU 2, and so on. These assignments are shown as a table in FIG. 3.
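The assignment rule just described can be written as a one-line function, as in the following sketch. The function names and the table layout are illustrative only and are not taken from the figures.

```python
def identity_map(count):
    # Thread/task i is assigned to CPU i.
    return {i: i for i in range(count)}


def format_table(mapping):
    # Render the mapping as a two-column text table.
    lines = ["thread/task  CPU"]
    for thread, cpu in sorted(mapping.items()):
        lines.append(f"{thread:>11}  {cpu:>3}")
    return "\n".join(lines)


mapping = identity_map(3)
print(format_table(mapping))
```

Any other assignment, for example one that places neighboring threads on different nodes, would require replacing `identity_map` by a new function in the source code, which is the development cycle described above.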