Scheduling is the act of time-sharing resources among multiple resource requesters. In a computer system, tasks are scheduled to use processing time on the available computing resources. Scheduling can be driven by various constraints. For example, tasks may have to be scheduled to meet certain deadlines, or to use processing resources efficiently in order to increase throughput. The emergence of parallel computing systems introduces additional scheduling challenges. In a uniprocessor system, scheduling consists of sequencing tasks on a single processor, whereas in a multiprocessor system tasks have to be distributed across multiple processors to speed up the execution of the program.
However, in a parallel computer system, the processors can have non-uniform memory access times. As a consequence, the execution time of a task can depend on the processor used and on that processor's access time to the data used by the task. For instance, memory access times can be higher if a task is mapped to a processor that is located remote from the data it uses than if it is mapped to a processor that is close to that data.
By means of a scheduler, tasks are distributed at runtime to the different processor cores of the processors. To achieve high performance, tasks may be executed on processor cores that already hold the data used by the task in their respective cache memory. Otherwise, the data first has to be loaded, which takes additional time. This is particularly relevant in a multiprocessor system with distributed memory, e.g. a non-uniform memory access (NUMA) system. In such a system, the data may under certain circumstances have to be loaded from a remote memory via a communication network, which can significantly reduce the performance of the system.
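The dependence of execution time on data placement described above can be illustrated with a minimal sketch. The cost model, the latency constants and the names `estimated_time_ns` and `pick_node` are assumptions chosen for illustration only; they do not describe any particular scheduler or hardware.

```python
# Hypothetical cost model: all names and numbers are illustrative assumptions.
LOCAL_ACCESS_NS = 100    # assumed latency for memory attached to the same node
REMOTE_ACCESS_NS = 300   # assumed latency for memory reached over the interconnect


def estimated_time_ns(task, node):
    """Estimate the run time of `task` on `node` as compute time plus memory access time."""
    latency = LOCAL_ACCESS_NS if task["data_node"] == node else REMOTE_ACCESS_NS
    return task["compute_ns"] + task["accesses"] * latency


def pick_node(task, nodes):
    """Choose the node with the lowest estimated run time for this task."""
    return min(nodes, key=lambda node: estimated_time_ns(task, node))


# A task whose data resides in the memory of node 1.
task = {"compute_ns": 5_000, "accesses": 40, "data_node": 1}
local_ns = estimated_time_ns(task, 1)    # executed next to its data
remote_ns = estimated_time_ns(task, 0)   # executed on the other node
best = pick_node(task, [0, 1])
```

Under these assumed numbers the same task takes 9,000 ns next to its data and 17,000 ns on the remote node, so a scheduler that knows the data location would select node 1.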
A conventional way to avoid such performance losses is to use heuristics in the scheduler which, for instance, ensure that child tasks are executed on the same processor cores as their respective parent tasks, as described for instance in Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou, “Cilk: An Efficient Multithreaded Runtime System”, Symposium on Principles and Practice of Parallel Programming (PPOPP), ACM, 1995. However, heuristics of this kind fail, for instance, when the same data is accessed several times by sequential loops. The reason is that these heuristics have no information about the location of the data in the cache memories or in the main memory of the computer system. To overcome this problem, some libraries offer mechanisms that take the data location into account for simple loops and use this information for scheduling. An example of such a concept is the “affinity partitioner” provided by Threading Building Blocks, a C++ library from Intel for parallel programming, as described under http://threadingbuildingblocks.org. However, mechanisms of this kind cannot be used, for instance, for recursive computations or algorithms.
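The effect of repeated sequential loops over the same data can be illustrated with a small simulation. It is a sketch under simplifying assumptions (unbounded per-core caches, a fixed number of chunks, and a placement offset that stands in for scheduling variation); the function names are hypothetical, and the code is only loosely modelled on the idea of remembering loop placement, not on the actual API or implementation of any library.

```python
def run_pass(assignment, caches):
    """Execute one loop pass; return how many chunks were already cached on their core."""
    hits = 0
    for chunk, core in assignment.items():
        if chunk in caches[core]:
            hits += 1
        caches[core].add(chunk)  # the chunk is now resident on that core
    return hits


def location_blind(chunks, num_cores, offset):
    """Distribute chunks without regard to earlier placement (offset models scheduling variation)."""
    return {chunk: (chunk + offset) % num_cores for chunk in chunks}


def second_pass_hits(remember_placement, num_chunks=8, num_cores=4):
    """Run two sequential loops over the same chunks and count cache hits in the second loop."""
    chunks = range(num_chunks)
    caches = [set() for _ in range(num_cores)]  # assumption: caches never evict
    first = location_blind(chunks, num_cores, offset=0)
    run_pass(first, caches)
    # Either replay the first placement or distribute the chunks blindly again.
    second = first if remember_placement else location_blind(chunks, num_cores, offset=1)
    return run_pass(second, caches)


blind_hits = second_pass_hits(remember_placement=False)
replay_hits = second_pass_hits(remember_placement=True)
```

In this simplified model the location-blind second loop finds none of its eight chunks in the local cache, whereas replaying the placement of the first loop finds all eight. The replay works here only because the loop has a flat, repeatable iteration space, which is consistent with the limitation noted above for recursive computations.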
Another conventional approach is to include explicit data location information in the source code of the application, which then influences the scheduling. To this end, the software developer has to indicate where specific data is read or modified. A significant disadvantage of this conventional approach is that the developer has to encode the necessary operations within the source code. This increases the complexity of the source code and makes the developed software more difficult to maintain.
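The burden that explicit annotations place on the developer can be seen in the following sketch. The class `AnnotatedScheduler`, its `submit` method and the `reads`/`writes` parameters are hypothetical and serve only to illustrate the approach; no existing runtime is implied, and the task is simply called in place rather than dispatched to a core.

```python
class AnnotatedScheduler:
    """Hypothetical scheduler that relies on developer-supplied data annotations."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.last_core = {}   # data name -> core that touched it most recently
        self.next_core = 0    # round-robin fallback for data not seen before

    def submit(self, func, reads=(), writes=()):
        """Run `func` and return the core chosen from the declared reads and writes."""
        core = None
        for name in (*reads, *writes):
            if name in self.last_core:
                core = self.last_core[name]  # follow the data to its last core
                break
        if core is None:
            core = self.next_core
            self.next_core = (self.next_core + 1) % self.num_cores
        for name in (*reads, *writes):
            self.last_core[name] = core
        func()  # a real runtime would dispatch to `core`; here the task runs in place
        return core


sched = AnnotatedScheduler(num_cores=4)
grid = [1.0, 2.0, 3.0]
halo = [0.0, 0.0]

# Every call site must restate which data the task reads and writes.
core_a = sched.submit(lambda: grid.reverse(), writes=["grid"])
core_b = sched.submit(lambda: halo.append(grid[0]), reads=["grid"], writes=["halo"])
core_c = sched.submit(lambda: halo.clear(), writes=["halo"])
core_d = sched.submit(lambda: None, writes=["other"])
```

The three tasks that share `grid` or `halo` are placed on the same core, but only because each call site repeats the data names by hand. An annotation that is missing or out of date would silently lead to a poor placement, which is the maintenance problem described above.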
Accordingly, there is a need for a method and an apparatus for scheduling tasks in a parallel computing system with several processor cores that increase the performance or throughput of the computing system without increasing the complexity of the source code.