The present invention relates generally to the field of computers and computer systems. More particularly, the present invention relates to the allocation of processes to individual processors (nodes) in multiprocessor systems.
Modern computer systems with many processors often have non-uniform memory access (NUMA) properties; that is, the cost of accessing data in memory is dependent on the physical location of the memory in relation to the processor which accesses it. As a result, performance improvements can often be gained by running an application on a limited number of processors and allocating memory which is local to those processors, thereby reducing or eliminating the need for costly remote memory accesses. Similarly, multiple threads which frequently access and modify areas of memory shared with other threads can benefit from keeping all users of that memory close together, reducing the cross-node traffic required to obtain cache lines held in the cache of a remote processor. These two considerations are referred to as memory affinity and cache affinity, respectively.
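The cost asymmetry described above can be illustrated with a simple model (a hypothetical sketch for exposition only; the node numbering, latency figures and function name are illustrative assumptions, not part of any embodiment):

```python
# Hypothetical NUMA cost model: accessing memory on a remote node is
# assumed to cost several times more than a local access.
# All figures are illustrative only (arbitrary units).

LOCAL_LATENCY = 100    # assumed cost of an access on the same node
REMOTE_LATENCY = 300   # assumed cost of an access on a different node

def access_cost(cpu_node, memory_node):
    """Cost of one access from a CPU on cpu_node to memory on memory_node."""
    return LOCAL_LATENCY if cpu_node == memory_node else REMOTE_LATENCY

# A process on node 0 performing 1000 accesses: keeping its memory
# local (node 0) is three times cheaper than remote (node 1) under
# this model, which is why co-locating a task with its memory pays off.
local_total = 1000 * access_cost(0, 0)
remote_total = 1000 * access_cost(0, 1)
```

Under such a model, placing both the task and its memory on the same node minimises total access cost, which motivates the affinity-aware placement discussed below.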
Placing processes in order to increase the benefits of memory and cache affinity typically conflicts with the more general desire to balance work across all available resources of the whole system; clearly, placing all work onto a single node and allocating all memory locally will increase cache and memory affinity, but in general will not provide good performance for all workloads, due to the increased contention for resources on that node. It is therefore desirable to identify tasks which can benefit from memory and cache affinity and group them together, such that a group of related tasks will tend to run close together, while unrelated tasks may be placed across other parts of the system.
There are several existing techniques for identifying this grouping, all of which have drawbacks.
1. Have no automatic grouping of tasks performed by the operating system, but allow the user to group tasks and bind them to specific system resources. This approach relies heavily on the user understanding the behaviour of the workloads and the architecture of the system, and is both time-consuming and error-prone. Such manual bindings also typically restrict the operating system's load balancing capabilities, thus making it less responsive to changes in load.
2. Have the operating system attempt to group threads of the same process together, but treat processes as separate entities. This can provide significant benefit for some workloads, as threads of the same process will (in most operating systems) share the same address space and are likely to have considerable overlap in the working set of data used by the threads. However, this approach alone does not account for groupings of multiple processes, which means a significant potential benefit is not catered for.
3. Group all threads and processes based on parent-child relationships. This is the approach described in “An Experimental Evaluation of Processor Pool-Based Scheduling for Shared-Memory NUMA Multiprocessors” by T. Brecht, IPPS '97 Proceedings of the Job Scheduling Strategies for Parallel Processing, ISBN 3-540-63574-2, in which no distinction is made between threads and processes, and each time a new thread/process is created, the allocator attempts to place it close to its parent. However, this can mean that tasks which have no significant relationship to the parent will be placed near it, possibly at the expense of future, more closely related tasks.
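The parent-child placement policy of approach 3 can be sketched as follows (a simplified illustration; the data structures, the per-node capacity and the least-loaded fallback are assumptions introduced for exposition, not details taken from the cited paper):

```python
# Sketch of parent-child affinity placement (approach 3 above).
# Each new task is placed on its parent's node when room remains,
# otherwise on the least-loaded node. Illustrative assumptions only.

node_load = {0: 0, 1: 0}   # number of tasks currently on each node
placement = {}             # task name -> node

NODE_CAPACITY = 4          # assumed per-node task limit

def place_task(task, parent=None):
    if parent is not None and node_load[placement[parent]] < NODE_CAPACITY:
        node = placement[parent]                   # keep child near parent
    else:
        node = min(node_load, key=node_load.get)   # least-loaded fallback
    placement[task] = node
    node_load[node] += 1
    return node

place_task("init")                    # no parent: least-loaded node
place_task("child_a", parent="init")  # co-located with "init"
place_task("child_b", parent="init")  # also co-located, related or not
```

As the sketch makes plain, every child is co-located with its parent regardless of whether any data is actually shared, which is precisely the drawback noted above.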
What is required, therefore, is a means to identify groups of processes that can benefit from cache and memory affinity without suffering from these drawbacks.
It should be noted that the term “multiprocessor” as used herein encompasses dual- and multi-core processor devices, as well as multiple hardware thread and multiple CPU systems.
A system which seeks to address some of the above issues is described in U.S. Pat. No. 5,826,079 which relates to a method for improving the execution efficiency of frequently communicating processes utilising affinity scheduling by identifying and assigning the frequently communicating processes to the same processor. The system is based on counting “wakeup” requests between pairs of processes: a wakeup request occurs when a first process requiring information from a second process is placed in a sleep state until the second process is able to provide the required information, at which point the first process is awoken. A count of the number of wakeup requests between the pair of processes is maintained and, when a predetermined threshold is reached, the two processes are assigned to the same processor for execution. Whilst this allocation can improve performance, the determination is non-optimal, as will be described below.
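The counting scheme summarised above can be sketched as follows (a simplified illustration of the described mechanism; the data structures, the threshold value and the function name are assumptions for exposition, not details of the patented implementation):

```python
# Sketch of wakeup-count affinity scheduling as summarised above.
# A count is kept per pair of processes; once it reaches a threshold,
# both processes are assigned to the same processor. Illustrative only.

from collections import defaultdict

WAKEUP_THRESHOLD = 5   # assumed value; the patent specifies only a
                       # "predetermined threshold"

wakeup_counts = defaultdict(int)   # (process, process) pair -> count
assigned_cpu = {}                  # process -> cpu, once co-scheduled

def record_wakeup(sleeper, waker, cpu_of_waker):
    """Called when 'waker' provides the data 'sleeper' was sleeping on."""
    pair = tuple(sorted((sleeper, waker)))
    wakeup_counts[pair] += 1
    if wakeup_counts[pair] >= WAKEUP_THRESHOLD:
        # Deemed frequently communicating: bind both to one processor.
        assigned_cpu[sleeper] = cpu_of_waker
        assigned_cpu[waker] = cpu_of_waker

for _ in range(5):
    record_wakeup("reader", "writer", cpu_of_waker=2)
```

Note that the decision is purely reactive: the pair is only co-scheduled after the threshold has been crossed, by which time the cost of the remote communication has already been incurred, which contributes to the non-optimality discussed below.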
It is therefore an object of the present invention to provide a means for providing an improved allocation of processes to processors in a multiprocessor system and, in particular, a means capable of identifying and addressing potential conflict issues before they arise.