In multiprocessor architectures, it is desirable to keep a task executing on the same processor as much as possible so as to exploit caching benefits. This is particularly important in a Non-Uniform Memory Access (NUMA) architecture, where inter-cache access latency is significantly higher than intra-cache access latency. In such NUMA architectures, substantial performance degradation occurs if a task is frequently dispatched to processors not sharing the hardware cache.
In order to keep a task executing on the same processor (or a group of processors) as much as possible, a logical concept called affinity nodes is defined. An affinity node is a group of processors sharing a hardware cache. A task is marked with a value (called its affinity) which associates it with an affinity node. A Task Dispatcher tries to honor a task's affinity by always dispatching the task to the processors belonging to the affinity node designated by the task's affinity value.
Because tasks are always dispatched to their respective affinity nodes, over time the tasks' changing characteristics and processing demands will create processor load imbalance among affinity nodes. Therefore, a processor load balancing mechanism is needed to reassign tasks' affinity in order to balance the total system processor consumption across all affinity nodes. Additionally, affinity nodes can contain different numbers of processors therefore each affinity node may have a different capacity. The term “balance,” instead of the often-taken meaning of making processor consumption on all affinity nodes equal, generally means to make processor consumption on all affinity nodes satisfy certain criteria. A common criterion for a balanced system is one where the total system processor consumption is distributed across all affinity nodes in proportion to their capacity.
Existing processor load balancing schemes in production operating systems such as UNIX (and UNIX variants) have one common characteristic: they all use the average task run queue length as an estimation of the processor load. This is because task run queue length (also known as runqueue length) is easy to measure and the majority of today's production operating systems do not have a built-in facility for precise processor consumption measurement on a per task basis. While sufficient for most cases, average task run queue length does not always accurately reflect the true processor load.
Referring to FIG. 1 there is shown a simple example of a processor network 100 with two nodes which will serve to illustrate this concept. The system operates with two nodes 140 and 160, each node containing one CPU; and logical Task Managers 120 and 130. Task Manager 120 dispatches tasks 180 to node A 140 and Task Manager 130 dispatches tasks to node B 160. Because the Task Managers 120 and 130 operate independently, over time the processor load within the network 100 can become unbalanced due to the fact that the number of tasks and the CPU demand characteristics of tasks arriving on nodes 140 and 160 can be very different. Therefore, a Balancer 150 is needed to move tasks between the Task Managers 120 and 130 in order to balance the processor load on nodes 140 and 160. In known processor networks, a Balancer 150 judges the processor load on a node by using the average runqueue length of that node; the longer the queue length, the more loaded the node. However, this method has pitfalls, as we illustrate with examples below.
Assume n+1 tasks 180 enter affinity node A 140; the node uses its full processing capacity for a short period of time t; and then all tasks 180 finish. The processors in affinity node A 140 are then idle for a short period of time t before another n+1 tasks 180 enter the node A, and this cycle repeats until there are no more tasks in the queue with affinity A. For node A 140, on average the processor load is about 50% (i.e., half of the time the processor is busy and half of the time the processor is idle) and the runqueue length is about n/2 (half of the time there are n tasks waiting and half of the time there are no tasks waiting).
Now consider another affinity node, node B 160, where a single long-running processor bound task uses the full processing capacity of node B and no other tasks are waiting. For this node, on average the processor load is 100% (the processor is always busy) yet the runqueue length is zero (there are no tasks waiting). The Balancer 150, using the average runqueue length method, will move tasks from the half-loaded node A 140 to the fully-loaded node B 160, which would further unbalance the workload.
Another problem with the average task run queue length approach is that, when a task is moved to balance the processor load, no consideration is given to the actual processor consumption of the moved task, which can lead to further unbalancing of the system. Another example will illustrate this. Referring again to FIG. 1, consider that affinity node A 140 now has a long-running CPU bound task occupying the CPU in addition to the n+1 tasks 180 periodically entering the node. For this node, on average the processing load is 100% and the runqueue length is about (n+1)/2. Now consider affinity node B 160 with the same single long-running CPU bound task. For node B 160, on average the processing load is 100% and the runqueue length is 0. This system is clearly unbalanced and tasks need to be moved from node A 140 to node B 160. The Balancer 150 generally selects the task(s) to be moved by first moving the tasks 180 at the end of the task run queue because tasks at the end of the queue have the longest wait time. Using this method without taking into consideration the processor consumption of the tasks 180 to be moved can potentially unbalance the system further because the long-running CPU bound task on node A 140 can happen to be the last task on the queue and may be moved to node B 160. As a result, node A 140 will again be idle half of the time (as illustrated with the previous example when node A 140 did not have the long-running CPU bound task) while node B 160 will be overloaded with two long-running CPU bound tasks, each getting only 50% of the processor capacity.
FIG. 2 shows another representation of load balancing in a processor architecture 200 with multiple processors. The figure shows a typical NUMA processor architecture with the two affinity nodes of FIG. 1, plus two more affinity nodes, each node now with four CPUs instead of one. In practice, both the number of affinity nodes and the number of CPUs in each node can be different. The four CPUs in each node share L2 cache (Level 2 cache, or cache memory that is external to a microprocessor). L2 cache memory resides on a separate chip from the microprocessor chip, as opposed to Level 1 cache which resides on the microprocessor chip. An L2 cache is necessary in order for multiple CPUs to share cache memory. In this processor architecture 200 the Task Managers 120 and 130 of FIG. 1 are now actually four dispatchers, each dispatcher responsible for one node. Each dispatcher receives a job queue of tasks for its particular node. Dispatcher 225 dispatches the tasks from job queue A 220 to node A 140; Dispatcher 235 dispatches the tasks from job queue B 230 to node B 160; Dispatcher 245 dispatches the tasks from job queue C 240 to node C 250; and Dispatcher 255 dispatches the tasks from job queue D 250 to node D 260.
Each node is connected through the network 200 to every other node in the network for load balancing purposes. The architecture 200 as shown in FIG. 2 is quite a simple multi-processor architecture, yet it is scalable to a large degree. In fact, with today's ever-increasing processing needs, networks of thousands and even tens of thousands of processors are in use, making CPU load balancing imperative, yet current load-balancing algorithms fall short of reaching an optimal balance because average runqueue length does not always accurately reflect the true processor load.
Therefore, there is a need for a processor load-balancing method to overcome the shortcomings of the prior art.