As computer systems grow more powerful, there are more resources to handle an ever-increasing number of computing jobs or processes. By way of example, there are computer systems (stand-alone systems and/or networked systems) provisioned with hundreds of processors (CPUs), multiple database connections, multiple disk channels, and/or a large number of network links. These computer systems are required to handle the computing needs of modern enterprises, which may involve handling a large number of processes concurrently.
Generally speaking, when multiple processes are present, these processes compete against each other for the available resources, and there is a corresponding need to schedule processes for execution on those resources. Take the case of CPU resources, for example. Although the discussion herein focuses on CPU resources to simplify the discussion, it should be borne in mind that the problems discussed and the solutions offered herein are not limited to CPUs but are applicable to any resource that needs to be shared by different processes. One of the simplest ways to implement scheduling is to employ a single global queue to dispatch the next process to any CPU that can satisfy some predefined rule for fair allocation. However, the single global point of access can become contentious, particularly for a large computer system with a large number of CPUs, and performance may suffer. The single global queue approach is also particularly difficult to scale since changed global data must be communicated to all CPUs in the system. As the number of CPUs or processes increases, the performance penalty of the single global queue approach becomes prohibitive.
One way to avoid the performance penalty associated with the single global queue approach is to employ multiple local queues, e.g., by provisioning each CPU with a local queue, and to manage each local queue independently. This decoupled approach tends to be more efficient because its processing overhead is low, but fairness frequently suffers.
One decoupled scheduling approach is round-robin scheduling. In pure round-robin scheduling, processes are assigned to the next CPU queue in a circular fashion. Thus, if there are 10 CPUs, the first process will be assigned to the first CPU, the second process will be assigned to the second CPU, and so forth. After the last CPU is reached with the tenth process, the eleventh process is assigned to the first CPU again, hence the name round-robin.
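The circular assignment described above can be sketched in a few lines of Python. This is a minimal illustration only, not an implementation taken from any particular scheduler; the function name `round_robin_assign` and the use of zero-based process and CPU indices are assumptions made for the example.

```python
def round_robin_assign(num_processes, num_cpus):
    """Place processes on per-CPU queues in pure round-robin order.

    Hypothetical helper for illustration. Processes and CPUs are numbered
    from 0, so the "first" process is 0 and the "first" CPU is 0.
    """
    queues = [[] for _ in range(num_cpus)]
    for pid in range(num_processes):
        # The modulo wraps back to CPU 0 after the last CPU is reached.
        queues[pid % num_cpus].append(pid)
    return queues


queues = round_robin_assign(num_processes=11, num_cpus=10)
for cpu, queue in enumerate(queues):
    print(f"CPU {cpu}: processes {queue}")
```

With 10 CPUs and 11 processes, the eleventh process (index 10) lands on the first CPU queue alongside the first process, while every other queue holds exactly one process.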
A priority group refers to a plurality of processes having certain commonalities such that these processes can be grouped together and prioritized similarly for execution purposes. Priority grouping capability is offered by many vendors of schedulers, and many customers demand this capability. Having priority groups adds an extra dimension of complexity. If each CPU queue held one member of each group, then fairness would be easy. Each CPU could be handled with complete independence and achieve perfect fairness. As long as the group has at least one representative on each CPU queue, the algorithm is fair from the point of view of the group. But on machines with higher CPU counts, not every priority group will have enough processes to populate every CPU. This increasingly common sub-case is called being under-committed. The round-robin approach tends to suffer a significant lack of fairness when priority groups are under-committed.
For example, if a group is entitled to 5% of a 100 CPU system but has only 5 jobs running, it will expect to receive its 5% because doing so is physically possible: 5 jobs can fully occupy 5 of the 100 CPUs. In the unmodified round-robin scheme, every group starts distribution on the same CPU 0 and counts forward. Therefore, even in the under-committed case, the first few CPUs are likely to have members of every group and the later CPUs will be very sparsely populated. The net effect of this in the completely decoupled queue scheme is that the group would get only 5% of those 5 CPUs, or 0.25% of the system. Jobs on the fully loaded CPUs would follow the fairness rules, while jobs that have a CPU to themselves take everything on that CPU. Some CPUs might go altogether idle. All the groups would underachieve their goals.
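The arithmetic of the under-committed case can be checked with a short Python sketch. Only the 5% group with 5 jobs on a 100 CPU system comes from the example above; the other two groups, their entitlements, and their job counts are hypothetical values chosen so that the entitlements sum to 100%. The rule that each CPU divides its time among the groups present in proportion to their entitlements is likewise an assumption about how a fully decoupled queue would behave.

```python
def per_group_share(groups, num_cpus):
    """Estimate the fraction of the whole system each group receives.

    groups maps a group name to (entitlement_percent, job_count).
    Hypothetical model: every group is placed round-robin starting at
    CPU 0, and each CPU splits its time among the groups present on it
    in proportion to their entitlements.
    """
    present = [set() for _ in range(num_cpus)]
    for name, (_, jobs) in groups.items():
        for j in range(jobs):
            present[j % num_cpus].add(name)

    cpus_received = {name: 0.0 for name in groups}
    for cpu_groups in present:
        if not cpu_groups:
            continue  # nothing was placed here, so this CPU sits idle
        total = sum(groups[g][0] for g in cpu_groups)
        for g in cpu_groups:
            cpus_received[g] += groups[g][0] / total
    return {name: c / num_cpus for name, c in cpus_received.items()}


# Group A is the 5% group from the text; B and C are assumed.
groups = {"A": (5, 5), "B": (45, 20), "C": (50, 30)}
shares = per_group_share(groups, num_cpus=100)
for name, share in shares.items():
    print(f"group {name}: entitled to {groups[name][0]}%, receives {share:.2%}")
```

Under these assumptions, group A shares CPUs 0 through 4 with both other groups and receives about 0.25% of the system, every group falls short of its entitlement, and CPUs 30 through 99 receive no work at all.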
As a further example, assume that the reason for limiting the smallest group to 5% was that it was greedy and needed to be held back or it would take over the system. If the group contains enough processes to occupy every CPU, this group, which is supposed to be limited, could take over 53% of the total system. There are many corner cases which demonstrate that the isolated pure round-robin solution is not fair.
Other approaches such as batch scheduling, periodic rebalancing, credit/debit schemes, idle stealing, or Robin Hood scheduling also suffer from various deficiencies in fairness and/or performance, particularly when priority groups are involved. Many of these approaches require active compensation and/or management by the operating system (OS), thereby tending to increase the cost of scheduling and/or rendering these approaches difficult to scale to meet the needs of computer systems having a large number of resources and/or priority groups and/or processes. In particular, any scheme that relies on one CPU stealing processes from another ruins cache performance and temporarily stops work on all CPUs involved. Maintaining CPU affinity for individual processes and keeping intrusion minimal are critical for performance. But the fundamental premise of most of these schemes is that all processes are equivalent, which directly contradicts the need for fairness between the groups. The length of an individual queue may have no relevance to the relative priority of its processes. The need to address both axes, fairness and performance, simultaneously necessitates a new solution.