Operating System
A kernel is the executable software controlling the operation of a computer system. The kernel is loaded into main memory first on startup of a computer system and remains in main memory providing essential services, such as memory management, process and task management, and disk management. The kernel manages multiple aspects of process execution on a computer system. Processes may be typical programs such as word processors, spreadsheets, games, or web browsers. Processes are also underlying tasks executing to provide additional functionality to either the operating system or to the user of the computer. Processes may also be additional processes of the operating system for providing functionality to other parts of the operating system, e.g., networking and file sharing functionality.
The kernel is responsible for scheduling the execution of processes and managing the resources made available to and used by processes. The kernel also handles such issues as startup and initialization of the computer system.
As described above, the kernel is a central part of an operating system (OS). Additional software or code, e.g., a program, process, or task, is written for execution on top of or in conjunction with the kernel, to make use of kernel-provided services, information, and resources.
Threads
Processes executing on a processor, i.e., processes interacting with the kernel executing on a computer system, are also known as execution threads or simply “threads.” A thread is the smallest unit of scheduling on an OS. Normally, each process (application or program) has a single thread; however, a process may have more than one thread (sometimes thousands). Each thread can execute on its own on an operating system or kernel.
Load Balancing of OS
The kernel allocates threads for execution to different processors using a process known as load balancing. During typical load balancing of multiple processor computer systems, each processor is evaluated to determine the current load on the processor. The load on a particular processor is determined by counting the number of threads ready to run on the processor, e.g., the number of threads in a processor queue. Three load balancing methods are now described.
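The load measure described above can be illustrated with a minimal sketch. The run queues and thread names below are hypothetical and stand in for whatever per-processor queue structure a kernel actually maintains.

```python
from collections import deque


def processor_load(run_queue):
    """Load of one processor: the number of threads ready to run on it."""
    return len(run_queue)


# Hypothetical run queues for a 4-processor system (thread names are illustrative).
run_queues = [
    deque(["t1", "t2", "t3"]),
    deque(["t4"]),
    deque(),
    deque(["t5", "t6"]),
]

loads = [processor_load(q) for q in run_queues]
print(loads)  # [3, 1, 0, 2]
```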
Example of Load Balancing
Run-With-Parent
In Run-With-Parent load balancing, the new thread runs on the same processor as the thread that created it. This form of load balancing is easily implemented and runs very fast, but can produce large load imbalances when large numbers of new threads are not dissipated by some other means (usually separate load balancing software).
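As a rough sketch of this policy, the function below places each new thread on its parent's processor without examining any other processor. The function name and the list-of-counts representation of load are assumptions made for illustration only.

```python
def run_with_parent(parent_cpu, loads):
    """Place the new thread on the parent's processor; no other processor is examined."""
    loads[parent_cpu] += 1
    return parent_cpu


loads = [0, 0, 0, 0]
# One hypothetical parent on processor 2 creates five threads in a row.
placements = [run_with_parent(2, loads) for _ in range(5)]
print(loads)  # [0, 0, 5, 0]
```

With nothing else dissipating the new threads, all five remain on processor 2, which is the imbalance described above.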
Round-Robin
In Round-Robin load balancing, the new thread runs on a processor chosen from a list using a scheme such that no processor is chosen again until every processor has been chosen once. This form of load balancing is easily implemented and runs very fast, but has an inherent positive feedback loop such that this form of load balancing exaggerates small load imbalances until they become large load imbalances.
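The selection scheme can be sketched as a simple cycle over the processor list. This is an illustrative sketch, not a kernel implementation; `make_round_robin` is a hypothetical helper name.

```python
import itertools


def make_round_robin(num_cpus):
    """Return a chooser that visits every processor once before repeating any."""
    cycle = itertools.cycle(range(num_cpus))

    def choose():
        return next(cycle)

    return choose


choose = make_round_robin(4)
placements = [choose() for _ in range(8)]
print(placements)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Note that the chooser never consults the current load of any processor, which is why it is fast and also why it cannot react to an existing imbalance.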
Lightest-Loaded
In Lightest-Loaded load balancing, the new thread runs on the processor with the lightest current load. This form of load balancing is easily implemented and avoids the problem of load imbalances associated with the round-robin load balancing method, but can be expensive to run on systems with many processors because each act of thread creation requires that a processor check all the other processors to find the one with the lightest load. In addition to scaling poorly, each processor creating a new thread disturbs part of the cache of every other processor, resulting in lower system performance.
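A minimal sketch of this policy follows. The tie-breaking rule (lowest-numbered processor wins) is an assumption; the text does not specify one. The full scan over `loads` on every call is the per-creation cost that grows with the number of processors.

```python
def lightest_loaded(loads):
    """Scan every processor's load and place the new thread on the lightest one.

    Ties go to the lowest-numbered processor (an assumption for this sketch).
    """
    cpu = min(range(len(loads)), key=loads.__getitem__)
    loads[cpu] += 1
    return cpu


loads = [3, 1, 0, 2]
placements = [lightest_loaded(loads) for _ in range(4)]
print(placements)  # [2, 1, 2, 1]
print(loads)       # [3, 3, 2, 2]
```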
As described above, prior solutions include choosing the parent thread's processor (which can create load imbalances), choosing the lightest-loaded processor (which helps balance the load but has not performed or scaled well), and round-robin (which sounds fair, but which actually encourages load imbalances).
Load balancing problems show up as excessive kernel thread migration under both light and heavy loads. During experiments performed by the inventor, increased processor-to-processor thread and/or process migration was observed, even on lightly-loaded systems, on systems using one of the above-described load balancing techniques.
Heavy Loads (SDET)
Software Development Environment Throughput (SDET) is a widely-used benchmark designed to simulate a timeshare system under heavy load by many typical users running many generally short-lived programs. As a result, the fork/exec/exit paths are used extensively.
Thread migration occurs under the following circumstances:
1. a thread binds itself to a processor it is not currently running on;
2. a thread performs an I/O operation whose driver is bound to another processor;
3. an idle processor steals a thread waiting on another processor's run queue; and
4. the load balancer moves a thread from a heavily-loaded processor to a lightly-loaded processor.
Inventor measurements revealed that circumstances 1 and 2 above do not occur while running SDET. Circumstance 3 occurs only during cool-down at the end of the run and is not a factor during the main part of the run. Therefore, circumstance 4 was investigated more closely.
In the absence of real-time threads (as in SDET), the load balancer only moves threads from processor to processor when the distribution of threads waiting to run wanders far (enough) out of balance, i.e., beyond a predetermined threshold.
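The threshold behavior can be sketched as follows. This is an illustrative model only: the choice of moving one thread at a time from the heaviest to the lightest processor, and the names `rebalance` and `threshold`, are assumptions rather than details given in the text.

```python
def rebalance(loads, threshold):
    """Move threads from the heaviest to the lightest processor, one at a time,
    while the gap between them exceeds the threshold.

    Returns the number of thread migrations performed.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1 for the loop to settle")
    migrations = 0
    while max(loads) - min(loads) > threshold:
        src = loads.index(max(loads))  # heaviest processor
        dst = loads.index(min(loads))  # lightest processor
        loads[src] -= 1
        loads[dst] += 1
        migrations += 1
    return migrations


balanced = [3, 2, 3, 2]
print(rebalance(balanced, threshold=2))  # 0: within the threshold, nothing moves

skewed = [6, 2, 3, 1]
print(rebalance(skewed, threshold=2))    # 2
print(skewed)                            # [4, 3, 3, 2]
```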
With this in mind, the number of threads waiting to run on each processor at each second during an SDET run was measured on a four-processor computer system using two different load balancing techniques. FIG. 1 is a graph of the measurements for the four-processor system using a lightest-loaded technique. FIG. 2 is a graph of the same measurement for a similar run on the system using a round-robin technique. Similar results were measured on a different 4-way (i.e., 4 processor) system and an 8-way (i.e., 8 processor) system, as well as on a 32-way (i.e., 32 processor) system.
The load variation in the second run, shown in FIG. 2, is greater than in the first run, shown in FIG. 1. The root cause is the thread launch policy, which changed between the two runs. Three of the thread launch policies investigated are Father Knows Best (in which the new thread starts on the same processor as its parent, as in the Run-With-Parent method described above), Lightest-Loaded, and Round-Robin.
Father Knows Best is easily implemented, runs very fast, and provides good cache behavior because common UNIX idioms (like “a|b”) are implemented to share processor caches. The main fault of the Father Knows Best policy is a heavy reliance on the load balancer to prevent large load imbalances.
Lightest-Loaded does not depend on the load balancer to correct load imbalances, but does not scale well. This is because every processor's load statistics are scanned on every fork, so the cache lines holding the processor queue data are constantly bouncing from processor to processor when the system forks frequently (as happens in SDET).
At first, Round-Robin may seem a fair way to solve these problems. After all, each processor is handed the same amount of work so the load balancer is not heavily relied on, and it should scale well because the implementation is simple and doesn't require any processor to look at another processor's resources. In fact, Round-Robin is not fair at all. During an SDET run, Round-Robin ensures that the thread creation rate on each processor is equal to that on every other processor. However, any processor with a higher-than-average load will have a thread extinction, or termination, rate lower than average. This is because the increased load means each thread takes longer to finish. Therefore, the load on processors with slightly higher than average loads increases over time.
Similarly, any processor with a lower-than-average load has a thread extinction rate higher than average. This is because the decreased load means each thread finishes sooner. Therefore, the load on processors with slightly lower than average loads decreases over time.
This is a positive feedback loop. Heavily-loaded processors get even heavier loads and lightly-loaded processors get even lighter loads. Any small load imbalance is amplified using a Round Robin launch policy.
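The feedback loop can be illustrated with a toy simulation. The model below is an assumption made purely for illustration: it takes the premise stated above (the extinction rate falls as load rises) and encodes it with a hypothetical formula, `base_rate / (1 + overhead * load)`. The parameter values are arbitrary and are chosen so that a load of 10 is exactly in equilibrium with the creation rate.

```python
def simulate_round_robin(loads, creation_rate, ticks, base_rate=2.0, overhead=0.1):
    """Toy model: every processor receives the same number of new threads per tick,
    while its extinction rate falls as its load rises (assumed formula)."""
    loads = [float(x) for x in loads]
    for _ in range(ticks):
        for cpu, load in enumerate(loads):
            finished = base_rate / (1.0 + overhead * load)  # assumed extinction rate
            loads[cpu] = max(0.0, load + creation_rate - finished)
    return loads


start = [9, 10, 10, 11]  # a small initial imbalance around the equilibrium load
final = simulate_round_robin(start, creation_rate=1.0, ticks=20)

print("initial gap:", max(start) - min(start))
print("final gap:  ", round(max(final) - min(final), 2))
```

Under these assumptions, the processors that start at the equilibrium load stay there, while the slightly heavier processor drifts upward and the slightly lighter one drifts downward, so the gap widens on every tick.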
The load balancer attempts to even things out, but can't keep up. One reviewed solution, having the load balancer run more often or making it move more threads per pass, does not help. It only increases the excessive thread migration and lowers throughput.
Light Loads
A round-robin launch policy explains excessive thread migration on a very lightly loaded system. In an exemplary 4-way system, one compute-bound program is running and a few users at workstations are typing an occasional simple, short-lived command (ls, for example).
The round-robin launch policy starts each of these commands on a different processor as time passes. Every fourth launch (on average), the launch policy chooses the processor where the compute-bound process is running. Because the compute-bound process has used a lot of time recently, its priority is weak, i.e., the process will be less likely to obtain processor time. For similar reasons, the new process's priority is strong. The new process runs immediately, preempting the compute-bound process. The compute-bound process is returned to its run queue, where it is immediately moved to an idle processor. After a few more forks, the launch policy again selects the processor running the compute-bound process, and that process is migrated to yet another processor. This continues indefinitely, with the compute-bound process being moved among the processors.
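The light-load scenario can be sketched as a small simulation that counts how often the compute-bound process is displaced under each launch policy. Several details are assumptions for illustration: the shell (parent) is placed on processor 1, each short command finishes before the next launch, and a displaced compute-bound process moves to the processor just before the one that preempted it. The exact migration count depends on that last choice.

```python
def count_migrations(policy, num_cpus=4, launches=12):
    """Count migrations of one compute-bound process while short commands launch."""
    hog_cpu = 0    # processor running the compute-bound process
    shell_cpu = 1  # hypothetical processor running the users' shell (the parent)
    next_rr = 0    # round-robin position
    migrations = 0
    for _ in range(launches):
        if policy == "round_robin":
            target = next_rr
            next_rr = (next_rr + 1) % num_cpus
        elif policy == "run_with_parent":
            target = shell_cpu
        elif policy == "lightest_loaded":
            # Only the compute-bound process's processor is loaded,
            # so any other processor is lighter.
            target = next(c for c in range(num_cpus) if c != hog_cpu)
        else:
            raise ValueError(f"unknown policy: {policy}")
        if target == hog_cpu:
            # The new command preempts the compute-bound process, which is
            # then moved to an idle processor (assumed choice: the previous one).
            hog_cpu = (target - 1) % num_cpus
            migrations += 1
    return migrations


results = {
    p: count_migrations(p)
    for p in ("round_robin", "run_with_parent", "lightest_loaded")
}
print(results)
```

In this sketch only the round-robin policy ever lands a new command on the compute-bound process's processor, so it is the only policy that produces migrations.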
The Father Knows Best launch policy does not exhibit this problem because the new thread starts on the same processor as its parent, which cannot be the processor where the compute-bound process is running (since the parent and the compute-bound process are both running at the same time).
The Lightest-Loaded launch policy does not exhibit this problem because the new thread is not started on the processor where the compute-bound process is running because this processor does not have the lightest load.