Field of the Invention
This invention relates to schedulers as found in modern operating systems and in particular to a scheduler for use in a computer system with a multi-threaded and/or multi-core architecture.
Background Art
As is well known, modern computer systems consist of one or more central processing units (CPUs), supporting hardware such as memory and a memory management unit (MMU) for each CPU, and less essential peripheral hardware such as I/O devices (network interfaces, disks, printers, etc.). Software is also part of a computer system; typically, a software application provides the ultimate utility of the computer system for users.
Users often want to use more than one of these software applications, perhaps concurrently. To make this possible, software applications are typically written to run on top of a more privileged piece of software, often known as the “operating system” (OS), which resides, logically, as or in an intermediate software layer, between the applications and the underlying hardware. The OS uses a more privileged mode of the CPU(s), so that it can perform operations which software applications cannot. One of the main jobs of the OS is to coordinate the access by the various applications to shared system resources.
Scheduler
Given multiple applications that are to share some system resource, such as CPU or I/O access, some mechanism must exist to coordinate the sharing. In modern OSs, this mechanism is usually called a “scheduler,” which is a program that coordinates the use of shared resources according to certain rules programmed into the scheduler by the designer.
The most fundamental shared resource is access to the CPU(s), since such access is required for execution of any code. Almost all modern operating systems export some notion of “task” or “process,” which is an abstraction of a CPU and memory. A task is conceptually similar to an execution vehicle, and typically corresponds to a single logical activity that requires computational resources (memory, CPU, and I/O devices) to make forward progress. The operating system multiplexes these tasks onto the physical CPUs and other physical resources of the system.
Each task usually comprises one or more execution abstractions known as “threads.” A thread typically includes its own instruction pointer and sometimes has its own stack. Typically, access to a CPU is scheduled per-thread. A task is thus an environment in which one or several threads are scheduled independently to run on the CPU(s), and not necessarily all (or even more than one) at a time even in multi-processor architectures.
A standing goal of all computer design, of both hardware such as CPUs and software such as OSs, is to enable applications to run as fast and as efficiently as possible, even when sharing system resources, including the CPU(s). One way to accomplish this is of course through the design of the applications themselves. Another way is through efficient design of the OS, which usually entails computing an efficient schedule for executing threads. A specific scheduling problem is discussed below, but before this it is helpful also to consider some of the different hardware techniques that are being employed to increase overall execution speed, since these hardware choices also impact the problem of scheduling.
Multiprocessor Architectures
Most personal computer systems are equipped with a single CPU. Because CPUs today are quite fast, a single CPU often provides enough computational power to handle several “concurrent” execution threads by rapidly switching from thread to thread, or even task to task (a procedure sometimes known as time-slicing or multiprogramming). This management of concurrent threads is one of the main responsibilities of almost all operating systems.
The use of multiple concurrent threads often allows an overall increase in the utilization of the hardware resources. The reason is that while one thread is waiting for input or output to happen, the CPU may execute other “ready” threads. However, as the number of threads, or the workload within each thread, increases, the point may be reached where computational cycles, i.e., CPU power, are the limiting factor. The exact point where this happens depends on the particular workloads.
To permit computer systems to scale to larger numbers of concurrent threads, systems with multiple CPUs have been developed. These symmetric multi-processor (SMP) systems are available as extensions of the PC platform and from other vendors. Essentially, an SMP system is a hardware platform that connects multiple processors to a shared main memory and shared I/O devices. In addition, each processor may have private cache memory. The OS, which is aware of the multiple processors, allows truly concurrent execution of multiple threads, typically using time-slicing only when the number of ready threads exceeds the number of CPUs.
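The multiplexing of ready threads onto several CPUs can be illustrated with a minimal sketch. It assumes a simple round-robin policy with equal quanta; the function name `run_quanta` and its parameters are hypothetical and do not describe any particular OS scheduler.

```python
from collections import deque


def run_quanta(ready_threads, num_cpus, num_quanta):
    """Return, for each quantum, the list of threads placed on the CPUs.

    Assumption: plain round-robin with equal quanta and no blocking or
    priorities; real schedulers are considerably more elaborate.
    """
    queue = deque(ready_threads)
    schedule = []
    for _ in range(num_quanta):
        # Place at most one ready thread on each CPU for this quantum.
        running = [queue.popleft() for _ in range(min(num_cpus, len(queue)))]
        schedule.append(running)
        # Threads that just ran rejoin the back of the ready queue.
        queue.extend(running)
    return schedule


# Two ready threads on two CPUs: both run in every quantum.
two_on_two = run_quanta(["A", "B"], num_cpus=2, num_quanta=2)

# Three ready threads on two CPUs: the threads must take turns.
three_on_two = run_quanta(["A", "B", "C"], num_cpus=2, num_quanta=3)
```

In this sketch, time-slicing only becomes visible in the second call, where the number of ready threads exceeds the number of CPUs.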
Multi-Core Architectures
Because of advances in manufacturing processes, the density of semiconductor elements per chip has now grown so great that “multi-core” architectures have been made possible; examples include the IBM POWER4 and POWER5 architectures, as well as the Sun UltraSparc IV. In these devices, more than one (at present, two, although this is a currently practical rather than a theoretical limitation) physical CPU is fabricated on a single chip. Although each CPU can execute threads independently, the CPUs share at least some cache and in some cases even other resources. Each CPU is provided with its own set of functional units, however, such as its own floating-point and arithmetic/logic units (ALU). Essentially, a multi-core architecture is a multi-processor on a single chip, although with limited resource sharing. Of course, the OS in such a system will be designed to schedule thread execution on one of the multi-core CPUs.
Simultaneous Multi-Threaded (SMT) Architectures
Still another modern technique that provides for simultaneous execution of multiple threads is referred to as “simultaneous multi-threading,” in which more than one logical processor (hardware thread) operates simultaneously on a single chip, but in which the logical processors must flexibly share not only one or more caches (for example, for data, instructions and traces), but also functional units such as the floating-point unit and the ALU, as well as the translation lookaside buffer (TLB), if the TLB is shared.
As one example of an SMT architecture, Intel Corporation has developed its “Hyper-Threading Technology” to improve the performance of its Pentium 4 and Xeon processor lines. In Intel's terminology, the single chip is referred to as a “package.” While multi-threading does not provide the performance of a true multi-processor or multi-core system, it can improve the utilization of on-chip resources, leading to greater throughput for several important workload types, by exploiting additional instruction-level parallelism that is exposed by executing the instruction streams associated with multiple threads concurrently.
To understand the performance implications of simultaneous multi-threading, it is important to understand that most internal processor resources are shared between the two executing threads. For instance, in the Intel architecture, the L1, L2 and L3 caches and all functional units (such as the floating point units and arithmetic/logical units) are flexibly shared between the two threads. If one thread is using very little cache, then the other thread will be able to take advantage of all the unused cache space. However, if both threads demand large amounts of cache, they will compete for the limited capacity and likely slow each other down.
In an SMT system, the OS designates which software threads the logical processor(s) are to execute, and can also issue commands to cause an idle logical processor to be put in a halt state, such that its execution resources are made available for use by any remaining logical processors. Once threads are scheduled for execution on a multi-threaded hardware processor, internal mechanisms of the processor control use of the shared resources by the executing threads. At any time, the operating system can preempt a thread, that is, force it to give up the CPU on which it is running, in order to run another thread (perhaps one that has not run for some time, or one that the user has given a higher priority to). Putting a processor into the halt state typically involves preempting the running thread and instead scheduling on that processor a dedicated idle thread. This idle thread may use a processor-specific method to make the execution resources from the hardware context available to other threads in the same functional processor group. For instance, on the Intel IA-32 architecture, the idle thread may issue the “HLT” instruction.
Because at least one resource is shared between the logical processors of a multi-threaded system, the problem can arise that one thread might be “anti-cooperative,” meaning that it does not conform to a predetermined notion of “fairness.” Examples of anti-cooperative execution behavior include using so much of the shared resource, otherwise “hoarding” it, or causing some other state change in it, such that a co-executing thread cannot execute as efficiently as it would if it had exclusive or at least “normal” use of the resource, or such that hardware or software intervention is required. In extreme cases, one thread could theoretically even completely prevent another thread from making forward execution progress, that is, “starve” it, for lack of the shared resource.
One example of this problem is described by Dirk Grunwald and Soraya Ghiasi in “Microarchitectural denial of service: insuring microarchitectural fairness,” International Symposium on Microarchitecture, Proceedings of the 35th annual ACM/IEEE International Symposium on Microarchitecture, Istanbul, Turkey, pp. 409-18, 2002. Although most anti-cooperative applications in the specific SMT architecture they studied caused performance degradations of less than five percent, Grunwald and Ghiasi showed that a malicious application could degrade the performance of another workload running on the same physical package by as much as 90% through, for example, the use of self-modifying code in a tight loop.
Existing OS schedulers are not designed to cope with such problems as a microarchitectural denial of service conflict (or outright attack); rather, known schedulers may adjust the amount of execution time allocated to each of a set of runnable threads, but this ignores that the allotted execution time of a given thread may be wasted because of the actions of a co-executing, anti-cooperative thread. For example, as Grunwald points out, self-modifying code can lead to frequent complete flushes of a shared trace cache, which means that the cached information of the other running thread will also be lost, such that many processing cycles are needed to build it back up again, over and over. Even though the “nice” thread will have its allotted execution time, it will not be able to use it efficiently and the OS scheduler will not be able to do anything to improve the situation, assuming that the scheduler detects the situation at all.
Grunwald offers four possible solutions to the problem of microarchitectural denial of service. First, Grunwald detects the need for intervention using various mechanisms such as performance counters, computing a function of committed instructions, and monitoring bad events such as cache and pipeline flushes. Then he applies one of four proposed “punition” mechanisms, all of which involve either stalling or suspending offending threads, or specifically modifying the OS kernel so that it changes the scheduling interval of an attacking thread. Even Grunwald acknowledges the inadequacy of his proposed software solutions, however, stating that “we think it is better to implement them in microarchitecture” in order to provide “compatibility across a number of operating systems, eliminating processor-specific features.”
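The kind of counter-based detection mentioned above can be sketched as follows. This is an illustrative assumption only, not Grunwald's actual detection function: the `CounterSample` fields, the function name `is_anti_cooperative`, and the threshold of one bad event per thousand committed instructions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical per-thread counter readings over one sampling interval."""
    committed_instructions: int
    pipeline_flushes: int
    cache_flushes: int


def is_anti_cooperative(sample, max_bad_events_per_kilo_instr=1.0):
    """Flag a thread whose rate of bad events exceeds an assumed threshold.

    The metric (flushes per 1000 committed instructions) and the default
    threshold are illustrative choices, not values from the cited work.
    """
    bad_events = sample.pipeline_flushes + sample.cache_flushes
    if sample.committed_instructions == 0:
        # No forward progress at all: any bad event is treated as suspicious.
        return bad_events > 0
    rate = 1000.0 * bad_events / sample.committed_instructions
    return rate > max_bad_events_per_kilo_instr


well_behaved = CounterSample(1_000_000, pipeline_flushes=10, cache_flushes=5)
offender = CounterSample(100_000, pipeline_flushes=4_000, cache_flushes=1_000)
```

Here `is_anti_cooperative(well_behaved)` is false and `is_anti_cooperative(offender)` is true; what the scheduler then does with a flagged thread is a separate policy decision.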
In general, to the small extent that system designers have recognized and addressed the problem of anti-cooperative processes in multi-threaded environments at all, the solutions have focused either on hardware support, or on ways for the OS scheduler to detect anti-cooperativeness and to adjust the execution time slice given to currently offending processes. One solution proposed by Allan Snavely and Dean M. Tullsen in “Symbiotic jobscheduling for a simultaneous multithreaded processor,” ACM SIGOPS Operating Systems Review, v.34 n.5, p. 234-244, December 2000, involves an “SOS” (Sample, Optimize, Symbios) scheduler that samples the space of possible schedules, examines performance counters and applies heuristics to guess an optimal schedule, then runs the presumed optimal schedule.
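The sample-then-select structure of such a scheduler can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Snavely and Tullsen's implementation: the function name `sos_pick`, the `measure` callback (standing in for a short trial run that reads performance counters), and the single-score selection rule are hypothetical.

```python
import random


def sos_pick(candidate_schedules, measure, num_samples, rng=None):
    """Sample some candidate schedules, score each, and return the best.

    candidate_schedules: list of schedules (here, tuples of thread names).
    measure: hypothetical callback returning a score for one schedule,
             e.g. throughput observed during a brief trial run.
    """
    if not candidate_schedules:
        raise ValueError("no candidate schedules to sample")
    rng = rng or random.Random(0)
    # Sample phase: try only a subset of the space of possible schedules.
    k = min(num_samples, len(candidate_schedules))
    sampled = rng.sample(candidate_schedules, k)
    # Optimize phase: guess that the best-scoring sample is the best schedule.
    return max(sampled, key=measure)


# Assumed trial-run scores for three possible pairings of threads A, B, C.
candidates = [("A", "B"), ("A", "C"), ("B", "C")]
trial_scores = {("A", "B"): 1.2, ("A", "C"): 1.9, ("B", "C"): 0.7}
chosen = sos_pick(candidates, trial_scores.__getitem__, num_samples=3)
```

The returned schedule would then be run in the final phase; the choice is only as good as the scores gathered while sampling.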
In a refinement, described by Allan Snavely, Dean M. Tullsen and Geoff Voelker in “Symbiotic jobscheduling with priorities for a simultaneous multithreading processor,” ACM SIGMETRICS Performance Evaluation Review, v.30 n.1, June 2002, Snavely et al. incorporate the notion of priorities into the scheduling decisions, such that if a particular thread has a high enough priority, then idle threads are scheduled to run alongside it in the same package so that it is guaranteed enough CPU time.
One problem with both of Snavely's approaches is the Sample and Optimize phases, during which the processors are devoted to test cases. Only in a later phase are threads actually allowed to run so as to do the work they are intended to do. Because Snavely's method is two-pass, it is not suitable for detecting and alleviating anti-cooperative behavior at actual run time.
Yet another disadvantage of Snavely's approaches is that his systems do not directly attempt to determine anti-cooperative behavior. Because of this, threads that appeared to run well together during the Sample and Optimize phases may not do so when actually running under normal conditions. In other words, Snavely assumes that threads will cooperate as well during actual “working” execution as they did during the Sample phase, but this assumption may not be correct: Snavely cannot detect and deal with previously undetected, run-time anti-cooperativeness.
Snavely's scheduler attempts to optimize how much CPU time each thread will get. In the presence of run-time anti-cooperative execution behavior, however, merely allocating more CPU time to a thread does not ensure optimal execution progress. As Grunwald points out, even very small thread segments (with self-modifying code, for example) can cause severe performance degradation of another running thread, such that merely reducing allocated time may not eliminate the problem. For example, a thread may have 90% of the total CPU time, but the 10% used by another, co-scheduled and highly anti-cooperative thread might cause much of the first thread's 90% to be wasted recovering from the resource hoarding of the anti-cooperative thread. Merely adjusting the amount of time allocated to a given thread therefore ignores the unique features of the SMT architecture, in particular, the presence of more than one logical processor, and simply applies a solution that is also applicable to standard, single-processor systems.
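The 90%/10% example above can be made concrete with a small calculation. The model is an assumption introduced purely for illustration: useful progress is taken to be the allotted share of CPU time multiplied by an efficiency factor, and the efficiency of 0.1 borrows the worst-case 90% degradation reported by Grunwald and Ghiasi. The function name `useful_share` is hypothetical.

```python
def useful_share(cpu_share, efficiency):
    """Fraction of total CPU time that becomes forward progress.

    Assumed model: progress = allotted share x efficiency, where
    efficiency is 1.0 when nothing interferes with the thread.
    """
    if not (0.0 <= cpu_share <= 1.0 and 0.0 <= efficiency <= 1.0):
        raise ValueError("cpu_share and efficiency must lie in [0, 1]")
    return cpu_share * efficiency


# Victim thread holds 90% of the CPU time and runs undisturbed.
undisturbed = useful_share(0.90, 1.0)

# Same 90% share, but degraded by 90% by an anti-cooperative co-runner.
degraded = useful_share(0.90, 0.10)

# Giving the victim 95% instead barely helps if the degradation persists.
more_time = useful_share(0.95, 0.10)
```

Under these assumptions the victim's useful progress drops from 0.90 to roughly 0.09 of the total CPU time, and raising its share to 95% recovers only about 0.005 more, which is why adjusting time allocation alone does not address the problem.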
Conversely, an anti-cooperative process is not necessarily malicious and may in fact be one that the user wants to have run quickly, perhaps even with a higher priority than other runnable threads. For example, a user may know that a particular important process contains self-modifying code in a tight loop, or has in the past caused problems for co-scheduled threads in an SMT architecture. Stalling or suspending this thread would benefit other threads, but would lead to a worse result from the user's perspective.
Proposed mechanisms for dealing with the problem of shared resource hoarding in multi-threaded architectures fail to provide the user with any ability to influence how the OS addresses the problem. It would thus be beneficial to enable the user to control at least some of the decision about what to do in the presence of an anti-cooperative process in a multi-threaded architecture.
What is needed is a mechanism that more efficiently addresses the problem of anti-cooperative and malicious threads in multi-threaded processor architectures, and that preferably does so with no need for hardware support other than that already provided by the multi-threaded processor. Optionally, it would also be beneficial to give the user at least some control over the mechanism.