In recent years, the computer hardware industry has begun full-scale production of multi-core processors, which are processors that contain multiple processing cores. The multiple processing cores enable multiple application threads to run simultaneously on a single processor and present the promise of improved power efficiency, increased hardware utilization and reduced cost due to elimination of duplicated components. Consequently, most major hardware vendors are shipping multi-core systems and have announced plans to release future models.
Each processor core has a multi-level cache memory system in which a separate first-level (L1) cache memory may be provided for each core. A second-level (L2) cache memory is shared among the cores and the threads that may be running on each processor core. In other architectures, both the L1 and L2 cache memories may be shared. Conventional CPU schedulers schedule threads based on assumptions that often do not apply on multi-core processors. For example, such schedulers typically assume that the CPU is a single, indivisible resource and that, if threads are granted equal time slices, those threads will share the CPU equally. However, on multi-core processors, concurrently running threads, called “co-runners”, often share a single cache memory, and cache allocation is controlled by the hardware.
FIG. 1A shows a pair of conventional processors 100 and 102 which are controlled by an operating system scheduler 101. The operating system scheduler 101 determines which threads run on the processors and the length of time, or time quantum, that each thread runs. Processor 100 includes a processor core 104. Processor core 104 accesses memory via a hierarchical cache memory comprising L1 cache memory 106 and L2 cache memory 108. It is assumed that the thread running on core 104 utilizes all of the L2 cache memory 108 as indicated by the shading of cache memory 108. A miss in the L2 cache memory 108 causes the processor core 104 to access main memory as indicated by arrow 110. Similarly, processor 102 includes a processor core 112. Processor core 112 accesses memory via a hierarchical cache memory comprising L1 cache memory 114 and L2 cache memory 116. It is assumed that the thread running on core 112 utilizes all of the L2 cache memory 116 as indicated by the shading of cache memory 116. A miss in the L2 cache memory 116 causes the processor core 112 to access main memory as indicated by arrow 118.
FIG. 1B shows a similar situation where the same threads that are running on processor cores 104 and 112 are, instead, running on processor cores 122 and 124 of multi-core processor 120. Processor cores 122 and 124 are, in turn, controlled by operating system scheduler 136, which determines the threads that run on the cores 122 and 124 and the amount of time that the threads run. In this situation, each of processor cores 122 and 124 has its own L1 cache memory 126 and 128, respectively. However, both cores 122 and 124 share the L2 cache memory 130. The cache memory shading illustrates the working set of the two threads running on cores 122 and 124. As with the situation shown in FIG. 1A, it is assumed that the working set of each thread is large enough to populate the entire L2 cache. However, when the threads become co-runners on the multi-core processor 120, the L2 cache memory 130 is not equally allocated among the threads, as illustrated by the shading 132 for the thread running on core 122 and the shading 134 for the thread running on core 124. Thus, cache sharing often depends solely on the cache needs of the co-runner(s), and unfair cache sharing occurs often. The cache occupancy of a thread affects its cache miss rate and, as a result, impacts the rate at which the thread retires instructions. Therefore, the CPU performance of a thread varies significantly depending on its co-runner.
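The link between cache miss rate and instruction retirement rate can be sketched with a simple additive stall model. The sketch below is illustrative only: the function name, the fixed miss penalty and all numeric values are assumptions chosen for the example and are not taken from the figures or from any measurement.

```python
def instructions_per_cycle(base_cpi, misses_per_instruction, miss_penalty_cycles):
    """Additive stall model (an assumption): every L2 miss adds a fixed
    main-memory penalty to the cycles needed per instruction."""
    cpi = base_cpi + misses_per_instruction * miss_penalty_cycles
    return 1.0 / cpi


# Hypothetical values: the same thread with a large and with a small share
# of the shared L2 cache. A smaller share is assumed to raise the miss rate.
ipc_large_share = instructions_per_cycle(1.0, 0.005, 200)
ipc_small_share = instructions_per_cycle(1.0, 0.02, 200)

print(f"IPC with large cache share: {ipc_large_share:.2f}")
print(f"IPC with small cache share: {ipc_small_share:.2f}")
```

Under these assumed numbers the thread retires instructions at well under half the rate when its co-runner leaves it the smaller cache share, even though its time slice is unchanged.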
Co-runner dependency is illustrated in FIGS. 2A and 2B. FIG. 2A depicts threads running in a processor with two processor cores operating with a conventional operating system scheduler in an ideal scenario where the L2 cache memory 130 is shared equally. The figure shows a graph 200 depicting cache memory allocation along the vertical axis and CPU time along the horizontal axis. There are three threads (A through C) running on the dual-core processor 120 and three CPU time slots 202-206 are illustrated. In the figure, a box corresponds to each thread. The height of the box indicates the amount of cache memory allocated to that thread. The width of the box indicates the CPU time quantum allocated to the thread. Accordingly, the area of the box is proportional to the amount of work completed by the thread. Thread boxes stacked on top of one another indicate co-runners. In this ideal situation, the L2 cache memory 130 is shared equally as indicated by the equal heights of the boxes.
The CPU latency of a thread is defined as the time to complete a logical unit of work and is a function of how efficiently the thread uses the CPU cycles it has been assigned and the length of time that thread runs on the CPU. In FIG. 2A, the work completed by thread A includes the shaded areas during time slots 202 and 206. Thus, assuming that a unit of work is equal to the sum of the two areas, the CPU latency of thread A is indicated by bracket 208.
FIG. 2B illustrates co-runner dependency in a processor with two processor cores operating with a conventional operating system scheduler. The figure also shows a graph 209 depicting cache memory allocation along the vertical axis and CPU time along the horizontal axis. The same three threads (A through C) as shown in FIG. 2A are running on the dual-core processor and four CPU time slots 210-216 are illustrated. As illustrated, thread B requires more L2 cache memory than thread A as shown by the increased height of its corresponding box. Thus, during time slot 210, thread A is cache-starved when it runs with thread B and suffers worse performance during time slot 210 than when it runs with thread C in time slot 212.
Consequently, thread A works less efficiently and completes less work per unit of time than it does in the situation shown in FIG. 2A because it does not get an equal share of the cache when running with thread B. As a result, thread A takes longer to complete the same amount of work; its CPU latency 218 is longer than its latency 208 under equal cache sharing.
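The box model of FIGS. 2A and 2B can be approximated in a few lines of code, treating the work completed in a time slot as the product of the slot duration and the cache share of the thread. This is a minimal sketch under stated assumptions: the function name, the timelines and the cache-share values are hypothetical and do not reproduce the exact slot layout of the figures.

```python
def cpu_latency(work_needed, timeline):
    """Return the elapsed time at which `work_needed` units are complete.

    `timeline` is a list of (duration, cache_share) pairs in schedule order.
    A cache_share of 0.0 marks a slot in which the thread is not running.
    Work per slot = duration * cache_share (the "area of the box").
    """
    elapsed = 0.0
    done = 0.0
    for duration, cache_share in timeline:
        slot_work = duration * cache_share
        if cache_share > 0 and done + slot_work >= work_needed:
            # The unit of work finishes partway through (or at the end of) this slot.
            return elapsed + (work_needed - done) / cache_share
        done += slot_work
        elapsed += duration
    raise ValueError("timeline too short to complete the work")


# Hypothetical schedule with equal sharing: the thread gets half the cache
# whenever it runs.
equal_sharing = [(1.0, 0.5), (1.0, 0.0), (1.0, 0.5)]

# Hypothetical schedule with a cache-hungry co-runner in the first slot:
# the thread gets only a quarter of the cache there.
unequal_sharing = [(1.0, 0.25), (1.0, 0.0), (1.0, 0.5), (1.0, 0.5)]

latency_equal = cpu_latency(1.0, equal_sharing)
latency_unequal = cpu_latency(1.0, unequal_sharing)
print(latency_equal, latency_unequal)
```

With these assumed values the same unit of work takes 3.0 time units under equal sharing and 3.5 under unequal sharing, mirroring the relationship between latencies 208 and 218.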
Co-runner-dependent performance variability can create several problems. For example, it can cause unfair CPU sharing. In particular, conventional schedulers ensure that equal-priority threads get equal shares of CPU time. However, on multi-core processors a thread's share of CPU time, and thus its forward progress, is dependent upon both its time slice and the cache behavior of its co-runners. Benchmark tests have shown that many programs often run much slower with one co-runner than with another co-runner. Another problem is poor priority enforcement. A priority-based scheduler on a conventional processor ensures that elevating the priority of a job results in greater forward progress for that job. On a multi-core processor, if the high-priority job is scheduled with ‘bad’ co-runners, it still may experience inferior rather than superior performance.
Another problem is inadequate CPU accounting. Specifically, on grid-like systems where users are charged for CPU hours, conventional scheduling ensures that jobs are billed in proportion to the amount of computation they accomplish. On multi-core processors, the amount of computation performed in a CPU hour varies depending on the co-runners, so that charging for CPU hours is not accurate.
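The accounting problem can be made concrete with a short calculation. The sketch below is only an illustration: the billing rate, the work units and the function name are hypothetical, and the amounts of work completed are assumed rather than measured.

```python
def cost_per_unit_of_work(cpu_hours, rate_per_hour, work_units):
    """Effective price paid for each unit of computation actually completed."""
    return (cpu_hours * rate_per_hour) / work_units


RATE_PER_CPU_HOUR = 1.0  # hypothetical billing rate

# Two identical jobs, each billed for one CPU hour. The second is assumed to
# complete less work because its co-runner occupied most of the shared cache.
cost_benign_co_runner = cost_per_unit_of_work(1.0, RATE_PER_CPU_HOUR, 100.0)
cost_cache_hungry_co_runner = cost_per_unit_of_work(1.0, RATE_PER_CPU_HOUR, 60.0)

print(cost_benign_co_runner, cost_cache_hungry_co_runner)
```

Both jobs receive the same bill, yet under these assumptions the second pays noticeably more per unit of computation, which is the inaccuracy described above.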
Thus, to achieve fair cache memory sharing on multi-core processors, such as processor 120, L2 cache memory allocation must be considered. However, fair cache memory sharing in a multi-core processor environment is more difficult than fair cache memory sharing in a shared-memory multiprocessor, where thread performance similarly depends on how much of the shared memory that thread is allocated. The fundamental difference is that the operating system software can observe and control memory allocation in a shared memory multiprocessor, but L2 cache allocation is accomplished with hardware and consequently is completely opaque to the operating system.
One conventional mechanism for ensuring fair cache sharing is to use cache memory that can be dynamically partitioned among threads. However, this mechanism requires special hardware that may not be available on all processors. Consequently, an operating system solution is preferable, because it can have a much shorter time-to-market than a hardware solution. Additionally, it may be preferable to implement resource-allocation policies in the operating system rather than in the hardware, because the operating system is responsible for allocating the majority of hardware resources and it has a chance to balance any conflicts that may arise between different allocation policies. Hardware has less flexibility in this respect.
A further conventional software mechanism for ensuring fair cache sharing is to implement “co-scheduling”. Co-scheduling aims to select the “right” co-runner for a thread. However, co-scheduling requires the ability to determine how the performance of a thread is affected by a particular co-runner. One problem with this solution is that it may not be possible to find a suitable co-runner. Further, co-scheduling is difficult to implement without inter-core communication, because the decision to schedule a thread on a particular core requires knowledge of what other threads are running on the other cores. Thus, the technique may not scale well as the number of processor cores increases.
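The information that co-scheduling depends on can be illustrated with a small sketch. It assumes, purely for illustration, that a table of pairwise slowdown factors is already available; the table contents, the function names and the selection rule are hypothetical and do not describe any particular co-scheduling implementation.

```python
def pick_co_runner(thread, candidates, slowdown):
    """Choose the candidate that slows `thread` down the least.

    `slowdown[(t, other)]` is the assumed factor by which `t` runs slower
    when `other` shares the cache with it (1.0 means no slowdown).
    """
    return min(candidates, key=lambda other: slowdown[(thread, other)])


def pairwise_entries_needed(num_threads):
    """Number of ordered (thread, co-runner) entries the table must hold."""
    return num_threads * (num_threads - 1)


# Hypothetical slowdown factors for three threads.
slowdown = {
    ("A", "B"): 1.6, ("A", "C"): 1.1,
    ("B", "A"): 1.2, ("B", "C"): 1.3,
    ("C", "A"): 1.1, ("C", "B"): 1.5,
}

choice_for_a = pick_co_runner("A", ["B", "C"], slowdown)
print(choice_for_a, pairwise_entries_needed(3))
```

Even in this simplified form, the selection requires an entry for every ordered pair of threads and knowledge of which candidates are currently available on the other cores, which hints at why the approach becomes harder to apply as the system grows.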