The present invention relates to scheduling software threads on multiprocessor systems with shared hardware resources. Scheduling (sometimes called placement) software is usually part of an operating system that runs on the computer systems that it manages. A software thread is a self-contained sequence of program instructions that may work with a self-contained set of data values. Multiple software threads running in a single software program may also share data.
Traditionally, processor chips, such as those used in computers and many other electronic systems, had a single processor core, including a central processing unit (CPU), an instruction pipeline, and usually a cache memory. In the context of this document, a CPU may also be referred to as a strand, where the strand contains the execution state of a running software thread and may include a set of registers.
As processor technology evolved, processor manufacturers introduced processor cores with multiple strands that share common resources such as the instruction pipeline and the cache memory. Each of the multiple strands could run a thread, so that multiple threads could be executed concurrently on one processor core. This technique is called Simultaneous Multithreading (SMT).
Evolution in processor technology also led to processor chips that included multiple processor cores, each with a single strand, an instruction pipeline, and a first-level cache, with the multiple cores often sharing a second-level cache. This technique is called Chip Multiprocessing (CMP).
Many modern processors combine both SMT and CMP in a single chip with multiple processor cores and multiple strands per core. Each core typically has its own dedicated instruction pipeline and first-level cache, while second- and/or third-level caches are often shared by some or all cores of the chip. This technique is sometimes referred to as Chip Multithreading (CMT).
Processors may include other performance-relevant hardware components on the chip such as translation lookaside buffers (TLB), floating-point units, graphics units, co-processors, cryptographic units, accelerators, or memory controllers. Each of these resources may be integrated into each core or shared by a group of cores or all cores of the chip.
U.S. Pat. No. 8,156,495, “Scheduling Threads on Processors” by Chew and Saxe, describes a processor group (PG), along with an abstraction to model the (potentially hierarchical) resource-sharing relationships of modern SMT/CMP processors. A PG is a group of CPUs (strands) that share one or more performance-relevant hardware resources. Multiprocessor hardware may be modeled as a hierarchical tree of PGs to describe simple or complex sharing relationships, for example CPUs of a core sharing a common execution pipeline and first-level cache in leaf PGs, or CPUs of multiple cores sharing a common second-level cache (but different execution pipelines and first-level caches) in an intermediate or root PG.
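The hierarchical PG model described above can be illustrated with a minimal sketch. The class name `ProcessorGroup`, its fields, and the example topology (two cores with two strands each, sharing one second-level cache) are hypothetical choices made for illustration only; they are not taken from the referenced patent or from any operating system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessorGroup:
    """A group of CPUs (strands) sharing one or more hardware resources."""
    name: str
    shared: List[str]  # resources shared by every CPU in this PG
    cpus: List[int] = field(default_factory=list)  # set on leaf PGs only
    children: List["ProcessorGroup"] = field(default_factory=list)

    def all_cpus(self) -> List[int]:
        """Return every CPU in this PG, including CPUs of descendant PGs."""
        result = list(self.cpus)
        for child in self.children:
            result.extend(child.all_cpus())
        return result


# Hypothetical chip: each leaf PG is a core whose strands share a pipeline
# and a first-level cache; the root PG groups the cores that share an L2.
core0 = ProcessorGroup("core0", ["pipeline", "L1"], cpus=[0, 1])
core1 = ProcessorGroup("core1", ["pipeline", "L1"], cpus=[2, 3])
chip = ProcessorGroup("chip", ["L2"], children=[core0, core1])

print(chip.all_cpus())
```

In this sketch, CPUs 0 and 1 share a pipeline and first-level cache, while all four CPUs share only the second-level cache, which mirrors the leaf and root PG relationships described in the preceding paragraph.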
Operating systems implement schedulers and dispatchers (commonly referred to as schedulers in this document) that place software threads onto hardware strands (or CPUs) for execution. Traditional schedulers have treated all CPUs of a computer system identically and ignored the performance-relevant hardware-sharing relationships of CPUs (for example, some CPUs sharing a particular hardware resource while some other CPUs do not).
This problem has been recognized for some time, and various parties have created approaches to address it. One such approach is described in the above patent by Chew and Saxe, which uses PG modeling to facilitate thread scheduling while considering the hardware resource-sharing relationships of CPUs. In this approach, the usage of a PG is defined as the number of running threads in that PG. It is incremented by one (for a leaf PG and all its parent PGs) when a thread starts executing in that leaf PG, and decremented by one (for a leaf PG and all its parent PGs) when a thread stops executing in that leaf PG. The capacity of a PG is defined as the number of strands (CPUs) in that PG. This approach further implements two load-balancing policies to determine the best PG for a thread to execute on: either traversing the PG hierarchy top-down to globally balance utilization, or traversing it bottom-up, starting the search with the PG the thread last executed on, to optimize for locality. For both policies, at each level of the PG hierarchy, the utilization of the PG under consideration is compared with that of one or more of its sibling PGs, each time choosing the lower-utilized PG (that is, the PG with the lower running-thread count).
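The usage accounting and the top-down policy described above might be sketched as follows. This is an illustrative reading of the description, not the implementation of the referenced patent: the names `PG`, `thread_start`, `thread_stop`, and `pick_top_down` are hypothetical, ties are broken by taking the first sibling (an assumption), and the bottom-up locality policy is omitted.

```python
from typing import List, Optional


class PG:
    """Processor group with a running-thread count (usage) and a capacity."""

    def __init__(self, name: str, cpus: int = 0,
                 children: Optional[List["PG"]] = None) -> None:
        self.name = name
        self.children: List["PG"] = children or []
        self.parent: Optional["PG"] = None
        for child in self.children:
            child.parent = self
        # Capacity is the number of strands (CPUs) contained in the PG.
        self.capacity = (sum(c.capacity for c in self.children)
                         if self.children else cpus)
        self.usage = 0  # number of running threads in this PG


def thread_start(leaf: PG) -> None:
    """Increment usage of the leaf PG and of all its parent PGs."""
    pg: Optional[PG] = leaf
    while pg is not None:
        pg.usage += 1
        pg = pg.parent


def thread_stop(leaf: PG) -> None:
    """Decrement usage of the leaf PG and of all its parent PGs."""
    pg: Optional[PG] = leaf
    while pg is not None:
        pg.usage -= 1
        pg = pg.parent


def pick_top_down(root: PG) -> PG:
    """Descend from the root, choosing the sibling with the lower
    running-thread count at each level (first sibling wins a tie)."""
    pg = root
    while pg.children:
        pg = min(pg.children, key=lambda child: child.usage)
    return pg


core0, core1 = PG("core0", cpus=2), PG("core1", cpus=2)
root = PG("chip", children=[core0, core1])
thread_start(core0)
print(pick_top_down(root).name, root.usage, root.capacity)
```

Note that the selection depends only on running-thread counts; nothing in this sketch observes how heavily the shared pipeline or caches are actually used.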
However, even though some current operating system schedulers are aware of the hardware resource-sharing relationships of the CPUs in the system, they implement scheduling policies (such as load balancing) based only on the running-thread count (“software utilization”) in each PG, and do not consider the actual resource usage of the PG's hardware components such as execution pipelines and caches. Furthermore, they assume all software threads to be identical with respect to hardware resource consumption, ignoring that some threads may have a higher demand for one hardware resource while other threads have a higher demand for other hardware resources. For example, one thread may have a higher demand for the execution pipeline, resulting in a high rate of committed instructions per cycle (high IPC), while another thread may have a higher demand for cache and memory, resulting in more memory-related stalls and consequently a lower rate of committed instructions per cycle (low IPC).
If scheduling policies do not consider the resource utilization of shared hardware components and the hardware resource demand of threads, thread scheduling decisions may be suboptimal. For example, a scheduler may place two threads with a high demand for the execution pipeline on the same core, and place two threads with a high rate of memory accesses on another core. If each core has a dedicated execution pipeline and first-level cache, such a placement may lead to contention on the execution pipeline on the first core and contention or a high first-level cache miss rate on the second core, while the complementary resource (first-level cache on the first core, execution pipeline on the second core) might remain underutilized. Such thread scheduling may result in poor application performance, because the threads' performance is degraded by contention on shared hardware resources and the processor's resources are used suboptimally.
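The placement example above can be made concrete with a toy calculation. The demand figures (80 and 20 percent of one core's capacity) and the additive oversubscription measure are assumptions invented purely for illustration; real contention behavior is far more complex and is not modeled here.

```python
from typing import List, Tuple

# Hypothetical thread profile: (pipeline demand, memory demand), each as a
# percentage of what a single core can supply.
Thread = Tuple[int, int]
COMPUTE: Thread = (80, 20)  # assumed high-IPC, pipeline-heavy thread
MEMORY: Thread = (20, 80)   # assumed low-IPC, cache/memory-heavy thread


def oversubscription(cores: List[List[Thread]]) -> int:
    """Sum, over all cores, the demand exceeding 100% for each resource."""
    total = 0
    for threads in cores:
        pipeline = sum(t[0] for t in threads)
        memory = sum(t[1] for t in threads)
        total += max(0, pipeline - 100) + max(0, memory - 100)
    return total


# Both placements put exactly two threads on each core, so a policy that
# balances on running-thread count alone cannot tell them apart.
like_with_like = [[COMPUTE, COMPUTE], [MEMORY, MEMORY]]
mixed = [[COMPUTE, MEMORY], [COMPUTE, MEMORY]]

print([len(c) for c in like_with_like], oversubscription(like_with_like))
print([len(c) for c in mixed], oversubscription(mixed))
```

Under these assumed numbers, pairing like with like oversubscribes one resource on each core while leaving the complementary resource mostly idle, whereas the mixed placement fits within both resources on both cores.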
Current processors are built with CPU hardware performance counters (CPCs) that provide information regarding the usage or utilization of the various shared hardware resources (“hardware usage” or “hardware utilization”). Through sampling of these counters, a scheduler may also obtain information about the hardware resource consumption of threads or applications executing on a system.
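As a rough illustration of how sampled counters could characterize a thread, the sketch below derives the committed instructions per cycle (IPC) from two successive samples. The `CounterSample` type, its field names, and the sample values are hypothetical; reading real counters requires a platform-specific interface, and counter wraparound is not handled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Snapshot of two hypothetical per-thread hardware counters."""
    instructions: int  # committed instructions so far
    cycles: int        # elapsed processor cycles so far


def ipc(prev: CounterSample, curr: CounterSample) -> float:
    """Instructions per cycle over the interval between two samples."""
    d_instr = curr.instructions - prev.instructions
    d_cycles = curr.cycles - prev.cycles
    if d_cycles <= 0:
        raise ValueError("samples must span a positive number of cycles")
    return d_instr / d_cycles


# Invented sample values for a pipeline-bound and a memory-bound thread.
start = CounterSample(instructions=1_000_000, cycles=2_000_000)
pipeline_bound = CounterSample(instructions=5_000_000, cycles=4_000_000)
memory_bound = CounterSample(instructions=1_500_000, cycles=4_000_000)

print(ipc(start, pipeline_bound))  # 2.0
print(ipc(start, memory_bound))    # 0.25
```

Working with deltas between samples, rather than absolute counter values, reflects that such counters are typically cumulative; the sampling interval itself is left open in this sketch.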
Previous research has proposed deploying applications on so-called staging systems to obtain their hardware resource consumption profiles, and then using those profiles to optimize their scheduling on production systems. However, this approach may be impractical for several reasons: the cost or effort of running and profiling an application on a staging system may be prohibitive; the application's characteristics may change over time; its traffic pattern or type of use may be unknown upfront; or staging and production systems may be based on different hardware platforms or generations. Especially with the evolution of cloud computing, where application ownership is in the hands of a tenant while application scheduling is performed by the service provider, a dedicated staging or profiling phase is often impractical.
Furthermore, applications may be heterogeneous, that is, composed of threads with different hardware resource requirements. While scheduling of applications is an infrequent task (for example, during application deployment), scheduling of threads may need to be performed at every context switch, potentially thousands or millions of times each second. Profiling individual threads in isolation is even more cumbersome than profiling entire applications in isolation.