1. Field of Invention
The present invention relates generally to the field of computer system memory and pertains more particularly to an apparatus for and a method of memory-affinity process scheduling in CC-NUMA systems.
2. Discussion of the Prior Art
Modern computer systems often comprise multiple forms and locations of memory. The memory subsystem is typically organized hierarchically, for example, from cache memory of various levels at the top to main memory and finally to hard disk memory. A processor in search of data or instructions looks first in the cache memory, which is closest to the processor. If the information is not found there, then the request is passed next to the main memory and finally to the hard disk. The relative sizes and performance of the memory units are conditioned primarily by economic considerations. Generally, the higher the memory unit is in the hierarchy, the higher its performance and the higher its cost. For reference purposes, the memory subsystem will be divided into “caches” and “memory.” The term memory will cover every form of memory other than caches. Information that is frequently accessed is stored in caches and information that is less frequently accessed is stored in memory. Caches allow higher system performance because the information can typically be accessed from the cache faster than from the memory. Relatively speaking, this is especially true when the memory is in the form of a hard disk.
For example, turning first to FIG. 1, a block diagram of a Cache Coherent Non-Uniform Memory Access (CC-NUMA) system 10 including a network 12 that is interfaced to multiple nodes 14 is shown. In this instance, N nodes, where N is greater than or equal to four, are implied by the numbering from Node 0 to Node (N−1). Since in general all of the nodes are alike, only four nodes are shown for convenience. Based on the discussion that follows, one of ordinary skill in the art will realize that the present invention will perform on any system having two or more nodes. Each node includes a processor 16, a cache 18, a memory controller 20, and a memory 22 connected as shown. The memory controller for each node is connected to the network. The network operates based on any conventional protocol.
A cache consists of a cache data portion and a cache tag portion. The cache data portion contains the information that is currently stored in the cache. The cache tag portion contains the addresses of the locations where the information is stored. Generally, the cache data will be larger than the cache tags. The cache data and the cache tags will not necessarily be stored together, depending on the design. When a specific piece of information is requested, one or more of the cache tags are searched for the address of the requested information. Which cache tags are searched will depend on the cache design. If the address of the requested information is present in the cache tags, then the information will be available from that address in the cache data. If the address is not present, then the information may be available from memory.
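The tag search described above can be sketched as follows. This is a minimal illustration only, assuming a direct-mapped design in which exactly one cache tag is searched per request; the line size, the number of lines, and the names `DirectMappedCache`, `lookup`, and `fill` are hypothetical and are not taken from any particular cache design.

```python
from typing import Optional

LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_LINES = 8    # number of lines in this toy cache (assumed)


class DirectMappedCache:
    """Toy direct-mapped cache with separate tag and data portions."""

    def __init__(self) -> None:
        # Cache tag portion: the line address held by each slot, if any.
        self.tags = [None] * NUM_LINES
        # Cache data portion: the information itself.
        self.data = [None] * NUM_LINES

    def lookup(self, address: int) -> Optional[bytes]:
        """Return the cached line on a hit, or None on a miss."""
        line_address = address // LINE_SIZE
        index = line_address % NUM_LINES  # the single tag this design searches
        if self.tags[index] == line_address:
            return self.data[index]
        return None  # not present; the information may be available from memory

    def fill(self, address: int, line: bytes) -> None:
        """Install a line fetched from memory, replacing the previous occupant."""
        line_address = address // LINE_SIZE
        index = line_address % NUM_LINES
        self.tags[index] = line_address
        self.data[index] = line
```

In a set-associative design, `lookup` would instead compare the request against every tag in the selected set, which is one way in which the tags searched depend on the cache design.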
In general, there are two cache applications that will be considered. First, there are caches integral with or local to a node and interfaced to a processor. Second, there are caches external to or remote from a node and interfaced with a network. Caches must be designed in such a way that their latency meets the timing requirements of the requesting components such as the processor or the network. For example, consider the design of the network. A processor or other agent on the network that requires a specific piece of information will issue what is known as a miss by placing the address of the information on the network. This leg is known as the address phase. Subsequently, all caches or other agents attached to the network must indicate whether the information at the issued address is located there. This leg is known as the snoop phase. Typically, the network design specifies that the cache must supply its snoop response within a fixed time interval after the address has been issued on the network. A cache that is not designed to satisfy this timing requirement will lead to sub-optimal usage of the network, thus lowering system performance.
Of course, remote memory has a longer access time than local memory. On most conventional CC-NUMA systems, the difference in memory latency between a remote miss and a local miss may be a factor of two or greater. The overall system performance can therefore be significantly influenced by the local miss ratio which is defined as:

\[
\text{Local Miss Ratio} = \frac{\text{Number of Local Misses}}{\text{Number of Total Misses}}. \tag{1}
\]
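To make the effect of the local miss ratio concrete, the following sketch computes the ratio and an average miss latency. The weighted-average latency model and the latency figures of 100 ns (local) and 200 ns (remote) are assumptions chosen only to match the factor of two mentioned above; they are not measurements of any system, and the function names are hypothetical.

```python
def local_miss_ratio(local_misses: int, total_misses: int) -> float:
    """Number of local misses divided by number of total misses."""
    if total_misses <= 0:
        raise ValueError("total_misses must be positive")
    return local_misses / total_misses


def average_miss_latency(ratio: float,
                         local_latency_ns: float = 100.0,    # assumed value
                         remote_latency_ns: float = 200.0    # assumed value
                         ) -> float:
    """Assumed model: average of local and remote latency weighted by the ratio."""
    return ratio * local_latency_ns + (1.0 - ratio) * remote_latency_ns
```

Under these assumed figures, a process with 750 local misses out of 1000 sees an average miss latency of 125 ns, whereas the same process with only 250 local misses sees 175 ns.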
The local miss ratio is influenced by several factors, including memory page placement in the memory of the system. It is also influenced by the operating system's scheduling of processes for processor time. To ensure fairness among several concurrently executing application programs and to reduce idle time of the processor, the operating system may move a process from one node to another during its execution. Since the node on which the process executes determines whether a cache miss is local or remote, the influence of the process scheduling policy on the local miss ratio can be significant.
Conventional process scheduling policies do not incorporate support for NUMA and are often derived from the traditional Unix scheduling framework. As in the traditional Unix framework, ready processes are placed in one of several run-queues. A distinct set of run-queues exists for every processor. When an application is created, processes are assigned to processors using a round-robin or other such policy. Based on the scheduling policy, processes are chosen from the run-queues for execution on the processors.
Conventional load balancing is performed during the execution of the application. At each load balance event, the number of processes in the run-queues of each processor is examined. If the variation in the load between the processors is sufficiently high, then a process is moved from the most heavily loaded processor to a less loaded processor. Apart from such synchronized load balancing, a processor can also steal a process from the run-queues of another processor if its own run-queues are empty.
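The conventional load balancing and process stealing just described can be sketched as follows. This is a simplified illustration with several assumptions: each processor has a single run-queue rather than a set of them, the imbalance threshold of two processes is arbitrary, the destination is the least loaded processor, and the victim of a steal is the most loaded processor. The function names `balance` and `steal` are hypothetical. Note that neither function consults where a process's memory pages reside.

```python
from collections import deque
from typing import Deque, List


def balance(run_queues: List[Deque[str]], threshold: int = 2) -> bool:
    """One load balance event: move a single process if the imbalance is high.

    Returns True if a process was moved.
    """
    loads = [len(queue) for queue in run_queues]
    busiest = loads.index(max(loads))
    idlest = loads.index(min(loads))
    if loads[busiest] - loads[idlest] < threshold:
        return False
    # The process is chosen without regard to where its memory pages live.
    run_queues[idlest].append(run_queues[busiest].pop())
    return True


def steal(run_queues: List[Deque[str]], thief: int) -> bool:
    """Let an idle processor take a process from the most loaded processor."""
    if run_queues[thief]:
        return False  # only a processor with an empty run-queue steals
    victim = max(range(len(run_queues)), key=lambda i: len(run_queues[i]))
    if not run_queues[victim]:
        return False  # nothing to steal anywhere
    run_queues[thief].append(run_queues[victim].pop())
    return True
```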
Some of the conventional process scheduling policies attempt to place a process on the same processor on which it last executed. This allows for reuse of cache contents and is known as processor- or cache-affinity scheduling. In a NUMA system, it is also important that the process is close to the memory pages that it uses. The synchronized load balancing and process stealing mechanisms in conventional operating systems can result in a process being moved far away from its memory pages. Such scheduling policies can lead to performance degradation in a NUMA system.
A definite need exists for a system having an ability to adapt to changes in the memory access pattern of a process. In particular, a need exists for a system which is capable of tracking the access pattern of a process during run-time. Ideally, such a system would have a lower cost and a higher productivity than conventional systems. With a system of this type, system performance can be enhanced. A primary purpose of the present invention is to meet this need and provide further, related advantages.
An apparatus for and a method of memory-affinity process scheduling in CC-NUMA systems are disclosed. The system includes a plurality of nodes connected to a network. A plurality of processes are running on the various nodes of the system. The system further includes at least one memory-affinity counter for each executing process for each node of the system. Process scheduling begins by assigning processes to nodes. During execution, the memory-affinity counters are incremented on every memory access. At a process rescheduling interval, the memory-affinity counters are evaluated and rescheduling is performed based on a preselected policy. At a reset interval, the memory-affinity counters are adjusted to reduce the impact of older memory accesses. The resulting memory-affinity process scheduling is NUMA aware.
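The counter scheme summarized above can be sketched in software as follows. This is an illustration under stated assumptions and not the disclosed apparatus: it assumes that the counter for a given node counts accesses to memory residing on that node, that the preselected policy is to prefer the node with the highest counter (staying on the current node in case of a tie), and that the adjustment at the reset interval halves every counter. The class and method names are hypothetical.

```python
from typing import Dict, List


class MemoryAffinityCounters:
    """One counter per executing process per node (illustrative sketch)."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.counters: Dict[int, List[int]] = {}  # process id -> per-node counts

    def record_access(self, pid: int, memory_node: int) -> None:
        """Increment the counter for the node whose memory was accessed."""
        counts = self.counters.setdefault(pid, [0] * self.num_nodes)
        counts[memory_node] += 1

    def preferred_node(self, pid: int, current_node: int) -> int:
        """Assumed policy: choose the node with the highest counter."""
        counts = self.counters.get(pid)
        if counts is None:
            return current_node  # no history yet, so leave the process in place
        best = max(range(self.num_nodes), key=lambda node: counts[node])
        # Avoid a pointless move when the current node is equally good.
        if counts[current_node] == counts[best]:
            return current_node
        return best

    def age(self) -> None:
        """Assumed reset-interval adjustment: halve all counters so that
        older memory accesses weigh less than recent ones."""
        for counts in self.counters.values():
            for node in range(len(counts)):
                counts[node] >>= 1
```

For example, a process that runs on node 0 but has made five accesses to memory on node 2 and three to memory on node 0 would be a candidate for rescheduling onto node 2 under this assumed policy. Halving rather than clearing the counters keeps some history across reset intervals, which is one possible way to reduce the impact of older accesses without discarding them.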