1. Field of the Invention
This invention relates generally to the field of managing tasks that instruction processors are assigned to do within a computer system having multiple instruction processors, and particularly to allowing purposeful and effective allocation of instruction processor resources among the various tasks.
2. Background Information
Background on the problems encountered in this art of tuning multiprocessor computer systems for preferred performance is relevant to understanding the solutions described herein. In the field of multiprocessor computer systems, it can be difficult to strike the right balance of work assignments between and among the processors so that the computing tasks are accomplished in an efficient manner with a minimum of overhead for accomplishing the assigning of tasks. This appropriate balance will vary considerably depending on the needs of the system's users and to some extent upon the system architectures. User controls therefore are sometimes necessary and must be able to work appropriately with automatic load balancing that will be done by the operating system. As server consolidation programs progress, large multiprocessor computer systems appear to be becoming more prevalent and this art should become correspondingly more important.
The preferred design should not allow a majority of the available tasks to be assigned to a single processor (nor to any other small subset of all processors). If this occurs, the small subset of processors is kept too busy to accomplish all its tasks efficiently while the other processors wait relatively idle with few or no tasks to do, and the system as a whole will not operate efficiently. To be efficient, the system should therefore have a load balancing or work distribution scheme.
Multiprocessor systems are usually designed with cache memories to alleviate the imbalance between high performance processors and the relatively slow main memories. Cache memories are physically closer to their processors and so can be accessed more quickly than main memory. They are managed by the system's hardware and they contain copies of recently accessed locations in main memory. Typically, a multiprocessor system includes small, very fast, private cache memories adjacent to each processor, and larger, slower cache memories that may be either private or shared by a subset of the system's processors. The performance of a processor executing a software application depends on whether the application's memory locations have been cached by the processor, or are still in memory, or are in a close-by or remote processor's cache memory. To take advantage of cache memory (which provides quicker access to data because of its proximity to individual processors or groups of processors), the design should also assign tasks based on affinity with the processor or processor group most likely to have the needed data already in its local cache memory or memories. As is understood in this art, where a processor has acted on part of a problem (loading a program, running a transaction, or the like), it is likely to reuse the same data or instructions present in its local cache, because these will be found in the local cache once the problem is begun. By affinity we mean that a task, having executed on a processor, will tend to execute next on that same processor or a processor with fast access to the cached data. (Tasks begun may not complete due to a hardware interrupt or for various other well-understood reasons not relevant to our discussion).
Where more than one processor shares a cache or a cache neighborhood, the design for affinity assignment could become complicated, and complexity can be costly, so the preferred design should be simple. (A group of processors that has been, as it is generally put, “affinitized” is referred to herein as a “Processor Group”).
(Language in the computer arts is sometimes confusing as similar terms mean different things to different people and even to the same people in different contexts. Here we use the word “task” as indicating a process. Tasks are often thought of as consisting of multiple independent threads of control any of which could be assigned to different processor groups, but we will use the word task more simply, referring to a single process when we use the word).
These two goals, affinity and load balancing, seem to be in conflict. Permanently retaining task affinity could lead to overloading some processors or groups of processors. Redistributing tasks to processors to which they have no affinity will yield few cache hits and slow down the processing overall. These problems get worse as the size of the multiprocessor computer systems gets larger.
Typically, computer systems use switching queues and associated algorithms for controlling the assignment of tasks to processors. Typically, these algorithms are considered an Operating System (OS) function. When a processor “wants” (is ready for) a new task, it will execute the (usually) re-entrant code that embodies the algorithm that examines the switching queue. (This code is commonly called a “dispatcher.”) It will determine the next task to do on the switching queue and do it. However, while it is determining which task to do, other processors that share the switching queue may be waiting for access to the switching queue, which the first processor will have locked in order to do the needed determination (using the dispatcher code).
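The contention described above can be illustrated with a minimal sketch. It is not taken from any particular operating system; the class name `SwitchingQueue`, the `(priority, name)` task tuples, and the highest-priority-first selection rule are all assumptions made for illustration. The point it shows is that the lock is held for the whole selection step, so any other processor sharing the queue must wait.

```python
import threading
from collections import deque


class SwitchingQueue:
    """Hypothetical switching queue shared by one or more processors."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = deque()  # entries are (priority, name) tuples

    def enqueue(self, task):
        with self._lock:
            self._tasks.append(task)

    def dispatch(self):
        """Dispatcher code: pick and remove the next task, or return None.

        The lock is held while the queue is examined, so other
        processors calling dispatch() on this queue block until the
        selection finishes.
        """
        with self._lock:
            if not self._tasks:
                return None
            # Assumed policy: highest priority first; max() returns the
            # first maximal entry, so ties go to the earliest-queued task.
            best = max(self._tasks, key=lambda t: t[0])
            self._tasks.remove(best)
            return best
```

The longer the selection step takes, the longer the other processors sharing the queue sit idle, which is why the cost of the dispatcher itself matters.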
Accordingly, there is a great need for efficient dispatcher programs and algorithmic solutions for this activity in multiprocessor computer systems.
The inventors herein have developed some solutions to these and other problems in this field, which are described in detail in U.S. patent applications with Ser. Nos. 09/920,023 and 09/038,547 (both being incorporated herein in their entireties by this reference), but which still leave room for improvement.
In summary, these prior documents describe the use of one switching queue per processor to minimize the overheads of task selection in an environment that supports task affinity. Load balancing is addressed partially by the use of round-robin scheduling mechanisms when a task is created. The task is allocated to the next idle processor in the system, if there is one, and to the next busy processor if not. To balance the load of existing tasks across the processors, the OS keeps track of how busy each processor is, averaged over some period of time (generally a fraction of a second). If a processor becomes idle and has an empty switching queue then it may look to see if it should or can “steal” a task from another processor's queue. (The stealing processor then executes the “stolen” task). The stealing decision is based on thresholds. For stealing to occur at all, the other processor (being potentially stolen from) must be busier than a threshold value. An idle processor may then steal from another's queue if the other is busier on average and the difference in relative busyness exceeds a further threshold. That further threshold depends on the cache neighborhood the two processors share. If the two share a small cache neighborhood (for example, on the same bus), then the overhead of switching the task to that processor is low, so the threshold is set correspondingly low. For processors in the larger cache neighborhoods (for example, on different crossbars), the thresholds are higher. The intent is to balance the system load while minimizing the overhead of fetching cached data from remote cache memories.
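The two-threshold stealing decision summarized above can be sketched as follows. This is an illustrative reading of the summary, not the referenced applications' implementation: the `Processor` fields, the three neighborhood levels ("bus", "crossbar", "system"), and every numeric threshold are hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    ident: int
    busyness: float    # average utilization over a recent interval, 0.0 to 1.0
    queue_length: int  # tasks waiting on this processor's switching queue
    bus: int           # hypothetical topology identifiers
    crossbar: int


# Hypothetical values: a victim must be at least this busy to be stolen from.
MIN_VICTIM_BUSYNESS = 0.5
# Hypothetical values: required busyness difference grows with neighborhood size.
DIFF_THRESHOLD = {"bus": 0.1, "crossbar": 0.2, "system": 0.4}


def shared_neighborhood(a, b):
    """Return the smallest cache neighborhood containing both processors."""
    if a.crossbar == b.crossbar and a.bus == b.bus:
        return "bus"
    if a.crossbar == b.crossbar:
        return "crossbar"
    return "system"


def may_steal(thief, victim):
    """Decide whether an idle processor may steal a task from victim's queue."""
    if thief.queue_length != 0 or victim.queue_length == 0:
        return False  # thief is not idle, or there is nothing to steal
    if victim.busyness <= MIN_VICTIM_BUSYNESS:
        return False  # first threshold: victim is not busy enough
    difference = victim.busyness - thief.busyness
    # Second threshold: depends on how far apart the two processors are.
    return difference > DIFF_THRESHOLD[shared_neighborhood(thief, victim)]
```

With these assumed numbers, an idle processor steals readily from a busy neighbor on its own bus but leaves an equally busy processor on another crossbar alone, reflecting the higher cost of fetching cached data remotely.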
The inventions described in this document seek to further optimize the performance of the system's dispatching and affinity mechanisms for the user's applications. We address the problem of balancing the needs of performance (through affinity) with those of enforcing the priorities of task execution. (Task execution priorities are set by a user and a system should respond positively to such user needs). We also seek to further improve the performance of the user's applications by locating shared written-to data in a cache where it can be most efficiently accessed by all the sharing tasks.
(Just for clarity of explanation and simplicity, we use the all-inclusive term “user” throughout this document, generally to define a privileged actor who has rights through the security barriers of the computer system and the OS to perform tasks of the nature and risk to the system as the utility functions we describe herein. It may be one or several individuals or even a program that has such privileges and performs such functions. We also, where appropriate, may refer to this “user” as a systems administrator.)
Priority
There is a natural conflict between the requirements of (execution) priority and efficiency. For example, the most efficient task to execute, one that recently ran on a processor and so has data residue in the memory cache, may be of lower priority than other less efficient alternatives. Some users, possibly just for specific applications or tasks, may prefer the efficiencies of affinity to the rigorous application of priority because the system throughput may be more important than the priority of a given set of tasks. Others may construct applications that are dependent on the rigorous application of priority for proper operation. Use of the invention of this patent permits the user to profitably tailor the relationship between affinity and priority within a computer system to meet this range of concerns.
Data Sharing
In a system with many caches, some local to individual processors and some shared between processors, performance can be considerably enhanced by ensuring that shared data which is frequently updated is located in caches that can be efficiently accessed by all the tasks that share it. Our invention gives the user the tools for identifying the sharing tasks and by using such tools, conveniently confining the execution of these sharing tasks to the processors that are in the neighborhood of such a cache.
The general terminology used with this document assumes some basic pattern of computer organization and can be illustrated in a chart as follows.
SYSTEM HARDWARE DESIGN
        Cache Neighborhoods determined by: Instruction Processors (IPs), Caches, System Topology, & Cache/memory access times.
USER'S SYSTEM CONFIGURATION
        Instruction Processors & Caches, installed and enabled.
        (THIS IS DYNAMICALLY VARIABLE)
PROCESSOR GROUPS
        User's system's processors within a cache neighborhood.
SWITCHING QUEUES
        Queues of tasks waiting to execute on one or more processors.
DATA SHARING GROUPS
        Groups of programs that are likely to share memory-resident data.
USER'S APPLICATION MIX
        A mix of applications with different degrees of importance and varying data-sharing habits.
        (THIS IS DYNAMICALLY VARIABLE)
Background Chart
For any instance of a system, the system's hardware design (as shown in the Chart above) is fixed. A given architecture has a maximum number of processors and private and shared caches. The system's topology and hardware components result in a set of access times between processors, caches, and the memory. These design features should be expressed as a configuration message to any affinity algorithms. A given collection of processors, their internal private caches, and any shared caches is referred to as a “cache neighborhood”.
The user's system configuration (as shown in the Chart) will usually be a subset of the maximum designed configuration. The physical configuration in use may also be dynamically changed while the system is running, by the operator or automatically, and may involve adding or removing processors and caches. The effective performance of the processors and the system will change as cache neighborhoods expand and contract or processors are added and removed.
The performance characteristics of the user's application mix (as shown in the Chart) depend on the applications' data sharing characteristics and the caching neighborhoods in which they execute. To maintain cache coherency for a unit of data that is (or may be) updated by multiple tasks running on multiple processors, only one processor is allowed to update the data at a time. The next processor that requires the data may have to fetch it from another processor's cache. This fetching is much faster if the processor is close by. Hence there is advantage in confining all the tasks that share some data to the processors of a small cache neighborhood. In our invention described in the above referenced application Ser. No. 09/920,023 the selection of a processor for a new task uses a round-robin technique to help in balancing the system's load across all processors, so the tasks that share an application's data may be spread over a wide cache neighborhood. The current invention seeks to optimize application performance by scheduling the application's tasks onto the smallest sufficient cache neighborhood. This optimization requires information from the user, as only the user is aware of the application's data sharing characteristics, its importance to the user, its need for processor power, its interactions with other applications, and the relationship between applications and tasks. The application load and mix may also vary significantly with time or day, so the user may also need to vary the supplied application information accordingly, either as contingencies arise or in advance for known load variations.
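The idea of choosing the smallest sufficient cache neighborhood can be sketched under simple assumptions. The function name, the representation of neighborhoods as named sets of enabled processor identifiers, and the use of a user-supplied processor count as the measure of sufficiency are all hypothetical; the passage above does not specify how sufficiency is judged.

```python
def smallest_sufficient_neighborhood(neighborhoods, processors_needed):
    """Pick the smallest neighborhood with enough enabled processors.

    neighborhoods: dict mapping a neighborhood name to the set of enabled
        processor ids it contains (assumed representation).
    processors_needed: processor count supplied by the user for the
        data-sharing application (assumed measure of "sufficient").
    Returns the chosen name, or None if no neighborhood is large enough.
    """
    candidates = [
        (len(members), name)
        for name, members in neighborhoods.items()
        if len(members) >= processors_needed
    ]
    if not candidates:
        return None
    # Smallest size wins; equal sizes are broken by name for determinism.
    return min(candidates)[1]


# Example topology: two buses on one crossbar, inside an 8-processor system.
example = {
    "bus0": {0, 1},
    "bus1": {2, 3},
    "crossbar0": {0, 1, 2, 3},
    "system": {0, 1, 2, 3, 4, 5, 6, 7},
}
```

Because the installed configuration can change while the system runs, a selection like this would need to be re-evaluated whenever processors or caches are added or removed.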
An appropriate solution therefore should give the user the tools for optimizing the performance of the user's application mix. The tools should competently deal with a dynamically changing application mix and a dynamically changing system hardware configuration.
Also, giving the user some control over allocation of instruction processor resources is generally a positive thing, if the operating system has some override capacity to protect itself and some of its built-in efficiencies. Microsoft has also provided for some limited and static instances a method of dedicating processor-program affinity to individual processors or groups of processors, but to date, their system lacks any intelligent automatic oversight, enabling user control to actually hurt system performance if not done correctly. {See the document at http://support.microsoft.com/directory/article.asp?ID=KB;EN-US;Q299641, indicating how this is accomplished for SQL or find additional related background at http://www.microsoft.com/hwdev/platform/proc/SRAT.asp, which describes the Microsoft Static Resource Affinity Table. (This second site is available under click-through license to Microsoft only, and was last known updated on Jan. 15, 2002).}
Overall, if the user can modify performance of the system to meet the user's day-to-day needs, the computer system will be better accepted. However, these user-initiated changes need to take into consideration the architectural features of the computer system and their limitations in order to be effective.
Accordingly, the inventors herein describe a computer system improvement directed primarily to improving the efficiency and usability of multiprocessor computer systems having affinity dispatchers.