1. Related Applications
This application is related to patent applications Ser. Nos. 10/231,618 and 10/232,199 (filed on even date herewith), which have substantially the same disclosure but claim different inventions. These different inventions solve related problems in different ways, and both applications are incorporated herein by this reference.
2. Field of the Invention
This invention relates generally to the field of managing the tasks that instruction processors are assigned to do within a computer system having multiple instruction processors, and particularly to allowing purposeful and efficacious allocation of instruction processor resources among the various tasks.
3. Background Information
Background on the problems encountered in this art of tuning multiprocessor computer systems for preferred performance is relevant to understanding the solutions described herein. In the field of multiprocessor computer systems, it can be difficult to strike the right balance of work assignments between and among the processors so that the computing tasks are accomplished in an efficient manner with a minimum of overhead for accomplishing the assigning of tasks. This appropriate balance will vary considerably depending on the needs of the system's users and to some extent upon the system architectures. User controls therefore are sometimes necessary and must be able to work appropriately with automatic load balancing that will be done by the operating system. As server consolidation programs progress, large multiprocessor computer systems appear to be becoming more prevalent and this art should become correspondingly more important.
The preferred design should not allow a majority of the available tasks to be assigned to a single processor (nor to any other small subset of all processors). If this occurs, the small subset of processors is kept too busy to accomplish all its tasks efficiently while the other processors wait relatively idle with few or no tasks to do, and the system as a whole will not operate efficiently. To be efficient, the system should therefore have a load balancing or work distribution scheme.
Multiprocessor systems are usually designed with cache memories to alleviate the imbalance between high performance processors and the relatively slow main memories. Cache memories are physically closer to their processors and so can be accessed more quickly than main memory. They are managed by the system's hardware and they contain copies of recently accessed locations in main memory. Typically, a multiprocessor system includes small, very fast, private cache memories adjacent to each processor, and larger, slower cache memories that may be either private or shared by a subset of the system's processors. The performance of a processor executing a software application depends on whether the application's memory locations have been cached by the processor, or are still in memory, or are in a close-by or remote processor's cache memory. To take advantage of cache memory (which provides quicker access to data because of its proximity to individual processors or groups of processors), the design should also assign tasks based on affinity with the processor or processor group most likely to have the needed data already in its local cache memory or memories. As is understood in this art, where a processor has acted on part of a problem (loading a program, running a transaction, or the like), it is likely to reuse the same data or instructions, and these will be found in its local cache once the problem is begun. By affinity we mean that a task, having executed on a processor, will tend to execute next on that same processor or on a processor with fast access to the cached data. (Tasks begun may not complete due to a hardware interrupt or for various other well-understood reasons not relevant to our discussion.)
Where more than one processor shares a cache or a cache neighborhood, the design for affinity assignment could become complicated, and complexity can be costly, so the preferred design should be simple. (We refer to a group of processors that have been, in the generally used term, “affinitized” as a “Processor Group.”)
(Language in the computer arts is sometimes confusing, as similar terms mean different things to different people and even to the same people in different contexts. Tasks are often thought of as consisting of multiple independent threads of control, any of which could be assigned to different processor groups, but here we use the word “task” more simply, to refer to a single process.)
These two goals, affinity and load balancing, seem to be in conflict. Permanently retaining task affinity could lead to overloading some processors or groups of processors. Redistributing tasks to processors to which they have no affinity will yield few cache hits and slow down the processing overall. These problems get worse as the size of the multiprocessor computer systems gets larger.
Typically, computer systems use switching queues and associated algorithms for controlling the assignment of tasks to processors. These algorithms are generally considered an Operating System (OS) function. When a processor “wants” (is ready for) a new task, it will execute the (usually) re-entrant code that embodies the algorithm that examines the switching queue. (This code is commonly called a “dispatcher.”) The dispatcher determines the next task on the switching queue, and the processor then executes that task. However, while it is determining which task to do, other processors that share the switching queue may be waiting for access to the switching queue, which the first processor will have locked in order to do the needed determination (using the dispatcher code).
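The contention described above can be illustrated with a minimal sketch in Python. It is not the dispatcher of any particular operating system: the class name SwitchingQueue, its methods, and the convention that a lower number means a higher priority are all assumptions made for illustration. The point shown is only that every processor sharing the queue must hold the same lock while the next task is selected.

```python
import heapq
import threading


class SwitchingQueue:
    """Hypothetical shared switching queue guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._heap = []     # entries are (priority, sequence, task_name)
        self._sequence = 0  # tie-breaker: equal priorities stay first-in, first-out

    def add_task(self, priority, task_name):
        # Assumption: a lower number means a higher priority.
        with self._lock:
            heapq.heappush(self._heap, (priority, self._sequence, task_name))
            self._sequence += 1

    def dispatch(self):
        """Return the next task to run, or None if the queue is empty.

        Any other processor calling dispatch() on the same queue waits on
        the lock until this selection has finished.
        """
        with self._lock:
            if not self._heap:
                return None
            _, _, task_name = heapq.heappop(self._heap)
            return task_name
```

Because the lock is held for the whole selection, the time one processor spends choosing a task is time the other sharing processors spend waiting, which is why a short selection path (or fewer processors per queue) reduces dispatching overhead.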
Accordingly, there is a great need for efficient dispatcher programs and algorithmic solutions for this activity in multiprocessor computer systems.
The inventors herein have developed some solutions to these and other problems in this field, which are described in detail in U.S. patent applications with Ser. Nos. 09/920,023 and 09/038,547 (both being incorporated herein in their entireties by this reference), but which still leave room for improvement.
In summary, these prior documents describe the use of one switching queue per processor to minimize the overheads of task selection in an environment that supports task affinity. Load balancing is addressed partially by the use of round-robin scheduling mechanisms when a task is created. The task is allocated to the next idle processor in the system, if there is one, and to the next busy processor if not. To balance the load of existing tasks across the processors, the OS keeps track of how busy each processor is, averaged over some period of time (generally a fraction of a second). If a processor becomes idle and has an empty switching queue then it may look to see if it should or can “steal” a task from another processor's queue. (The stealing processor then executes the “stolen” task). The stealing decision is based on thresholds. For stealing to occur at all, the other processor (being potentially stolen from) must be busier than a threshold value. An idle processor may then steal from another's queue if the other is busier on average and the difference in relative busyness exceeds a further threshold. That threshold value depends on the cache neighborhood they share. If the two share a small cache neighborhood (for example, on the same bus), then the overhead of switching the task to that processor is low, so the threshold is set correspondingly low. For processors in the larger cache neighborhoods, (for example, on different crossbars) the thresholds are higher. The intent is to balance the system load while minimizing the overhead of fetching cached data from remote cache memories.
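The threshold-based stealing decision summarized above can be sketched as follows. This is an illustration under stated assumptions and not the implementation of the referenced applications: the Processor record, the three neighborhood levels, the function names, and every numeric threshold are hypothetical, and "busyness" is taken to be a time-averaged fraction between 0.0 and 1.0. The caller is assumed to be a processor that has just become idle.

```python
from dataclasses import dataclass, field

# Hypothetical values; a real system would tune these.
MIN_VICTIM_BUSYNESS = 0.50
DIFFERENCE_THRESHOLD = {
    "same_bus": 0.10,            # small cache neighborhood: cheap to move a task
    "same_crossbar": 0.25,
    "different_crossbar": 0.40,  # large cache neighborhood: costly to move a task
}


@dataclass
class Processor:
    name: str
    bus: int
    crossbar: int
    busyness: float                            # averaged fraction of time busy
    queue: list = field(default_factory=list)  # this processor's switching queue


def shared_neighborhood(a, b):
    """Name the smallest cache neighborhood that both processors share."""
    if a.crossbar == b.crossbar:
        return "same_bus" if a.bus == b.bus else "same_crossbar"
    return "different_crossbar"


def may_steal(thief, victim):
    """Decide whether an idle thief may take a task from the victim's queue."""
    if thief.queue or not victim.queue:
        return False  # thief still has work, or there is nothing to steal
    if victim.busyness <= MIN_VICTIM_BUSYNESS:
        return False  # victim is not busy enough for stealing to occur at all
    threshold = DIFFERENCE_THRESHOLD[shared_neighborhood(thief, victim)]
    return victim.busyness - thief.busyness > threshold
```

With these assumed values, an idle processor averaging 0.60 busy may steal from a 0.75-busy processor on the same bus (the difference of 0.15 exceeds 0.10) but not from an equally busy processor on a different crossbar (0.15 does not exceed 0.40).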
The inventions described in this document seek to further optimize the performance of the system's dispatching and affinity mechanisms for the user's applications. We address the problem of balancing the needs of performance (through affinity) with those of enforcing the priorities of task execution. (Task execution priorities are set by a user and a system should respond positively to such user needs). We also seek to further improve the performance of the user's applications by locating shared written-to data in a cache where it can be most efficiently accessed by all the sharing tasks.
(For clarity and simplicity of explanation, we use the all-inclusive term “user” throughout this document, generally to denote a privileged actor who has rights through the security barriers of the computer system and the OS to perform tasks of the same nature and risk to the system as the utility functions we describe herein. The user may be one or several individuals or even a program that has such privileges and performs such functions. Where appropriate, we may also refer to this “user” as a systems administrator.)
Priority
There is a natural conflict between the requirements of (execution) priority and efficiency. For example, the most efficient task to execute, one that recently ran on a processor and so has data residue in the memory cache, may be of lower priority than other less efficient alternatives. Some users, possibly just for specific applications or tasks, may prefer the efficiencies of affinity to the rigorous application of priority because the system throughput may be more important than the priority of a given set of tasks. Others may construct applications that are dependent on the rigorous application of priority for proper operation. Use of the invention of this patent permits the user to profitably tailor the relationship between affinity and priority within a computer system to meet this range of concerns.
Data Sharing
In a system with many caches, some local to individual processors and some shared between processors, performance can be considerably enhanced by ensuring that shared data which is frequently updated is located in caches that can be efficiently accessed by all the tasks that share it. Our invention gives the user tools for identifying the sharing tasks and, by using such tools, for conveniently confining the execution of these sharing tasks to the processors that are in the neighborhood of such a cache.
The general terminology used in this document assumes some basic pattern of computer organization and can be illustrated in a chart as follows.