Historically, a major goal of multiprocessor (MP) computers has been to use the processors collectively to reduce the latency of jobs. This is particularly true of computationally intensive jobs in scientific applications, which can often be parallelized across the multiple processors very efficiently. Even in commercial MP servers, much effort has been expended to make operating systems and applications "multi-threaded" so as to parallelize work across multiple processors, decreasing the latency of the computations. However, for MP computers serving hundreds or thousands of simultaneous users, the vast majority of jobs are not computationally intensive, and the throughput of the server, as well as the latency of the individual jobs, can best be improved by giving each individual job "affinity" to a given CPU, so that all of the instructions and data associated with the individual job can remain readily accessible in a local cache memory.
One such method of scheduling processes is disclosed in U.S. Pat. No. 5,185,861 to Valencia, issued Feb. 9, 1993, and entitled "Cache Affinity Scheduler". In that patent, an affinity scheduler for a multiprocessor computer system is disclosed. The affinity scheduler allocates processors to processes or jobs and schedules the processes to run on the basis of priority and processor availability. The scheduler uses the estimated amount of cache context to decide on which run queue a process is to be enqueued. U.S. Pat. No. 5,185,861 is hereby incorporated by reference. Hereafter in this description, the method for allocating processors to processes will be referred to as a "process affinity scheduler", indicating that it is the user process (or "job" or "thread") that is given affinity to a processor.
It is believed that additional performance benefits can be gained by expanding the existing process affinity scheduler methods to encompass the idea of "data affinity scheduling", wherein data is given affinity to a processor and jobs or processes that are manipulating that data extensively should be run on that processor. To appreciate this performance opportunity, it must first be noted that, in order to maintain memory coherency between the processors and various cache memories, only one processor within a multiprocessor system can be modifying a shared memory location at any given time. When more than one user job is manipulating the same data, the data modified by a first job affinitized to a first processor (CPU A) must be transferred, following modification, to a second job affinitized to a second processor (CPU B). This transfer can be time consuming, depending on the communication mechanism between the processors, the amount of data to be transferred, and the contention for the communication resource between the processors.
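By way of illustration only, the notion of giving data, rather than a process, affinity to a processor may be sketched as follows. The sketch is a minimal model and not a disclosed implementation: the class name DataAffinityScheduler, the round-robin choice of a home CPU for previously unseen data, and the string identifiers for jobs and data are all assumptions introduced for the example.

```python
from collections import defaultdict


class DataAffinityScheduler:
    """Toy model: each data item has a home CPU, and jobs follow their data.

    Hypothetical names throughout; the round-robin placement of new data
    is an assumption made only to keep the example small.
    """

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.data_home = {}                  # data id -> home CPU index
        self.run_queues = defaultdict(list)  # CPU index -> queued jobs
        self._next_cpu = 0                   # round-robin cursor (assumption)

    def home_cpu(self, data_id):
        """Return the CPU the data is affinitized to, assigning one if new."""
        if data_id not in self.data_home:
            self.data_home[data_id] = self._next_cpu
            self._next_cpu = (self._next_cpu + 1) % self.num_cpus
        return self.data_home[data_id]

    def enqueue(self, job, data_id):
        """Queue the job on the run queue of the CPU that holds its data."""
        cpu = self.home_cpu(data_id)
        self.run_queues[cpu].append(job)
        return cpu


sched = DataAffinityScheduler(num_cpus=4)
cpu_a = sched.enqueue("job1", "accounts")
cpu_b = sched.enqueue("job2", "accounts")  # same data, so same CPU as job1
cpu_c = sched.enqueue("job3", "orders")    # different data, different CPU here
```

In this model, two jobs that manipulate the same data are placed on the same run queue, so the data need not be transferred between CPU A and CPU B after each modification.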
The assertion that data affinity increases performance, at least in certain applications, may seem counter-intuitive because computations on data that were previously done in parallel are now serialized. However, as the number of processors in a system increases, so does the communication and synchronization overhead between the processors. This overhead increase is independent of whether the communication mechanism is a shared bus, such as that used by shared memory multiprocessor systems, or multiple interconnection paths, such as those used by massively parallel processing (MPP) computers, because the contention for the interconnection path increases as the number of processors and the number of simultaneous users increase. Higher bus contention results in longer access latencies for transferring data blocks between processors. In addition, when the processes need to synchronize execution via a semaphore operation, the overhead of acquiring and releasing a lock cell is much greater when its memory cell is being "thrashed" between two processors. By running both synchronizing processes on the same processor, not only is greater performance obtained by caching the semaphore locally, but a code-path reduction is achieved since it is now impossible for both processes to be attempting to acquire the lock simultaneously. It is therefore believed that, for certain classes of jobs, there is a cross-over point where the latency of jobs will be shorter by running them in serial on a processor where they can share the common data and synchronization structures locally than it would be by running the jobs in parallel on different processors where the application and synchronization data must be transferred over the interconnection path between the processors.
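The cross-over point may be illustrated with a simple latency model. The model below is a sketch under stated assumptions and is not taken from any measurement: it assumes two jobs, assumes that every hand-off of shared data lies on the critical path of the parallel case, and uses hypothetical parameter names (handoffs, transfer_cost, remote_lock_cost, local_lock_cost) with arbitrary time units.

```python
def parallel_latency(work_a, work_b, handoffs, transfer_cost, remote_lock_cost):
    """Two jobs on different CPUs: the work overlaps, but each hand-off of
    shared data pays an interconnect transfer plus a remote lock access.
    Assumes every hand-off is on the critical path (a simplification)."""
    return max(work_a, work_b) + handoffs * (transfer_cost + remote_lock_cost)


def serial_latency(work_a, work_b, handoffs, local_lock_cost):
    """Two jobs on one CPU: the work is serialized, but shared data and the
    lock stay in the local cache, so each hand-off pays only a local cost."""
    return work_a + work_b + handoffs * local_lock_cost


def crossover_handoffs(work_a, work_b, transfer_cost, remote_lock_cost,
                       local_lock_cost):
    """Number of hand-offs above which the serial arrangement is faster.
    Returns None when remote hand-offs are no costlier than local ones,
    in which case the parallel arrangement is never slower in this model."""
    extra_per_handoff = transfer_cost + remote_lock_cost - local_lock_cost
    if extra_per_handoff <= 0:
        return None
    return min(work_a, work_b) / extra_per_handoff


# Illustrative figures only (arbitrary time units, assumed values).
few = (parallel_latency(100, 100, 2, 5, 3), serial_latency(100, 100, 2, 1))
many = (parallel_latency(100, 100, 50, 5, 3), serial_latency(100, 100, 50, 1))
threshold = crossover_handoffs(100, 100, 5, 3, 1)
```

Under these assumed figures, the parallel arrangement has the lower latency when the jobs share data only twice, while the serial arrangement has the lower latency when they share data fifty times. The threshold lies where the extra cost of the remote hand-offs equals the work that parallel execution would have overlapped.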