FIG. 1 is a block diagram of major subsystems of a prior art symmetrical multi-processor computing system 10. Examples of system 10 are models S27 and S81 of the Symmetry Series manufactured by Sequent Computer Systems, Inc., of Beaverton, Oreg., the assignee of the present patent application. The Symmetry Series models employ a UNIX operating system with software written in the C programming language. The UNIX operating system, which is well known to those skilled in the art, is discussed in M. Bach, The Design of the UNIX Operating System, Prentice Hall, 1986. The C programming language, which is also well known to those skilled in the art, is described in B. Kernighan and D. Ritchie, The C Programming Language, 2d Ed., Prentice Hall, 1988.
Referring to FIG. 1, system 10 includes N number of computing engines denominated engine 1.sub.P, engine 2.sub.P, engine 3.sub.P, . . . , engine N.sub.P (collectively "engines 1.sub.P -N.sub.P "). Each one of engines 1.sub.P -N.sub.P is a hardware computer which includes a microprocessor such as an Intel 386 and associated circuitry. System 10 is called a symmetrical multi-processor computing system because each one of engines 1.sub.P -N.sub.P has equal control over system 10 as a whole.
Each one of engines 1.sub.P -N.sub.P has a local cache memory denominated cache 1.sub.P, cache 2.sub.P, cache 3.sub.P, . . . , cache N.sub.P, respectively (collectively "caches 1.sub.P -N.sub.P "). The purpose of a cache memory is to provide high-speed access to, and storage of, data associated with processes performed by an engine. A system bus 12.sub.P joins engines 1.sub.P -N.sub.P to a main RAM memory 14.sub.P of system 10. Data stored in one of cache memories 1.sub.P -N.sub.P can originate from the corresponding engine, the cache memory of another engine, main memory 14.sub.P, a hard disk controlled by a disk controller 18, or an external source such as terminals through a communications controller 20.
Cache memories 1.sub.P -N.sub.P are each organized as a pseudo least-recently-used (LRU) set associative memory. As new data are stored in one of the cache memories, previously stored data are pushed down the cache memory until the data are pushed out of the cache memory and lost. Of course, the data can be copied to main memory 14.sub.P or another cache memory before the data are pushed out of the cache memory. The cache context of a process with respect to an engine "erodes" as data associated with the process are pushed out of the cache memory of the engine.
A scheduler determines which engine will run a process, with a highest priority process running first. On a multi-processor system, the concept of priority is extended to run the n highest priority processes on the m engines available, where m=n unless some of the engines are idle. In system 10, the scheduler is a software function carried out by the UNIX operating system that allocates engines to processes and schedules them to run on the basis of priority and engine availability. The scheduler uses three distinct data structures to schedule processes: (1) a global run queue 34 (FIG. 2), (2) a floating point accelerator (FPA) global run queue, and (3) an engine affinity run queue 38 (FIG. 3) for each engine. The FPA global run queue is the same as global run queue 34 except that the FPA global run queue may queue processes requiring FPA hardware.
FIG. 2 illustrates the prior art global run queue 34 used by system 10. Referring to FIG. 2, global run queue 34 includes an array qs and a bit mask whichqs. Array qs comprises 32 pairs (slots) of 32-bit words, each pair pointing to one linked list. The organization of the processes is defined by data structures. Each slot includes a ph_link field and a ph_rlink field, which contain address pointers to the first address of the first process and the last address of the last process, respectively, in a doubly circularly linked list of queued processes. The 32 slots are arranged in priority from 0 to 31, as listed at the left side of FIG. 2, with priority 0 being the highest priority.
The bit mask whichqs indicates which slots in the array qs contain processes. When a slot in qs contains a process, the whichqs bit for that slot contains a "1". Otherwise, the whichqs bit for that slot contains a "0". When an engine looks for a process to run, the engine finds the highest priority whichqs bit that contains a 1, and dequeues (i.e., detaches) a process from the corresponding slot in qs.
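The relationship between qs and whichqs can be sketched in C as follows. This is a minimal illustration rather than the actual Symmetry Series source: the structure names (proc, prochd, runq), the p_link, p_rlink, and p_pri fields, and the function names are hypothetical, and the sketch assumes that bit i of whichqs corresponds to slot i, so that the lowest numbered set bit marks the highest priority non-empty slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NQS 32 /* number of priority slots; slot 0 is the highest priority */

/* Hypothetical process descriptor: only the fields needed for queueing. */
struct proc {
    struct proc *p_link;  /* next process in the same slot */
    struct proc *p_rlink; /* previous process in the same slot */
    int p_pri;            /* priority, 0 (highest) to 31 (lowest) */
};

/* One slot of qs: pointers to the first and last queued process. */
struct prochd {
    struct proc *ph_link;
    struct proc *ph_rlink;
};

struct runq {
    struct prochd qs[NQS];
    uint32_t whichqs; /* assumption: bit i is 1 exactly when qs[i] is non-empty */
};

void runq_init(struct runq *rq)
{
    for (int i = 0; i < NQS; i++)
        rq->qs[i].ph_link = rq->qs[i].ph_rlink = NULL;
    rq->whichqs = 0;
}

/* Append p to the tail of the list for its priority and set its whichqs bit. */
void runq_enqueue(struct runq *rq, struct proc *p)
{
    assert(p->p_pri >= 0 && p->p_pri < NQS);
    struct prochd *hd = &rq->qs[p->p_pri];

    if (hd->ph_link == NULL) {          /* empty slot: p links to itself */
        p->p_link = p->p_rlink = p;
        hd->ph_link = p;
    } else {                            /* splice p between last and first */
        p->p_link = hd->ph_link;
        p->p_rlink = hd->ph_rlink;
        hd->ph_rlink->p_link = p;
        hd->ph_link->p_rlink = p;
    }
    hd->ph_rlink = p;
    rq->whichqs |= UINT32_C(1) << p->p_pri;
}

/* Detach and return the first process of the highest priority non-empty slot. */
struct proc *runq_dequeue(struct runq *rq)
{
    if (rq->whichqs == 0)
        return NULL;                    /* nothing is queued */

    int pri = 0;
    while ((rq->whichqs & (UINT32_C(1) << pri)) == 0)
        pri++;                          /* lowest set bit = highest priority */

    struct prochd *hd = &rq->qs[pri];
    struct proc *p = hd->ph_link;

    if (p->p_link == p) {               /* p was the only entry in the slot */
        hd->ph_link = hd->ph_rlink = NULL;
        rq->whichqs &= ~(UINT32_C(1) << pri);
    } else {                            /* unlink p; its successor becomes first */
        p->p_link->p_rlink = p->p_rlink;
        p->p_rlink->p_link = p->p_link;
        hd->ph_link = p->p_link;
    }
    p->p_link = p->p_rlink = NULL;
    return p;
}
```

In this sketch the whichqs bit is cleared only when the last process leaves a slot, which is why an update of the bit mask is needed only some of the time.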
FIG. 3 illustrates the affinity run queue 38 used by system 10. Under default conditions, after a process becomes runnable, it is enqueued (i.e., joined) to either global run queue 34 or the FPA global run queue, rather than to an affinity run queue 38. However, a process may be "hard affinitied" to run only on a particular engine. In that case, the process is enqueued only to the affinity run queue 38 for that engine (rather than a global run queue) until the hard affinity condition is ended. Of course, there are times when a hard affinitied process is not queued to any run queue, for example, when it is asleep.
Referring to FIG. 3, each one of engines 1.sub.P -N.sub.P is associated with its own affinity run queue 38. Each affinity run queue 38 has an e_head field and an e_tail field, which contain the address pointers to the first and last address, respectively, of the first and last processes in a doubly circularly linked list of affinitied processes. When a process is hard affinitied to an engine, it is enqueued in FIFO manner to the doubly circularly linked list of the affinity run queue 38 that corresponds to the engine. The FIFO arrangement of each linked list is illustrated in FIG. 3. Each of the linked lists of the slots of global run queue 34, shown in FIG. 2, has the same arrangement as the linked list shown in FIG. 3.
Affinity run queue 38 differs from global run queue 34 in the following respects. First, global run queue 34 has 32 slots and can, therefore, accommodate 32 linked lists. By contrast, each affinity run queue 38 has only one slot and one linked list. Second, as a consequence of each affinity run queue 38 having only one linked list, the particular engine corresponding to the affinity run queue 38 is limited to taking only the process at the head of the linked list, even though there may be processes having higher priority in the interior of the linked list. Third, the linked list of processes must be emptied before the engine can look to a global run queue for additional processes to run. Accordingly, affinity run queue 38 does not have a priority structure and runs processes in round robin fashion. Fourth, as noted above, only hard affinitied processes are enqueued to affinity run queue 38.
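The single-list behavior of an affinity run queue can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the actual implementation: the names affq, affq_enqueue, and affq_dequeue and the proc fields are hypothetical. The point it shows is that the head of the list is always taken, whatever priorities are queued behind it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process descriptor. */
struct proc {
    struct proc *p_link;  /* next process in the list */
    struct proc *p_rlink; /* previous process in the list */
    int p_pri;            /* recorded, but ignored by the affinity run queue */
};

/* One affinity run queue per engine: a single FIFO list, no priority slots. */
struct affq {
    struct proc *e_head;  /* first (oldest) queued process */
    struct proc *e_tail;  /* last (newest) queued process */
};

void affq_init(struct affq *q)
{
    q->e_head = q->e_tail = NULL;
}

/* Enqueue at the tail, regardless of the priority of p. */
void affq_enqueue(struct affq *q, struct proc *p)
{
    assert(p != NULL);
    if (q->e_head == NULL) {            /* empty list: p links to itself */
        p->p_link = p->p_rlink = p;
        q->e_head = p;
    } else {                            /* splice p between tail and head */
        p->p_link = q->e_head;
        p->p_rlink = q->e_tail;
        q->e_tail->p_link = p;
        q->e_head->p_rlink = p;
    }
    q->e_tail = p;
}

/* Dequeue from the head only; interior processes are never considered. */
struct proc *affq_dequeue(struct affq *q)
{
    struct proc *p = q->e_head;
    if (p == NULL)
        return NULL;

    if (p->p_link == p) {               /* p was the only entry */
        q->e_head = q->e_tail = NULL;
    } else {
        p->p_link->p_rlink = p->p_rlink;
        p->p_rlink->p_link = p->p_link;
        q->e_head = p->p_link;
    }
    p->p_link = p->p_rlink = NULL;
    return p;
}
```

If a priority 20 process is enqueued before a priority 1 process, this sketch returns the priority 20 process first, which is the limitation noted in the second respect above.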
Because of the non-dynamic and explicitly-invoked nature of hard affinity, it is used mostly for performance analysis and construction of dedicated system configurations. Hard affinity is not used in many customer configurations because the inflexibility of hard affinity does not map well to the complexity of many real-world applications.
The lifetime of a process can be divided into several states including: (1) the process is "runnable," (i.e., the process is not running, but is ready to run after the scheduler chooses the process), (2) the process is executing (i.e., running) on an engine, and (3) the process is sleeping. The second and third states are well known to those skilled in the art and will not be described herein in detail.
When a process is runnable, the scheduler first checks to see whether the process is hard affinitied to an engine. If the process is hard affinitied, it is enqueued onto the FIFO linked list of the affinity run queue 38 associated with the engine.
If the process is not hard affinitied to an engine and is not marked for FPA hardware, then the process is queued on one of the linked lists of qs in global run queue 34 according to the priority of the process. The appropriate bit in whichqs is updated, if necessary. If the process is not affinitied, but has marked itself as requiring FPA hardware, the process is queued to one of the linked lists of fpa_qs in the FPA global run queue, and the appropriate bit in fpa_whichqs is updated, if necessary, where fpa_qs and fpa_whichqs are analogous to qs and whichqs.
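The enqueue decision described above can be summarized by the following C sketch. The field names p_hard_engine and p_needs_fpa, the placement structure, and the function place_runnable are hypothetical names introduced only for illustration; the sketch decides where a runnable process goes and does not perform the list manipulation itself.

```c
#include <assert.h>
#include <stdbool.h>

#define NQS 32          /* priority slots in each global run queue */
#define NO_ENGINE (-1)  /* hypothetical marker: not hard affinitied */

enum target_queue { TARGET_AFFINITY, TARGET_FPA_GLOBAL, TARGET_GLOBAL };

/* Hypothetical process descriptor. */
struct proc {
    int p_pri;          /* priority, 0 (highest) to 31 (lowest) */
    int p_hard_engine;  /* engine number, or NO_ENGINE */
    bool p_needs_fpa;   /* process has marked itself as requiring FPA hardware */
};

struct placement {
    enum target_queue queue;
    int engine;         /* meaningful only for TARGET_AFFINITY */
    int slot;           /* priority slot for the global queues, -1 otherwise */
};

/* Decide which run queue a newly runnable process is enqueued to. */
struct placement place_runnable(const struct proc *p)
{
    assert(p->p_pri >= 0 && p->p_pri < NQS);
    struct placement where;

    if (p->p_hard_engine != NO_ENGINE) {
        /* Hard affinity is checked first; the affinity queue has one list. */
        where.queue = TARGET_AFFINITY;
        where.engine = p->p_hard_engine;
        where.slot = -1;
    } else {
        /* Otherwise choose a global queue, then a slot by priority. */
        where.queue = p->p_needs_fpa ? TARGET_FPA_GLOBAL : TARGET_GLOBAL;
        where.engine = NO_ENGINE;
        where.slot = p->p_pri;
    }
    return where;
}
```

Note that, in this sketch, a hard affinitied process goes to the affinity run queue even if it also requires FPA hardware, which follows the order of the checks in the text.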
When an engine is looking for a process to run, the engine first examines its affinity run queue 38. If affinity run queue 38 contains a process, the first process of the linked list is dequeued and run. If affinity run queue 38 for a particular engine is empty, then the scheduler examines whichqs of global run queue 34 and fpa_whichqs of the FPA global run queue to see whether processes are queued and at what priorities. The process having the higher priority runs first. If the highest priorities in whichqs of global run queue 34 and fpa_whichqs in the FPA global run queue are equal, the process in the FPA global run queue runs first.
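The selection order described above can be sketched in C as follows. The function and enumerator names are hypothetical, and the sketch assumes that bit i of each mask corresponds to priority slot i (slot 0 being the highest priority) and that the engine is able to run FPA processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NQS 32 /* priority slots per global run queue */

enum run_source { SRC_AFFINITY, SRC_FPA_GLOBAL, SRC_GLOBAL, SRC_IDLE };

/* Return the highest priority (lowest numbered) non-empty slot, or NQS if none. */
int best_slot(uint32_t mask)
{
    for (int i = 0; i < NQS; i++)
        if (mask & (UINT32_C(1) << i))
            return i;
    return NQS;
}

/* Decide which queue an engine takes its next process from. */
enum run_source pick_source(bool affinity_nonempty,
                            uint32_t whichqs, uint32_t fpa_whichqs)
{
    if (affinity_nonempty)
        return SRC_AFFINITY;            /* the affinity run queue always wins */

    int g = best_slot(whichqs);
    int f = best_slot(fpa_whichqs);
    assert(g <= NQS && f <= NQS);

    if (g == NQS && f == NQS)
        return SRC_IDLE;                /* nothing is queued anywhere */

    /* Lower slot number means higher priority; a tie goes to the FPA queue. */
    return (f <= g) ? SRC_FPA_GLOBAL : SRC_GLOBAL;
}
```

Because the affinity run queue is examined first, a low priority hard affinitied process in this sketch is chosen ahead of any process on either global run queue.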
A goal of system 10 is to achieve a linearly increasing level of "performance" (i.e., information processing capacity per unit of time), as engines and disk drives are added. An obstacle to meeting that goal occurs when there is insufficient bus bandwidth (bytes per unit time period) to allow data transfers to freely flow between subsystem elements. One solution would be to increase the bandwidth in system bus 12.sub.P. However, the bandwidth of system bus 12.sub.P is constrained by physical cabinet and connector specifications.
The problem of inadequate bandwidth is exacerbated because system 10 allows customers to add disk drives and engines to increase the value of the number N.sub.P after the system is in the field. In addition, in system 10, engines 1.sub.P -N.sub.P and the disk drives may be replaced with higher performance engines and disk drives. Increased engine performance increases the number of instructions processed per operating system time slice. This in turn requires larger cache memories on the processor boards in an effort to reduce memory-to-processor traffic on the main system bus. However, cache-to-cache bus traffic increases along with the cache memory size thereby frustrating that effort. Likewise, adding multiple disk drives increases the requirement for disk I/O bandwidth and capacity on the bus.
When a process moves from a previous engine to a new engine, there is some cost associated with the transition. Streams of cache data move from one engine to another, and some data are copied from main memory 14.sub.P. Certain traffic loads, database traffic loads in particular, may result in a bus saturation that degrades overall system performance. Data are transferred over system bus 12.sub.P as data are switched from main memory 14.sub.P or the previous cache memory to the new engine and the new cache memory. However, each time a process runs from the global run queue, the odds that the process will run on the same engine as before approach 1/m, where m is the number of active on-line engines. On a large system, m is usually 20 or more, giving a chance of 5% or less that the process will run on the same engine as before. However, it is difficult to accurately characterize the behavior of an operating system. The actual odds will, of course, depend on CPU and I/O load and the characteristics of the jobs running.
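As a worked example of the 1/m estimate, and assuming for illustration only that each of the m engines is equally likely to pick up the process:

```latex
P(\text{same engine}) \approx \frac{1}{m}, \qquad
m = 20 \;\Rightarrow\; P(\text{same engine}) \approx \frac{1}{20} = 5\%, \qquad
P(\text{different engine}) \approx 1 - \frac{1}{m} = 95\%.
```

Under that assumption, roughly 19 out of 20 dispatches from the global run queue on a 20 engine system would place the process on an engine where its cache context must be rebuilt over the system bus.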
In many situations, it is desirable for an engine to stop running an unfinished process and perform another task. For example, while an engine is running one process, a higher priority process may become runnable. The scheduler accommodates this situation through a technique called "nudging." A nudge is a processor-to-processor interrupt that causes a destination processor to re-examine its condition and react accordingly. In the case of a higher priority process, the "nudged" destination engine will receive the interrupt, re-enter the operating system, notice that there is higher priority work waiting, and switch to that work. Each nudge has a corresponding priority value indicating the priority of the event to which the engine responds. As an optimization, the priority of the nudge pending against an engine is recorded per engine. When a nudge is requested at a priority less than or equal to the value already pending on the engine, the redundant nudge is suppressed.
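The suppression of redundant nudges can be sketched in C as follows. The engine structure, its fields, and the functions nudge and nudge_ack are hypothetical. The sketch assumes that a numerically larger nudge value denotes a more urgent event, and it counts interrupts in a field rather than sending a real processor-to-processor interrupt.

```c
#include <assert.h>
#include <stdbool.h>

#define NUDGE_NONE (-1) /* hypothetical marker: no nudge pending */

/* Hypothetical per-engine state. */
struct engine {
    int e_nudge_pri;   /* value of the pending nudge, or NUDGE_NONE */
    int e_interrupts;  /* stand-in counter for interrupts actually sent */
};

/*
 * Request a nudge of the given value against engine e.
 * Returns true if an interrupt was sent, false if it was suppressed.
 * Assumption: larger values are more urgent.
 */
bool nudge(struct engine *e, int pri)
{
    assert(pri >= 0);
    if (e->e_nudge_pri != NUDGE_NONE && pri <= e->e_nudge_pri)
        return false;          /* an equal or more urgent nudge is pending */

    e->e_nudge_pri = pri;      /* record the new pending value */
    e->e_interrupts++;         /* stands in for sending the interrupt */
    return true;
}

/* Called when the engine has handled its pending nudge. */
void nudge_ack(struct engine *e)
{
    e->e_nudge_pri = NUDGE_NONE;
}
```

Recording the pending value per engine lets the requester avoid an interrupt whenever the destination engine is already going to re-examine its condition for an event of at least the same priority.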
When a process becomes runnable, the scheduler scans engines 1.sub.P -N.sub.P for the engine(s) running the lowest priority process(es). If the newly runnable process has a priority equal to or greater than that of those presently running processes, the engine (or one of the engines) running the lowest priority process (e.g., engine 1.sub.P) is "nudged" to reschedule in favor of the newly runnable process. Consequently, a process (e.g., process X) ceases to run on engine 1.sub.P, at least temporarily. During the time other processes are running on engine 1.sub.P, the cache context for process X erodes.
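The scan for an engine to nudge can be sketched in C as follows. The function name and the array representation of per-engine running priorities are hypothetical. The sketch uses the run queue numbering, in which 0 is the highest priority, so the lowest priority running process has the numerically largest value; it also assumes every engine is currently running a process.

```c
#include <assert.h>

#define NO_ENGINE (-1) /* hypothetical marker: no engine should be nudged */

/*
 * running_pri[i] is the priority of the process running on engine i
 * (0 = highest priority, larger numbers = lower priority).
 * Returns the index of the engine to nudge, or NO_ENGINE if the newly
 * runnable process has lower priority than everything now running.
 */
int pick_engine_to_nudge(const int running_pri[], int n_engines, int new_pri)
{
    assert(n_engines > 0);

    /* Find the engine running the lowest priority process. */
    int victim = 0;
    for (int i = 1; i < n_engines; i++)
        if (running_pri[i] > running_pri[victim])
            victim = i;

    /* Nudge only if the new process has equal or greater priority. */
    if (new_pri <= running_pri[victim])
        return victim;
    return NO_ENGINE;
}
```

When several engines tie for the lowest priority, this sketch picks the first one found; the text leaves the choice among such engines open.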
Two problems with the prior art scheduler are illustrated by considering what happens when process X (in the example above) becomes runnable again. First, if process X is not hard affinitied, it is enqueued to a global run queue. However, as noted above, there is only approximately a 1/m chance that process X will next run on engine 1.sub.P. Therefore, even though the cache context for process X may be very high in cache 1.sub.P of engine 1.sub.P, process X will probably be run on another engine. If process X is run on another engine, some of the capability of system 10 will be used in moving data over system bus 12.sub.P. As data are moved over system bus 12.sub.P to the other engine, system performance may be reduced.
Second, if process X is hard affinitied, it will be enqueued onto affinity run queue 38 of engine 1.sub.P, regardless of how many other processes are enqueued onto affinity run queue 38 of engine 1.sub.P and regardless of whether other engines are idle. Therefore, system performance may be reduced because of idle engines.
Thus, the prior art scheduler poorly reuses cache memory unless the flexibility of symmetrical multiprocessing is given up by using hard affinity. Therefore, there is a need for a scheduler that causes a runnable process to be enqueued onto the affinity run queue of an engine when the cache context (or warmth) of the process with respect to the engine is sufficiently high, and to be enqueued onto a global run queue when the cache context is sufficiently low. Additionally, periodic CPU load balancing calculations (schedcpu()) could be improved to maintain a longer-term view of engine load and to cause redistribution of processes if a significant excess of processes exists at any particular engine. Further, such redistribution of processes could consider the priority of the processes to be moved.