Many current computer systems employ a multi-processor configuration that includes two or more processing units interconnected by a bus system, each being capable of independent or cooperative operation. Such a multi-processor configuration increases the total system processing capability and allows the concurrent execution of multiple related or separate tasks by assigning each task to one or more processors. Such systems also typically include a plurality of mass storage units, such as disk drive devices, to provide adequate storage capacity for the number of tasks executing on the systems.
One type of multi-processor computer system embodies a symmetric multiprocessing (SMP) computer architecture, which is well known in the art as overcoming the limitations of single or uni-processors in terms of processing speed and transaction throughput, among other things. Typical, commercially available SMP systems are generally “shared memory” systems, characterized in that multiple processors on a bus, or a plurality of busses, share a single global memory or shared memory. In shared memory multiprocessors, all memory is uniformly accessible to each processor, which simplifies the task of dynamic load distribution. Processing of complex tasks can be distributed among various processors in the multiprocessor system while data used in the processing is substantially equally available to each of the processors undertaking any portion of the complex task. Similarly, programmers writing code for typical shared memory SMP systems do not need to be concerned with issues of data partitioning, as each of the processors has access to and shares the same, consistent global memory.
There is shown in FIG. 1 a block diagram of an exemplary multiprocessor system that implements an SMP architecture. For further details regarding this system, reference shall be made to U.S. Ser. No. 09/309,012, filed Sep. 3, 1999, the teachings of which are incorporated herein by reference.
Another computer architecture known in the art for use in a multi-processor environment is the Non-Uniform Memory Access (NUMA) architecture or the Cache Coherent Non-Uniform Memory Access (CCNUMA) architecture, each of which is an extension of SMP but which supplants SMP's "shared memory architecture." NUMA and CCNUMA architectures are typically characterized as having distributed global memory. Generally, NUMA/CCNUMA machines consist of a number of processing nodes connected through a high bandwidth, low latency interconnection network. Each processing node comprises one or more high-performance processors, associated cache, and a portion of a global shared memory. Each node or group of processors has near and far memory, near memory being resident on the same physical circuit board and directly accessible to the node's processors through a local bus, and far memory being resident on other nodes and accessible over a main system interconnect or backbone. Cache coherence, i.e., the consistency and integrity of shared data stored in multiple caches, is typically maintained by a directory-based, write-invalidate cache coherency protocol, as known in the art. To determine the status of caches, each processing node typically has a directory memory corresponding to its respective portion of the shared physical memory. For each line or discrete addressable block of memory, the directory memory stores an indication of the remote nodes that are caching that same line.
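By way of illustration, the directory bookkeeping described above can be sketched as follows. The structure, field names, and node limit are illustrative assumptions, not those of any particular NUMA/CCNUMA implementation:

```c
#include <stdint.h>

#define MAX_NODES 32  /* assumed limit; real machines vary */

/* Hypothetical directory entry: one per line of a node's local
 * portion of the shared memory. The presence mask records which
 * remote nodes hold a cached copy of that line. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    presence;  /* bit i set => node i caches this line */
} dir_entry_t;

/* On a write by node `writer`, a write-invalidate protocol sends an
 * invalidation to every other node in the presence mask, then records
 * the writer as the sole holder. Returns the mask of invalidated
 * nodes (the messaging itself is omitted from this sketch). */
uint32_t dir_handle_write(dir_entry_t *e, int writer)
{
    uint32_t victims = e->presence & ~(1u << writer);
    e->presence = 1u << writer;
    e->state = DIR_EXCLUSIVE;
    return victims;
}
```

A write to a line shared by several nodes thus invalidates every copy but the writer's, which is what keeps the distributed caches consistent.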
There is shown in FIG. 2 a high-level block diagram of another exemplary multiprocessor system but which implements a CCNUMA architecture. For further details regarding this system, reference shall be made to U.S. Pat. No. 5,887,146, the teachings of which are incorporated herein by reference.
As is known to those skilled in the art, each of these multiprocessor computer systems includes an operating system that is executed thereon so that software programs (e.g., spreadsheet or word processing programs) can be executed on the system. The operating system controls the access of the executing programs to various resources, such as computer readable media (e.g., hard drives), output media (e.g., printers), and communications media (e.g., modems), and also controls the execution of the one or more programs being executed/accessed on the multiprocessor computer system at or about the same time.
Before proceeding with describing prior art operating systems for use on multiprocessor computer systems, in particular dispatching, it is first useful to establish what certain terms are intended to mean. Although programs and processes appear similar on the surface, they are fundamentally different. A program is a static sequence of instructions, whereas a process is a set of resources reserved for the thread(s) that execute the program. For example, at the highest level of abstraction a process in a Windows NT environment comprises the following: an executable program, which defines initial code and data; a private virtual address space, which is a set of virtual memory addresses that the process can use; system resources, such as semaphores, communications ports, and files, that the operating system allocates to the process when threads open them during the program's execution; a unique identifier called a process ID (internally called a client ID); and at least one thread of execution.
A thread is the entity within a process that the operating system schedules for execution; without it, the process's program cannot run. A thread typically includes the following components: the contents of a set of volatile registers representing the state of the processor; two stacks, one for the thread to use while executing in kernel mode and one for executing in user mode; a private storage area for use by subsystems, run-time libraries, and dynamic link libraries (DLLs); and a unique identifier called a thread identifier (also internally called a client ID). Process IDs and thread IDs are generated out of the same namespace, so they do not overlap. The volatile registers, the stacks and the private storage areas are called the thread's context. Because this information is different for each machine architecture that the operating system runs on, this structure is architecture specific.
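The process and thread components enumerated above can be sketched as data structures. The field names and layout below are illustrative assumptions only, not the actual Windows NT kernel structures:

```c
#include <stdint.h>

/* A thread's context: volatile register state, its kernel- and
 * user-mode stacks, and private (thread-local) storage. The real
 * layout is architecture specific. */
typedef struct {
    uintptr_t registers[16];  /* volatile register snapshot (size assumed) */
    void     *kernel_stack;   /* stack used while executing in kernel mode */
    void     *user_stack;     /* stack used while executing in user mode */
    void     *tls;            /* private storage for subsystems, DLLs */
} thread_context_t;

typedef struct thread {
    uint32_t         thread_id;  /* same namespace as process IDs */
    thread_context_t context;
    struct thread   *next;       /* sibling thread in the owning process */
} thread_t;

typedef struct {
    uint32_t  process_id;   /* internally, a client ID */
    void     *vaddr_space;  /* private virtual address space */
    void    **handles;      /* open handles: files, sections, mutexes, ... */
    thread_t *threads;      /* at least one thread of execution */
} process_t;
```

Every thread reachable from `threads` shares the one `vaddr_space` and `handles` of its owning process, which is the sharing relationship the following paragraphs rely on.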
Although threads have their own execution context, every thread within a process shares the process's virtual address space in addition to the rest of the resources belonging to the process. This means that all of the threads in a process can write to and read from each other's memory. Threads cannot reference the address space of another process, unless the other process makes available part of its private address space as a shared memory section. In addition to a private address space and one or more threads, each process has a list of open handles to objects such as files, shared memory sections, and one or more synchronization objects such as mutexes, events or semaphores.
The kernel component of the operating system, sometimes referred to simply as the kernel, performs the most fundamental operations in the operating system, determining how the operating system uses the processor or processors and ensuring that the processor(s) are used prudently. The primary functions of the kernel include thread scheduling and dispatching, trap handling and exception dispatching, multiprocessor synchronization, and providing the base kernel objects that are used by the operating system executive. The kernel also handles context swapping, kernel event notification, IO and memory management. More specifically, the kernel of a multi-processor computer system determines which threads or processes run on which processors and also determines when a thread/process will run on a given processor.
The dispatcher is that part of the kernel that focuses on the scheduling function of when and where to run processes, more particularly the threads of such processes. The dispatcher also controls how long each thread can be allowed to run before being preempted to run another thread. Reference is made herein to Chapter 4 of "Inside Windows NT", Second Edition, D. Solomon, 1998, the teachings of which are incorporated herein by reference, for further details as to the general process in a Windows NT environment for inter alia thread scheduling.
A crucial concept in operating systems is typically referred to as mutual exclusion and generally refers to making sure that one, and only one, thread can access a particular resource at a time. Mutual exclusion is necessary when a resource does not lend itself to shared access or when sharing would result in an unpredictable outcome. For example, the output of two threads from two files being copied to a printer port at the same time could become intermingled. Similarly, if one thread is reading from a memory address at the same time another thread is writing to the same address, the data being read by the first thread becomes unpredictable or unusable.
This concept of mutual exclusion is of particular concern for multi-processor computing systems because code is being run simultaneously on more than one processor, and that code shares certain data structures stored in the particular form of system memory of the SMP or NUMA type of multi-processor system. Thus, for multi-processor computing systems, the kernel typically is configured to provide mutual exclusion primitives that it and the rest of the operating system executive use to synchronize their access to global data structures. The mechanism by which the kernel achieves multi-processor mutual exclusivity is a locking mechanism associated with the global data structure, commonly implemented as a spinlock. Thus, before entering and updating a global data structure, the kernel first acquires the spinlock and locks the global data structure. The kernel then updates the data structure and, after updating the data structure, the spinlock is released.
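The acquire/update/release sequence described above can be illustrated with a minimal spinlock sketch using C11 atomics. Real kernel spinlocks additionally raise the interrupt level and apply back-off, which is omitted here; the `dispatcher_lock` and `global_counter` names are illustrative stand-ins, not actual kernel symbols:

```c
#include <stdatomic.h>

/* Minimal spinlock: a single atomic flag. */
typedef struct { atomic_flag locked; } spinlock_t;

void spin_lock(spinlock_t *l)
{
    /* Busy-wait ("spin") until the flag was previously clear;
     * acquire ordering makes the protected data visible. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Usage mirroring the sequence in the text: acquire the lock,
 * update the global data structure, release the lock. */
static spinlock_t dispatcher_lock = { ATOMIC_FLAG_INIT };
static int global_counter;  /* stands in for a global dispatch structure */

void update_global(void)
{
    spin_lock(&dispatcher_lock);
    global_counter++;            /* mutate shared state under the lock */
    spin_unlock(&dispatcher_lock);
}
```

Because only one processor can hold the flag at a time, concurrent callers of `update_global` serialize on the lock, which is precisely the contention point the later paragraphs identify for the dispatcher lock.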
One widely held and highly contended spinlock is the kernel dispatcher lock that provides a mechanism for protecting all data structures associated with thread execution, context swapping and kernel event notification. Because event notification can result in a change of executing thread and/or a context swap, some operating systems, including for example Windows NT, utilize a single global lock to protect all dispatching data structures. The current dispatcher design, with reference to FIG. 3, implements a structure whereby the threads of all processes coordinate wait notification through defined wait blocks. These software constructs allow any thread in the system to wait for any other dispatcher object in the system to be signaled.
There is shown in FIG. 3 illustrative wait data structures that show the relationship of dispatcher objects to wait blocks to threads. In this illustrative example, none of the threads are being executed because they are waiting on dispatcher objects. As is also illustrated, thread 1 is waiting on both visible dispatcher objects, and threads 2 and 3 are each waiting on only one of the two dispatcher objects. Thus, if only one of the two visible objects is signaled, the kernel will see that, because thread 1 is also waiting on another object, it cannot be readied for execution. On the other hand, the kernel will see that the thread waiting only on the dispatcher object that has been signaled can be readied for execution because it is not waiting on other objects. There also is shown in FIG. 4 some selected kernel dispatcher objects, as well as an illustration of system events that can induce a change in the state of a thread(s) and the effect of the signaled state on waiting threads. In the case of IO, the completion of a DMA operation may signal a dispatcher object; a waiting thread would then be readied for execution when the kernel dispatcher determines that the dispatcher object indicating the completion of the DMA operation/IO process, and referenced by (linked to) the waiting thread, has been signaled.
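The FIG. 3 relationships can be sketched as linked structures in which each wait block ties one thread to one dispatcher object; a thread waiting on several objects owns several wait blocks, and each object keeps a list of the wait blocks of all threads waiting on it. The names and the wait-all readying rule below are illustrative assumptions, not the actual NT structures:

```c
#include <stdbool.h>

struct wait_block;

typedef struct dispatcher_object {
    bool signaled;
    struct wait_block *waiters;  /* wait blocks queued on this object */
} dispatcher_object_t;

typedef struct kthread {
    int wait_count;              /* objects this thread still waits on */
} kthread_t;

typedef struct wait_block {
    kthread_t           *thread;       /* the waiting thread */
    dispatcher_object_t *object;       /* the object waited on */
    struct wait_block   *next_waiter;  /* next wait block on same object */
} wait_block_t;

/* Signal an object: a waiter becomes ready only when this was the last
 * object it was waiting on (as with thread 1 in FIG. 3, which waits on
 * two objects). Returns the number of threads readied. */
int signal_object(dispatcher_object_t *obj)
{
    int readied = 0;
    obj->signaled = true;
    for (wait_block_t *w = obj->waiters; w; w = w->next_waiter)
        if (--w->thread->wait_count == 0)
            readied++;
    return readied;
}
```

Signaling one object thus readies the single-object waiters on it while a multi-object waiter such as thread 1 merely has its outstanding count decremented.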
Now referring to FIG. 5, there is shown a high level flow diagram illustrating the process followed to reschedule execution of a thread and to update the data structure associated with dispatching. When the execution of a running thread is pre-empted, terminated or otherwise stopped, the kernel locks the entire dispatch database and examines the dispatch database structure to identify the next thread to be executed, STEPS 102, 104. This determination is achieved using any of a number of techniques and criteria known to those skilled in the art and is typically particular to the specific operating system. As noted above, some illustrative criteria are provided in Chapter 4 of "Inside Windows NT" for the scheduling of threads. In a multiprocessor computing system, such identification also includes identifying the processor on which the released or readied thread is to be executed.
The kernel then updates the dispatch database, STEP 106. For example, the kernel updates the database to note the change in state of the thread to be executed on a processor. The kernel also would evaluate the wait list or wait list structure and update it based on the actions taken to pre-empt or stop the thread that had been running. If it is determined that the thread which had been pre-empted or otherwise stopped (e.g., by a timer expiring) is to continue running, then the kernel would update the database based on the added quantum or time the thread is to be run. Once all of the updating and evaluating is completed, the kernel releases the global dispatcher lock, STEP 108. Thereafter, the identified readied or released thread is executed on the identified processor.
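The STEPS 102 through 108 flow can be sketched as follows: the whole dispatch database sits behind one lock, so every reschedule in the system serializes through this one function. The structures and the head-of-ready-list selection policy are illustrative assumptions; a real dispatcher applies priority and processor-affinity criteria, and the lock here is shown as a plain flag rather than a full spinlock for brevity:

```c
#include <stddef.h>

typedef struct rthread {
    int priority;
    struct rthread *next;
} rthread_t;

typedef struct {
    int locked;        /* stands in for the global dispatcher spinlock */
    rthread_t *ready;  /* ready list, assumed ordered highest priority first */
} dispatch_db_t;

/* STEPS 102/104: lock the database and identify the next thread to
 * execute (here, simply the head of the ready list). STEP 106: update
 * the dispatch database (dequeue the chosen thread). STEP 108:
 * release the lock. Returns the thread to run, or NULL if none. */
rthread_t *reschedule(dispatch_db_t *db)
{
    db->locked = 1;               /* acquire global dispatcher lock */
    rthread_t *next = db->ready;  /* identify next thread */
    if (next)
        db->ready = next->next;   /* update dispatch database */
    db->locked = 0;               /* release global dispatcher lock */
    return next;
}
```

Because the lock is held across both the selection and the update, no other processor can reschedule concurrently, which is the serialization the following paragraph criticizes.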
As noted above, the dispatcher lock is a global lock that prevents any other thread rescheduling from occurring until the lock is released. Because each of the processes running on a multi-processor computing system involves or requires thread scheduling/rescheduling, the scheduling of execution of the threads for different processes is in competition no matter how simple or brief the process the kernel follows for dispatching. Likewise, because the notification (signaling) and querying of dispatcher objects (events) is also involved in the execution of many threads, these operations also are in competition with each other. Additionally, the dispatching process performed by the kernel becomes increasingly more time consuming as more concurrent thread rescheduling or event notification operations are performed and therefore contend for the dispatcher spinlock. Further, while the dispatching operation is being performed for one thread, the dispatching of other threads presently stopped and needing to be re-scheduled cannot be accomplished, and thus the application processes/processors for these other threads are unable to proceed (i.e., are pended or delayed). Consequently, the time to perform a task by the application program/process is in effect increased by such delays.
It thus would be desirable to provide a methodology and operating system, particularly for multi-processor computing systems, that would allow parallel dispatching, at least for frequently occurring events, without having to employ a system wide global lock of the dispatching database or structure. It would be particularly desirable to provide such a methodology and operating system having a plurality of local locks for dispatching, each of the plurality of local locks locking a grouping of dispatchable objects. It also would be desirable to provide a multiprocessor computing system and/or software for execution on such systems embodying such methodologies. Further, it would be desirable to provide such methods, operating systems, and multiprocessor computing systems that reduce the amount of time to perform a task in comparison to that provided using prior art dispatching methods.