The desire to improve computer system performance has driven innovation in both computer architecture and operating systems design. In many contexts, an important goal is to maximize throughput (the number of tasks processed in a given unit of time). One persistent obstacle has been the so-called Von Neumann bottleneck, the gap between faster processors and slower memory. Two general architectural approaches to minimizing this gap and increasing throughput are the use of a multilevel memory hierarchy and the use of multiple processors. In a memory hierarchy, faster, smaller memories are placed closer to the processor, and efforts are made to keep recently-accessed items in the fastest memories. The fastest level of memory, apart from the processor's registers, is the cache, which can itself have one or more levels, and which is organized into cache lines of a fixed size. A cache hit occurs when the processor finds a requested data item in the cache. If the data is not found in the cache, a cache miss occurs, and a line-size block of data that includes the requested item is retrieved from the main memory and placed in a particular cache line.
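The hit-and-miss behavior described above can be illustrated with a minimal sketch. It assumes a direct-mapped cache of eight 64-byte lines. The class name `DirectMappedCache` and both sizes are hypothetical choices made for illustration, and the model tracks only which memory block each line holds, not the data itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical direct-mapped cache model (sizes are illustrative assumptions).
class DirectMappedCache {
public:
    static constexpr std::size_t kLineSize = 64;  // bytes per cache line
    static constexpr std::size_t kNumLines = 8;   // number of lines in the cache

    // Returns true on a cache hit. On a miss, the line-size block containing
    // `address` is treated as fetched from main memory and placed in the one
    // cache line that the block maps to, replacing that line's old contents.
    bool access(std::uint64_t address) {
        const std::uint64_t block = address / kLineSize;  // memory block number
        const std::size_t index =
            static_cast<std::size_t>(block % kNumLines);  // target cache line
        assert(index < lines_.size());
        Line& line = lines_[index];
        if (line.valid && line.block == block) {
            ++hits_;
            return true;
        }
        line.valid = true;
        line.block = block;
        ++misses_;
        return false;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    struct Line {
        bool valid = false;       // false until the line is first filled
        std::uint64_t block = 0;  // memory block currently held
    };
    std::array<Line, kNumLines> lines_{};
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

In this sketch, an access to address 0 misses and fills a line, after which accesses to addresses 8 and 63 hit because they fall within the same 64-byte block. An access to address 512 maps to the same line and evicts the earlier block.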
Most modern multiprocessor machines belong to one of two groups, distinguished by how their memory is organized. The first group includes machines with a relatively small number of processors sharing a single centralized main memory connected to the processors by a bus. These machines are called UMAs (uniform memory access multiprocessors) because the time to access any main memory location is uniform for each processor. The second group comprises machines in which memory is physically distributed among the processors, allowing larger numbers of processors to be supported. Those machines in this group that feature a logically-shared main memory address space are known as NUMAs (non-uniform memory access multiprocessors). In NUMA machines, a processor's access time for a particular data word in main memory depends on the location of that word.
Modern operating systems maximize processor utilization and reduce the impact of memory latency through their support for multiprogramming, the ability of several concurrently executing programs, referred to as processes or threads, to share computer resources. When a processor is executing a first process and the operating system then permits a second process to execute, the operating system performs a context switch. A context switch involves saving the current state of the first process so that its execution can continue at a later time, and restoring the previously saved state of the process that is about to resume execution.
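The save-and-restore step of a context switch can be sketched as follows. The types `CpuState` and `Process` and the function `context_switch` are hypothetical names introduced for illustration. A real operating system saves considerably more state and does so in privileged, architecture-specific code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified view of the processor state that must
// survive a context switch.
struct CpuState {
    std::uint64_t program_counter = 0;
    std::uint64_t stack_pointer = 0;
    std::array<std::uint64_t, 4> registers{};
};

// Hypothetical process record holding the state saved at its last switch-out.
struct Process {
    int id;
    CpuState saved;
};

// Saves the running state into `outgoing`, then loads the state that
// `incoming` saved when it last stopped executing.
void context_switch(CpuState& cpu, Process& outgoing, const Process& incoming) {
    assert(&outgoing != &incoming);  // switching to oneself is not a switch
    outgoing.saved = cpu;
    cpu = incoming.saved;
}
```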
In systems supporting multiprogramming, there is potentially a high degree of interaction among concurrent processes. The effective coordination of multiple processes is a central problem in operating systems design. It is typical for several processes to require access to some object residing in shared memory, which may itself include methods or data made available by another process. At some point, it may be necessary for such objects to be “run down” (destroyed). For example, the shared object might be a loaded antivirus driver, which typically has a long lifetime (that is, it will be kept loaded while the operating system is running). Periodically, however, it will be necessary for the driver to be unloaded so that an updated version of the driver can be loaded in its place. In such situations, there is a danger that a process might attempt to access an object that has already been deleted or made unavailable, leading to erroneous and unpredictable program and system behavior. It is important, therefore, to guard against the premature destruction or removal of a shared object while references to the object are outstanding.
Solutions to this problem have generally made use of synchronization mechanisms provided by the operating system or the hardware. One conventional approach is to protect the object by placing it under a mutually exclusive lock that can only be acquired by one process at a time. If a second process requires access to the object, it must wait until the lock is released by the first process. This is undesirable for a number of reasons. The time spent on acquisitions and releases of locks may exceed the time needed for access to the object itself, causing a performance bottleneck. Moreover, the use of such locks may lead to deadlocked processes vying for access to the same object. Efforts to minimize the occurrence of deadlocks typically require the diversion of substantial computing resources. Locks also involve the consumption of considerable memory space.
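The conventional lock-based approach can be sketched as below, assuming a user-mode C++ setting with `std::mutex` standing in for an operating system lock. The class `LockProtected` and its member names are hypothetical. Every access, however brief, must acquire and release the lock, and a second accessor must wait while the first holds it.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical wrapper guarding a shared object with a mutually exclusive lock.
class LockProtected {
public:
    explicit LockProtected(std::string payload) : payload_(std::move(payload)) {}

    // Runs `fn` on the object while holding the lock. Returns false, without
    // calling `fn`, if the object has already been run down.
    template <typename Fn>
    bool with_object(Fn fn) {
        std::lock_guard<std::mutex> guard(lock_);
        if (!alive_) {
            return false;
        }
        fn(payload_);
        return true;
    }

    // Destroys the object. The lock ensures no accessor is inside
    // with_object() while the payload is being discarded.
    void run_down() {
        std::lock_guard<std::mutex> guard(lock_);
        assert(alive_ && "object run down twice");
        payload_.clear();
        alive_ = false;
    }

private:
    std::mutex lock_;
    std::string payload_;
    bool alive_ = true;
};
```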
A more finely-grained solution, called rundown protection, has been implemented in Microsoft Corporation's “WINDOWS XP” operating system. Under this approach, a global reference count associated with a particular object is used to ensure that destruction or removal of the object will be delayed until all of the accesses that have already been granted to the object have completed and been released. Access serialization is achieved using atomic interlocked hardware instructions rather than mutually exclusive software locks. Rundown protection is optimized for rapid and lightweight accesses and releases of object references. This is sensible, because typically protection is desired in situations involving a long-lived object which is referenced and dereferenced many times throughout its lifetime. In the loaded antivirus driver example, the references are I/O requests, with the dereferences occurring when the I/O is completed, and rundown protection may be invoked by a kernel component, such as a file system filter manager, to guard against premature unloading of the driver.
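The general idea of a reference count manipulated with atomic instructions can be sketched in user-mode C++ as follows. This is an illustrative analogue under stated assumptions, not the operating system's implementation. The class `RundownRef`, the encoding (bit 0 as a rundown-in-progress flag, the count in the remaining bits), and the busy-wait in `wait_for_rundown` are all choices made for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical rundown reference: one atomic word, no software lock.
class RundownRef {
public:
    // Attempts to take a reference. Fails once rundown has begun, so no new
    // accesses are granted to an object that is about to be destroyed.
    bool acquire() {
        std::uint64_t old = state_.load(std::memory_order_relaxed);
        while ((old & kRundownActive) == 0) {
            if (state_.compare_exchange_weak(old, old + kCountInc,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
                return true;
            }
            // On failure `old` now holds the current value; re-check the flag.
        }
        return false;
    }

    // Drops a reference previously granted by acquire().
    void release() {
        const std::uint64_t old =
            state_.fetch_sub(kCountInc, std::memory_order_release);
        assert(old >= kCountInc && "release without matching acquire");
        (void)old;
    }

    // Blocks further acquisitions, then waits until every reference already
    // granted has been released. Only then is it safe to destroy the object.
    void wait_for_rundown() {
        state_.fetch_or(kRundownActive, std::memory_order_acq_rel);
        while (state_.load(std::memory_order_acquire) != kRundownActive) {
            std::this_thread::yield();  // assumed: simple polling, not an event
        }
    }

    std::uint64_t outstanding() const {
        return state_.load(std::memory_order_acquire) / kCountInc;
    }

private:
    static constexpr std::uint64_t kRundownActive = 1;  // bit 0: rundown begun
    static constexpr std::uint64_t kCountInc = 2;       // count lives above bit 0
    std::atomic<std::uint64_t> state_{0};
};
```

In the antivirus driver example, a component would call `acquire()` when an I/O request begins, call `release()` when the request completes, and call `wait_for_rundown()` before unloading the driver.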
Despite the advantages of rundown protection over previous solutions, its routines for acquiring and releasing references have been discovered to cause significantly degraded performance on multiprocessor machines when used by I/O-coordinating file system filter managers to manage the unloading of file system filter drivers, the subject of a copending commonly-assigned U.S. patent application filed today, bearing U.S. Ser. No. 10/461,078. Interlocked increments and decrements were found to cause cache pinging: following each reference count increment or decrement, every other processor holding the cache line that contains the count would have to flush and refresh its copy with the new value, in accordance with the machine's coherence protocols. The degradation worsens as more processors are included in the system. It would be desirable, therefore, to provide an improved form of rundown protection that would scale with the number of processors in a multiprocessor computer system.
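The cache pinging effect can be illustrated with a toy, single-threaded model of one cache line that holds a shared reference count. The model assumes a simple write-invalidate coherence scheme, which is one possibility among several and is not stated in the text above. The class `SharedLineModel` and its counters are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a single cache line shared by several processors,
// under an assumed write-invalidate coherence protocol.
class SharedLineModel {
public:
    explicit SharedLineModel(std::size_t processors)
        : has_copy_(processors, false) {}

    // Processor `cpu` performs an interlocked increment or decrement of the
    // count stored in the line.
    void interlocked_update(std::size_t cpu) {
        assert(cpu < has_copy_.size());
        if (!has_copy_[cpu]) {
            ++line_transfers_;  // line must be fetched before it can be written
        }
        for (std::size_t other = 0; other < has_copy_.size(); ++other) {
            if (other != cpu && has_copy_[other]) {
                has_copy_[other] = false;  // stale copy is invalidated
                ++invalidations_;
            }
        }
        has_copy_[cpu] = true;
    }

    std::size_t line_transfers() const { return line_transfers_; }
    std::size_t invalidations() const { return invalidations_; }

private:
    std::vector<bool> has_copy_;  // which processors hold a valid copy
    std::size_t line_transfers_ = 0;
    std::size_t invalidations_ = 0;
};
```

In this model, eight updates issued by a single processor cost one line transfer and no invalidations, whereas eight updates issued round-robin by four processors cost eight line transfers and seven invalidations, because the line moves to a different processor's cache on every update.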