The present invention generally relates to a processing system of the type including a plurality of processor subsystems and more particularly to such a processing system and method wherein each processor subsystem includes a lock buffer for controlling exclusive critical program section accesses by the processor subsystems.
Processing systems may include a plurality of processors, each forming a processor subsystem, which require access to a shared memory over a common bus in order to execute instructions in accordance with their respective programs. Such programs may include related program portions which require a processor to use resources, such as shared memory, which other processors also require when executing their related program portions. Because such resources are shared, a processor must execute a related program portion by itself, and not simultaneously with the execution of a related program portion by another processor, in order to guarantee correct operation. Such related program portions are known as critical sections, and multiprocessor systems must be arranged to provide a processor exclusive access to a shared resource, such as a shared memory, by assuring that only one processor is executing a critical section at any one time.
Hence, in a shared memory multiprocessor system, provision must be made to allow a processor to have exclusive access to some shared resource during the time in which it is executing a critical section. When a processor is executing a critical section, no other processor can be in a related critical section. A critical section must be guarded so that only one processor can be in a critical section at any one time. The guard may be a code segment that precedes a critical section and which has the function to prevent more than one processor from executing a critical section.
One prior art method for implementing the guard into a critical section uses interlock variables. An interlock variable may have one of two values, an available value indicating that no processor is executing a critical section, and a busy value indicating that a processor is executing a critical section. In accordance with this method, the shared memory includes a memory location for storing the value of the interlock variable and each processor includes a register. When a processor wishes to enter a critical section, it reads the interlock variable within the memory location of the shared memory and loads that value of the interlock variable into its register. The processor also writes back to the interlock variable memory location of the shared memory the busy value of the interlock variable. The reading and writing of the interlock variable are performed atomically so that no other processor can access the common bus between the read and the write. If, following the read and the write, the register of the processor contains a busy value, the processor will not enter its critical section but instead will perform the read and write operation again. However, if the register of the processor contains the available value of the interlock variable, the processor will enter its critical section.
The foregoing forms a loop in which the interlock variable in the interlock variable memory location of the shared memory is being tested. Such a loop is a type of guard known as a spin-lock. A busy value of the interlock variable indicates to the testing processor that another processor "owns" the interlock variable and is in a critical section. An available value of the interlock variable indicates that no processor is in its critical section. The testing processor then acquires the interlock variable by writing the busy value into the interlock variable to communicate to all the processors that it is in its critical section. The processor then enters its critical section, and no other processor wishing to enter a critical section will be able to do so until the owning processor has completed its critical section.
When an owning processor completes execution of its critical section, it then communicates this fact to the other processors by writing the available value to the interlock variable in the shared memory. The next processor wishing to enter a critical section will then test the available value of the interlock variable and perform the same operations to enter its critical section. Ownership transfer of the interlock variable thus occurs when one processor writes the available value into the interlock variable and another processor subsequently acquires it in the manner described above.
While this method simplifies the implementation of assuring exclusive access to a critical section, the common bus becomes a performance bottleneck. This results because a processor wishing to acquire the interlock variable and enter a critical section must continually utilize the common bus to test the value of the interlock variable in the shared memory.
Another and still more efficient method of providing exclusive access to a critical section by a processor employs a cache associated with each of the processors for storing, locally to each processor, the most recent value of the interlock variable. Such caches can allow the value of the interlock variable to be modified relative to the shared memory. When the cache of another processor attempts to read the value of the interlock variable from the shared memory, the cache with the most recently modified value of the interlock variable intervenes and supplies the value of the interlock variable instead of the shared memory. In this way, all of the caches see the same correct value of the interlock variable even though the caches may be more up-to-date than the shared memory.
In such a system, when a processor desires to enter a critical section, its cache fetches the value of the interlock variable from either the shared memory or the cache having the most recently updated value of the interlock variable, stores it, and then sends that value to its associated processor. If the interlock variable has a busy value, the cache does not follow the read with a write. Subsequent testing of the value of the interlock variable by this processor is performed locally in its cache, and, as a result, the shared bus is not accessed for this purpose. Each processor wishing to enter a critical section thus obtains the busy value of the interlock variable in its associated cache and goes into a loop, testing its local copy.
Eventually, the owning processor releases the interlock variable by executing a write instruction for writing the available value of the interlock variable on the shared bus while a "LOCK pin" is asserted. Each cache with a copy of the interlock variable invalidates its copy upon seeing the locked write. The next time such a processor wishes to enter a critical section, its cache will obtain, over the common bus, the available value of the interlock variable from either the shared memory or a cache, will become the owner of the interlock variable, and locally set the value of the interlock variable in its cache to the busy value. Thus, processors that subsequently read the interlock variable will read a busy value.
Hence, in accordance with the above-described prior art method, the common bus is used more efficiently by allowing each processor to cache the value of the interlock variable locally within its cache and to locally test the value of the interlock variable without using the common bus except for initially loading the interlock variable into its cache. Considerable common bus traffic still occurs, however, after a processor completes a critical section and writes the available value to the interlock variable. This is because all processors invalidate their copies of the value of the interlock variable when the owning processor releases the interlock variable. Each processor must then in turn obtain the new value of the interlock variable over the common bus, making the common bus a bottleneck in the process. With this method, the number of common bus accesses each time ownership transfer of the interlock occurs is proportional to the number of processors waiting to enter a critical section. This level of common bus activity is still too high for processing systems having a large number of processors.
Another and still more efficient method and processing system for providing exclusive access to a critical section by a processor is fully disclosed and claimed in copending United States patent application, Ser. No. 07/513,806, filed Apr. 24, 1990, for "Interlock Variable Acquisition System and Method" in the names of the inventors of the instant invention and which is also assigned to the assignee of the present invention. That system also utilizes a cache associated with each of the processors for storing, locally to each processor, the most recent value of the interlock variable. When a processor completes a critical section, it broadcast writes the available value of the interlock variable over the common bus to the other caches associated with processors which share the interlock variable, to thus release the interlock variable. When another processor subsequently wishes to enter a critical section, it tests the value of the interlock variable in its cache, since it did not invalidate its copy when the interlock variable was released but instead updated the value. If this processor detects the available value, it then proceeds to obtain the common bus. If the bus is obtained before the value of the interlock variable is changed to the busy value by another processor, the busy value is written into the local cache and broadcast written to the other caches, while the available value is returned to the now owning processor. This method further reduces common bus traffic since all caches contain the most recent value of the interlock variable and need not access the common bus to first acquire the most recent value unless, for some reason, the value in a cache is invalid. Hence, in most cases, the only bus traffic required for accessing a critical section is by the accessing processor when it acquires the interlock variable and when it releases the interlock variable.
However, utilizing caches for obtaining and releasing an interlock variable still exhibits some deficiencies. For example, the main purpose for providing caches is to store data locally to a processor to decrease the frequency of common bus accesses to obtain data from shared memory to support processor executions. To take advantage of what is well known in the art as locality of reference, caches are generally arranged to store data in multiple-word blocks. There are two aspects of locality of reference: temporal and spatial. Temporal locality of reference states that a processor is likely to access the same location again within a short time. Spatial locality of reference states that a processor is more likely to access locations that are close (perhaps adjacent) to each other than locations that are far from each other.
Caches take advantage of the spatial locality of reference in the following manner. When a cache reads from memory a location that is requested by the processor but is not in the cache, it reads in the requested location and several others adjacent to it in the hope that the processor will use the other locations in the near future. However, access patterns of interlock variables do not show spatial locality of reference. In other words, if caches are used to store the interlock variables in their multiple-word blocks, the locations that are read from memory but not specifically requested by the processor are not likely to be used by the processor. For the caching of interlock variables, then, the multiple-word block organization is inefficient because additional bus cycles are needed to read the unwanted locations from memory and because the space occupied by the unwanted locations is wasted. Furthermore, using caches with multiple-word blocks to store the interlock variables could have a negative effect due to what is known in the art as coherency overhead. Coherency is the act of keeping multiple copies of the same data identical in different caches for correct operation. Keeping caches coherent requires using the shared bus to notify the caches of any changes in data that is in more than one cache; this is known as the coherency overhead. Because coherency is maintained on a block basis, that is, there is only one shared bit to indicate whether all of the words in the block are shared or not, a coherency action is needed whenever a processor writes to one location in a block that is shared. Consequently, the probability of requiring a coherency action on a write to a block containing four words is greater than on a write to a block containing only a single word.
Another deficiency in using conventional caches to store interlock variables is that caches are generally configured to perform one read or write at a time. When a cache is being accessed from the common bus, it cannot be accessed by its local processor. As a result, local processing will be interrupted; this is a condition referred to as processor lock-out. This reduces the processing efficiency of the overall processing system. Interlock variables are often shared and therefore contribute to processor lock-out because coherency actions through the common bus are required on every write to a shared variable.
A further deficiency in using caches to store interlock variables relates to the use of hierarchical caches. In the past, this has not been a significant problem because each processor has been provided with just one level of cache. However, multiple-level caches are now being introduced and will most likely become a prominent practice in the future. To simplify cache coherency mechanisms, multiple-level caches use what is known as the "inclusion property." The inclusion property states that the contents of the primary cache, which is the one that is physically closer to the processor and may even be on the same chip as the processor, are a subset of the contents of the secondary cache in a two-level cache system. The usual two-level cache organization is to connect only the secondary cache to the shared bus so that the primary cache can be shielded from the coherency activities that occur on the shared bus. Since the secondary cache contains everything that is in the primary cache, information necessary to maintain coherency is readily available in the secondary cache, often without disturbing the primary cache.
An efficient way of managing interlock variables in a two-level cache is to store them only in the secondary cache. If the interlock variables are stored in the primary caches, as well as in the secondary cache to follow the inclusion property, any writes to them would likely require coherency actions since they are likely to be shared. The coherency actions would involve updating both the primary and secondary caches, and the updates would have to be atomic with respect to the processors' accesses to either the primary or the secondary cache. The coherency actions would be simpler and more efficient if the interlock variables were kept only in the secondary cache because the primary cache would not need to be modified by the coherency actions. However, storing the interlock variables only in the secondary cache requires a way for the processor to access the secondary cache without accessing the primary cache. This bypass path is an addition to the usual two-level cache organization, in which the processor does not need to access the secondary cache directly; only the primary cache communicates with the secondary cache. In most cases, the processor is designed to be unaware that a secondary cache even exists, so that the use of a secondary cache can be made optional.
The present invention provides solutions to all of the aforementioned deficiencies of prior art processing systems for acquiring and releasing interlock variables. Instead of storing the interlock variables in caches primarily arranged for storing data, the processing system of the present invention provides each processor with a local buffer dedicated to storing interlock variables and related control bits. As a result, the dedicated buffers may be tailored for their intended use to increase the storage efficiency for storing interlock variables. As will be seen hereinafter, the dedicated buffers of the processing system and the method of using the same in accordance with the present invention also significantly reduce the probability of processor lock-out, may be readily employed in hierarchical cache systems without unduly adding complexity, and even further reduce traffic on the common bus because, when accessed over the common bus for the acquisition or release of an interlock variable, only the interlock variable and its control bits need be carried on the common bus, as compared to a multiple-word block.