Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both processing units—the “brains” of a computing system—and the memory that stores the data processed by a computing system.
In general, a processing unit is a microprocessor or other integrated circuit that operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a “memory address space,” representing an addressable range of memory regions that can be accessed by a microprocessor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the microprocessor when executing the computer program. The speeds of microprocessors, however, have increased relative to those of memory devices to the extent that retrieving instructions and data from memory often becomes a significant bottleneck on the performance of the microprocessor as well as the computing system. To reduce this bottleneck, it is often desirable to use the fastest memory devices available. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity against cost.
A predominant manner of obtaining such a balance is to use multiple “levels” of memories (e.g., multiple levels and possibly multiple types of memory) in a memory architecture to attempt to decrease costs with minimal impact on performance. Often, a computing system relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory (DRAM) devices or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory (SRAM) devices or the like. Information from segments of the memory regions, often known as “cache lines” of the memory regions, is often transferred between the various memory levels in an attempt to maximize the frequency with which requested cache lines are stored in the fastest cache memory accessible by the microprocessor. Whenever a memory request attempts to access a cache line, or entire memory region, that is not cached in a cache memory, a “cache miss” typically occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance penalty. Conversely, whenever a memory request attempts to access a cache line, or entire memory region, that is cached in a cache memory, a “cache hit” typically occurs and the cache line or memory region is supplied to the requester.
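The hit and miss behavior described above can be illustrated with a minimal sketch. The direct-mapped organization, the 64-byte line size, the four slots, and the name `DirectMappedCache` are all assumptions chosen for illustration; none of them comes from a particular design.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks which cache line occupies each slot."""

    def __init__(self, num_slots=4, line_size=64):
        self.num_slots = num_slots
        self.line_size = line_size
        self.slots = [None] * num_slots  # line number currently held by each slot
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit and False on a cache miss."""
        line = address // self.line_size   # cache line containing the address
        slot = line % self.num_slots       # the only slot that line may occupy
        if self.slots[slot] == line:
            self.hits += 1
            return True
        # Miss: the line would be fetched from a slower level, then installed.
        self.slots[slot] = line
        self.misses += 1
        return False


cache = DirectMappedCache()
# Addresses 0 and 8 share a line; addresses 0 and 256 compete for one slot.
results = [cache.access(a) for a in (0, 8, 64, 0, 256, 0)]
```

In this sketch the final access to address 0 misses even though the line was cached earlier, because the line holding address 256 displaced it from the same slot.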
Cache misses in particular have been found to significantly limit system performance. In some designs, for example, it has been found that over 25% of a microprocessor's time is spent waiting for retrieval of cache lines after a cache miss. Therefore, any mechanism that can reduce the frequency and/or latency of data cache misses can have a significant impact on overall performance.
One conventional approach for reducing the impact of cache misses is to increase the size of the cache to in effect reduce the frequency of misses. However, increasing the size of a cache can add significant cost. Furthermore, oftentimes the size of the cache is limited by the amount of space available on an integrated circuit device. Particularly when the cache is integrated onto the same integrated circuit device as a microprocessor to improve performance, the amount of space available for the cache is significantly restricted.
Another conventional approach includes decreasing the miss rate by increasing the associativity of a cache, and/or using cache indexing to reduce conflicts. Although each approach can reduce the frequency of data cache misses, each approach still incurs an often substantial performance hit whenever cache misses occur.
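As a rough illustration of how associativity can reduce conflict misses, the sketch below counts misses for two caches of equal capacity. The least-recently-used replacement policy, the 64-byte line size, and the helper name `count_misses` are assumptions made for the example.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, line_size=64):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in addresses:
        line = address // line_size
        lines = sets[line % num_sets]
        if line in lines:
            lines.move_to_end(line)       # mark as most recently used
            continue
        misses += 1
        if len(lines) == ways:
            lines.popitem(last=False)     # evict the least recently used line
        lines[line] = True
    return misses


# Two cache lines that map to the same set, accessed alternately.
pattern = [0, 256] * 4
direct_mapped = count_misses(pattern, num_sets=4, ways=1)  # 4 lines of capacity
two_way = count_misses(pattern, num_sets=2, ways=2)        # 4 lines of capacity
```

With one way per set the two lines keep evicting each other, so every access misses. With two ways both lines fit in the same set, and only the first access to each line misses. Each remaining miss still pays the full penalty.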
Yet another conventional approach for reducing the impact of cache misses incorporates various prediction techniques that attempt to predict what data will be requested in the future and prefetch that data into a cache. Thus, when the data is later requested and the prediction was correct, the data is already resident in the cache, and no cache miss occurs.
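A very simple predictor makes the idea concrete. The sketch below assumes a sequential scan, an unbounded cache, and a "next line" prediction rule. The function name `misses_for_scan` is hypothetical, and real prefetchers use considerably more elaborate prediction.

```python
def misses_for_scan(num_lines, prefetch_next):
    """Count demand misses while reading cache lines 0..num_lines-1 in order."""
    cached = set()
    misses = 0
    for line in range(num_lines):
        if line not in cached:
            misses += 1                # demand miss: fetch the line now
            cached.add(line)
        if prefetch_next:
            cached.add(line + 1)       # predicted next request, fetched early
    return misses


without_prefetch = misses_for_scan(8, prefetch_next=False)
with_prefetch = misses_for_scan(8, prefetch_next=True)
```

For this access pattern the prediction is always right, so only the first line misses. A pattern that does not follow the predicted order would see little or no benefit.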
Conventional approaches for reducing the impact of cache misses, however, often introduce performance problems in shared memory computing systems. In a shared memory computing system, a plurality of microprocessors share a common memory, and whenever a microprocessor needs to read or write a particular piece of data in that memory, the microprocessor must retrieve that piece of data into one of its caches. While the data is being accessed by one microprocessor, other microprocessors may also need to access that data, so a coherence protocol is required to ensure that the data is coherent for all of the microprocessors. With some protocols, multiple microprocessors may be permitted to own redundant copies of data in a shared state when none of the microprocessors intends to modify the data, i.e., when every microprocessor only intends to read the data. However, whenever a microprocessor needs to modify a piece of data, most coherence protocols require that the microprocessor obtain the data in an exclusive state, which effectively precludes any other microprocessor from accessing that data until the owning microprocessor releases ownership of the data. Exclusive ownership ensures that any modifications to the data made by the owning microprocessor can be propagated to the rest of the shared memory computing system. Thus, any time two or more microprocessors need to access the same data, one or more of those microprocessors may have to wait for another microprocessor to release that data, thereby decreasing the performance of those stalled microprocessors.
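The shared and exclusive states described above can be sketched as a small state tracker for a single cache line. This is a simplified model under stated assumptions, not any specific protocol. The class `LineDirectory`, its three state names, and the rule that an owner drops its copy when another processor reads the line are illustrative choices.

```python
class LineDirectory:
    """Tracks which processors hold one cache line, and in which state."""

    def __init__(self):
        self.state = "invalid"      # "invalid", "shared", or "exclusive"
        self.holders = set()
        self.copies_given_up = 0    # copies that processors had to surrender

    def read(self, cpu):
        """Obtain a readable copy; many processors may share one."""
        if self.state == "exclusive":
            if self.holders == {cpu}:
                return              # the owner already has the line
            # Simplification: the owner releases the line and drops its copy.
            self.copies_given_up += 1
            self.holders.clear()
        self.state = "shared"
        self.holders.add(cpu)

    def write(self, cpu):
        """Obtain a modifiable copy; every other copy is invalidated."""
        self.copies_given_up += len(self.holders - {cpu})
        self.state = "exclusive"
        self.holders = {cpu}


line = LineDirectory()
line.read("cpu0")
line.read("cpu1")
shared_holders = set(line.holders)     # both processors hold the line
line.write("cpu0")
exclusive_holders = set(line.holders)  # only cpu0 remains
```

Two readers coexist without cost, but the single write forces the other processor to give up its copy. That invalidation is the source of the stalls described above.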
Conventional coherence protocols typically use either a central directory or a snooping protocol, and track the status of data on a cache line by cache line basis. Such protocols require a microprocessor to broadcast a request over a shared memory bus, which results in a lookup being performed either in a central directory or in each individual node in the shared memory system to determine the status of the requested data (e.g., some or all of the data in the cache line), with the requested data ultimately returned to the requesting processor and the status of that cache line being updated to reflect the new ownership status of the cache line.
One difficulty in shared memory computing systems arises from synchronization behaviors that are built on simple synchronization operations, such as lock behavior and atomic update behavior. In general, lock behavior uses a lock variable to guard access to some shared data during a critical section of a program such that other threads and processes cannot access that shared data. When a thread holding the lock variable has completed the critical section, it may issue a release operation to unlock the shared data. Consequently, with a lock behavior, the shared data, once locked, is protected from access by other threads or processes until it is released by a later release operation. Atomic update behavior, on the other hand, typically makes a small, quick change to shared data without the need for a lock variable or a separate release operation. Other mechanisms are often used to ensure that the update to the data is atomic. In general, an atomic update is an operation that is complete once the shared data's value has been modified and that does not require a lock variable to prevent access to that shared data. An atomic update thus appears to a computing system to be a single operation with only two possible outcomes: success or failure.
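The difference between the two behaviors can be sketched with a simulated compare-and-swap primitive. This is a single-threaded model for illustration only: on real hardware the primitive itself would be indivisible, whereas plain Python code gives no such guarantee. The names `Word`, `compare_and_swap`, `try_acquire`, `release`, and `atomic_increment` are hypothetical.

```python
class Word:
    """A shared memory word with a simulated compare-and-swap primitive."""

    def __init__(self, value=0):
        self.value = value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`."""
        if self.value == expected:
            self.value = new
            return True
        return False


def try_acquire(lock):
    """Lock behavior, step 1: set the lock variable (0 = free, 1 = held)."""
    return lock.compare_and_swap(0, 1)


def release(lock):
    """Lock behavior, step 2: a separate, later release operation."""
    lock.value = 0


def atomic_increment(counter):
    """Atomic update behavior: complete as soon as the swap succeeds."""
    while True:
        old = counter.value
        if counter.compare_and_swap(old, old + 1):
            return old + 1


lock = Word()
first = try_acquire(lock)    # succeeds, lock is now held
second = try_acquire(lock)   # fails, lock is still held
release(lock)
third = try_acquire(lock)    # succeeds again after the release

counter = Word(10)
new_value = atomic_increment(counter)
```

Both behaviors are built from the same primitive here, so an observer that sees only the compare-and-swap cannot tell from that one operation whether a later release will follow.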
In conventional shared memory computing systems, lock behavior and atomic update behavior often present problems for coherence protocols, particularly those with migratory data optimizations, due to the inability of a coherence protocol to determine what type of behavior is being implemented by a program executing on a microprocessor, particularly when the same synchronization primitives are used to implement both types of behavior. In particular, migratory data optimizations typically utilize separate migratory and non-migratory modified states to indicate whether data that is owned by one microprocessor in a modified state can be migrated to another microprocessor that needs to access the data. The ability to specify certain data as being non-migratory, in particular, is helpful for lock behavior, since performance would suffer if a cache line in which a lock variable is stored were set by one microprocessor and then migrated to another microprocessor before the lock variable was released by the first microprocessor. In such a situation, the first microprocessor would be required to request a modifiable copy of that cache line again in order to release the lock variable. Given also that the likely reason that the second microprocessor attempted to access the cache line was to try to lock the same data (which would currently be locked by the first microprocessor), the second microprocessor, upon obtaining the cache line, would still need to wait on the first microprocessor to release the lock variable before it could obtain the lock. By specifying a cache line as non-migratory, therefore, the migration of the cache line from the first microprocessor to the second microprocessor and back would be avoided, thereby enabling the first microprocessor to release the lock, and the second microprocessor to obtain the lock, more quickly, and with lower overhead.
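The lock scenario above can be traced with a toy model that counts how often modifiable ownership of one cache line moves between processors. The model makes several assumptions: the second processor's failed lock attempt is treated as a read, a non-migratory read leaves ownership in place, and the function name `ownership_transfers` is hypothetical.

```python
def ownership_transfers(accesses, migratory):
    """Count how often modifiable ownership of one cache line changes hands.

    `accesses` is a list of (cpu, kind) pairs, with kind "read" or "write".
    """
    owner = None
    transfers = 0
    for cpu, kind in accesses:
        if cpu == owner:
            continue                      # the owner already holds the line
        if kind == "write" or migratory:
            if owner is not None:
                transfers += 1            # the line moves between processors
            owner = cpu
        # Non-migratory read: the requester gets a read-only copy and
        # ownership stays where it is.
    return transfers


lock_sequence = [
    ("cpu0", "write"),  # cpu0 sets the lock variable
    ("cpu1", "read"),   # cpu1 checks the lock and finds it held
    ("cpu0", "write"),  # cpu0 releases the lock
    ("cpu1", "write"),  # cpu1 sets the lock variable
]
migratory_moves = ownership_transfers(lock_sequence, migratory=True)
non_migratory_moves = ownership_transfers(lock_sequence, migratory=False)
```

Under these assumptions the migratory policy moves the line three times (to the second processor, back, and over again), while the non-migratory policy moves it once, when the second processor actually obtains the lock.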
In contrast, with atomic updates, migration of data is not as much of a concern, since presumably once an atomic update has been performed by one microprocessor, that microprocessor does not need to access the data further in order to implement the behavior. Consequently, the data associated with an atomic update behavior often can be held in a migratory state. Placing such data in a non-migratory state would needlessly degrade performance.
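One way to see why a migratory state suits atomic updates is to count coherence requests when several processors each read and then write the same cache line in turn. The sketch below is a toy model with assumed rules: a migratory read of a modified line hands over a modifiable copy directly, while a non-migratory read yields a read-only copy that must be upgraded before writing. The function name `coherence_requests` is hypothetical.

```python
def coherence_requests(cpus, migratory):
    """Count requests when each CPU in turn reads, then writes, one line."""
    permission = {cpu: "none" for cpu in cpus}   # "none", "read", or "write"
    requests = 0
    for cpu in cpus:
        # Step 1: read the current value.
        if permission[cpu] == "none":
            requests += 1
            modified_elsewhere = "write" in permission.values()
            for other in cpus:
                if permission[other] == "write":
                    # Migratory: the old owner gives the line away entirely.
                    # Non-migratory: the old owner keeps a read-only copy.
                    permission[other] = "none" if migratory else "read"
            if migratory and modified_elsewhere:
                permission[cpu] = "write"
            else:
                permission[cpu] = "read"
        # Step 2: write the updated value.
        if permission[cpu] != "write":
            requests += 1                        # upgrade to a modifiable copy
            for other in cpus:
                if other != cpu:
                    permission[other] = "none"
            permission[cpu] = "write"
    return requests


updaters = ["cpu0", "cpu1", "cpu2"]
migratory_requests = coherence_requests(updaters, migratory=True)
non_migratory_requests = coherence_requests(updaters, migratory=False)
```

In this model every processor after the first needs one request per update in the migratory case and two in the non-migratory case.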
Because many conventional shared memory computing systems typically use the same synchronization primitives for lock behavior and atomic update behavior, however, it is often difficult to determine whether a cache line should be placed in a migratory or a non-migratory state. Many conventional shared memory computing systems are thus typically configured to be optimized for either lock behavior or atomic update behavior, but not both.
On the other hand, some conventional shared memory computing systems utilize synchronization primitives that are exclusively used for either lock behavior or atomic update behavior, but not both. However, many shared memory computing system applications are configured to operate across multiple types of conventional shared memory computing systems and thus would require recompilation to take advantage of those exclusive synchronization primitives, increasing the cost to produce and operate those applications while tying them to one type of conventional shared memory computing system.
One conventional approach for determining whether to use lock behavior or atomic update behavior is temporal silence. In typical shared memory computing systems, lock variables that lock a cache line are often reverted to their original value when released. Thus, a synchronization primitive to acquire a lock and a synchronization primitive to release a lock often form a temporally silent pair. A first microprocessor may therefore be configured to retain stale copies of cache lines subject to lock behavior until those cache lines are the subject of a synchronization operation of a second microprocessor. However, temporal silence fails to benefit atomic update behavior because there is no lock variable that is set to a value and then subsequently reverted, as in a temporally silent pair.
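The notion of a temporally silent pair can be expressed as a simple check on the values stored to a word. This is only an illustration of the definition, with a hypothetical function name and arbitrary example values; it does not model how hardware would detect the condition.

```python
def is_temporally_silent(value_seen_by_others, stores):
    """Return True if the stores leave the word at the value others last saw."""
    return bool(stores) and stores[-1] == value_seen_by_others


# Lock behavior: acquire stores 1, release stores 0 again.
lock_is_silent = is_temporally_silent(0, [1, 0])

# Atomic update behavior: the counter moves from 10 to 11 and stays there.
update_is_silent = is_temporally_silent(10, [11])
```

The acquire and release stores cancel out, so a stale copy of the lock variable becomes valid again after the release. The atomic update leaves a new value behind, so a stale copy stays stale.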
Another proposed approach for determining whether to use lock behavior or atomic update behavior includes adding extra bits to synchronization operation primitives such that shared memory computing system applications may be configured to label those synchronization primitives appropriately as involving either lock behavior or atomic update behavior. However, this change to the instruction set requires shared memory computing system applications to be recompiled and libraries to be re-written. Moreover, this approach may not be able to be implemented on all shared memory computing architectures, as additional bits are required with each synchronization operation primitive, which may in turn require additional bus lines, command lines, and control registers associated therewith to be configured for those primitives.
Consequently, a need continues to exist for optimizing performance of a shared memory computer system for both lock behavior and atomic update behavior in such a manner that does not require changes to instruction set architectures, is configured to operate with multiple instruction set architectures, and will benefit existing shared memory computing systems.