Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both processing units—the “brains” of a computing system—and the memory that stores the data processed by a computing system.
In general, a processing unit is a microprocessor or other integrated circuit that operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a “memory address space,” representing an addressable range of memory regions that can be accessed by a microprocessor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the microprocessor when executing the computer program. The speed of microprocessors, however, has increased relative to that of memory devices to the extent that retrieving instructions and data from a memory often becomes a significant bottleneck on the performance of the microprocessor as well as the computing system. To decrease this bottleneck, it is often desirable to use the fastest memory devices available. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity with cost.
A predominant manner of obtaining such a balance is to use multiple “levels” of memories in a memory architecture to attempt to decrease costs with minimal impact on performance. Often, a computing system relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory (DRAM) devices or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory (SRAM) devices or the like. Information from segments of the memory regions, often known as “cache lines” of the memory regions, is often transferred between the various memory levels in an attempt to maximize the frequency with which requested cache lines are stored in the fastest cache memory accessible by the microprocessor. Whenever a memory request from a requester attempts to access a cache line, or entire memory region, that is not cached in a cache memory, a “cache miss,” or “miss,” typically occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance penalty. Whenever a memory request from a requester attempts to access a cache line, or entire memory region, that is cached in a cache memory, a “cache hit,” or “hit,” typically occurs and the cache line or memory region is supplied to the requester.
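The hit/miss behavior described above can be illustrated with a minimal sketch. The following is not taken from any particular design; the line size, cache capacity, and least-recently-used replacement policy are illustrative assumptions chosen only to show how accesses to the same cache line hit while accesses to uncached lines miss.

```python
# Minimal sketch of cache-line transfer between memory levels: a small, fast
# cache in front of a larger, slower memory, with data moved in fixed-size
# cache lines. Sizes and the LRU policy are illustrative assumptions.

LINE_SIZE = 64          # bytes per cache line (assumption)
CACHE_LINES = 4         # tiny fully associative cache (assumption)

class Cache:
    def __init__(self):
        self.lines = []          # cached line addresses, most recent last
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line_addr = address // LINE_SIZE      # which cache line holds address
        if line_addr in self.lines:
            self.hits += 1                    # "cache hit": already cached
            self.lines.remove(line_addr)
            self.lines.append(line_addr)      # refresh LRU position
            return "hit"
        self.misses += 1                      # "cache miss": fetch from below
        if len(self.lines) == CACHE_LINES:
            self.lines.pop(0)                 # evict least recently used line
        self.lines.append(line_addr)
        return "miss"

cache = Cache()
results = [cache.access(a) for a in (0, 8, 64, 0, 512, 576, 640, 8)]
print(results)   # repeat accesses to a cached line hit; new lines miss
```

Note that addresses 0 and 8 fall in the same cache line, so the second access to either address hits without another retrieval from the lower level.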
Cache misses in particular have been found to significantly limit system performance. In some designs, for example, it has been found that over 25% of a microprocessor's time is spent waiting for retrieval of cache lines after a cache miss. Therefore, any mechanism that can reduce the frequency and/or latency of cache misses can have a significant impact on overall performance.
One conventional approach for reducing the impact of cache misses is to increase the size of the cache, in effect reducing the frequency of misses. However, increasing the size of a cache can add significant cost. Furthermore, oftentimes the size of the cache is limited by the amount of space available on an integrated circuit device. Particularly when the cache is integrated onto the same integrated circuit device as a microprocessor to improve performance, the amount of space available for the cache is significantly restricted.
Another conventional approach includes decreasing the miss rate by increasing the associativity of a cache, and/or using cache indexing to reduce conflicts. While each approach can reduce the frequency of data cache misses, each approach still incurs an often substantial performance penalty whenever a cache miss does occur.
Yet another conventional approach for reducing the impact of cache misses incorporates various prediction techniques to attempt to predict what data will be returned in response to a cache miss prior to actual receipt of such data.
However, conventional approaches for reducing the impact of cache misses often introduce additional problems to shared memory computing systems. Generally, shared memory computing systems include a plurality of microprocessors that share a common memory. Microprocessors are permitted to obtain exclusive or shared ownership of a cache line, with the former usually required whenever a microprocessor needs to modify data stored in the cache line, and the latter being permitted whenever multiple microprocessors merely need to read the data in the cache line. A coherence protocol, typically using either a central directory or a snooping protocol, is used to coordinate the retrieval of a cache line by a microprocessor, such that a requesting microprocessor always receives a current copy of the data in a cache line. A coherence protocol often requires a microprocessor to broadcast a request over a shared memory bus, which results in a lookup being performed either in a central directory or in each individual node in the shared memory system to determine the status of the requested cache line, with the requested cache line ultimately returned to the requesting processor and the status of that cache line being updated to reflect the new ownership status of the cache line. Given that a memory bus is a limited resource, the broadcast of memory requests over the memory bus can result in decreased performance, so it is desirable whenever possible to minimize the number of memory requests that are broadcast over a shared memory bus.
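The ownership rules described above can be sketched in simplified form. The sketch below is a hypothetical model, not any particular coherence protocol: it uses a central directory with three per-processor states (invalid, shared, exclusive) and counts each request as one broadcast-style lookup. All names and states are assumptions made for illustration.

```python
# Hypothetical sketch of shared vs. exclusive ownership of a cache line:
# reads allow multiple sharers; a write invalidates all other copies so the
# writer holds the line exclusively. States and names are assumptions.

INVALID, SHARED, EXCLUSIVE = "I", "S", "E"

class CoherenceDirectory:
    def __init__(self, num_cpus):
        self.state = {}            # line address -> per-CPU state list
        self.num_cpus = num_cpus
        self.broadcasts = 0        # one lookup per memory request

    def _states(self, line):
        return self.state.setdefault(line, [INVALID] * self.num_cpus)

    def read(self, cpu, line):
        st = self._states(line)
        self.broadcasts += 1
        for i, s in enumerate(st):       # demote any exclusive owner
            if s == EXCLUSIVE:
                st[i] = SHARED
        st[cpu] = SHARED                 # multiple readers may share the line

    def write(self, cpu, line):
        st = self._states(line)
        self.broadcasts += 1
        for i in range(self.num_cpus):   # invalidate all other copies
            st[i] = INVALID
        st[cpu] = EXCLUSIVE              # writer requires exclusive ownership

d = CoherenceDirectory(num_cpus=3)
d.read(0, line=0x100)
d.read(1, line=0x100)       # CPUs 0 and 1 both hold the line SHARED
d.write(2, line=0x100)      # CPU 2's write invalidates the other copies
print(d.state[0x100])       # CPU 2 now holds the line exclusively
```

Each of the three requests here consumes one lookup, illustrating why minimizing broadcasts over a shared memory bus matters when that bus is a limited resource.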
One difficulty encountered in shared memory computing systems occurs when multiple microprocessors attempt to access the same cache line at the same time. In some systems, microprocessors are forced to compete for the same cache line, often resulting in inefficiencies as the cache line is shuttled back and forth between caches, memory levels, and microprocessors of the shared memory computing system, often without having time to be processed or updated. Moreover, conventional approaches for sharing and prefetching data typically introduce additional inter-node communications. For example, microprocessors processing one cache line often request another cache line from the same memory region. As such, a microprocessor is typically forced to broadcast a first memory request for a first cache line of the memory region, a second memory request for a second cache line of the memory region, and so on. Thus, the microprocessors of the shared memory computing system are generally forced to respond to the communications unnecessarily, as each memory request must be processed to determine if the requested data is present in those nodes, and if so, a response must be generated. Therefore, any mechanism configured to share memory regions and reduce the frequency and/or severity of competition between the microprocessors can have a significant impact on overall performance. Moreover, any mechanism configured to reduce the frequency of communications between the microprocessors can also have a significant impact on overall performance.
Still another conventional approach for reducing the impact of microprocessor communications involves coarse-grain coherence tracking, which monitors the coherence of memory regions and uses that information to optimize the routing of data requests and avoid unnecessary broadcasts. With coarse-grain coherence tracking, the status of cache lines is tracked with a coarser granularity, e.g., on a region-by-region basis, where each region contains multiple cache lines. By doing so, information about the access characteristics of multiple cache lines within the same region can be used to make more intelligent prefetching decisions and otherwise reduce memory request latency. In particular, it has been found that coarse-grain coherence tracking eliminates about 55% to about 97% of unnecessary broadcasts for cache lines, and thus improves performance by about 8%. Specifically, coarse-grain coherence tracking uses a region coherence array to track which memory regions are cached and prevent unnecessary subsequent broadcasts for cache lines from a memory region.
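The region coherence array described above can be sketched as a simple filter in front of the broadcast path. This is an illustrative model, not the tracked structure of any particular design: the region size, the per-region shared flag, and the callback standing in for the actual coherence lookup are all assumptions.

```python
# Illustrative sketch of coarse-grain coherence tracking: a region coherence
# array records whether any other node may cache lines from a region, so a
# request for a line in a region known to be unshared can skip the broadcast.
# Region size and data structures are assumptions.

REGION_SIZE = 1024   # bytes per region; each region spans many cache lines

class RegionCoherenceArray:
    def __init__(self):
        self.region_shared = {}   # region -> True if another node may cache it
        self.broadcasts = 0
        self.filtered = 0         # broadcasts avoided via region tracking

    def request_line(self, address, other_nodes_cache_region):
        region = address // REGION_SIZE
        if self.region_shared.get(region) is False:
            self.filtered += 1    # region known unshared: no broadcast needed
            return "no-broadcast"
        self.broadcasts += 1      # unknown or shared region: must broadcast
        self.region_shared[region] = other_nodes_cache_region(region)
        return "broadcast"

rca = RegionCoherenceArray()
nobody_else = lambda region: False   # assume no other node caches anything
for addr in (0, 64, 128, 2048, 2112):
    rca.request_line(addr, nobody_else)
print(rca.broadcasts, rca.filtered)  # only the first request per region broadcasts
```

After the first broadcast for a region establishes that no other node caches it, subsequent requests for other cache lines in that same region are satisfied without a broadcast, which is the source of the broadcast reductions noted above.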
One more conventional approach for reducing the impact of microprocessor communications incorporates stealth prefetching into coarse-grain coherence tracking to identify non-shared memory regions and aggressively prefetch cache lines from those memory regions. In particular, stealth prefetching often does not broadcast a memory request to prefetch cache lines from non-shared memory regions, thus preventing unnecessary broadcasts for cache lines from a non-shared memory region. However, stealth prefetching is limited to prefetching non-shared data and typically does not prefetch a memory region when cache lines of that memory region are shared by more than one microprocessor. Thus, conventional approaches for reducing the impact of cache misses, reducing the impact of microprocessor competition, and reducing the impact of microprocessor communications often still introduce problems in shared memory computing systems.
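The stealth-prefetching policy and its limitation can be sketched as a single decision rule. The function below is a hedged illustration, not the actual mechanism: the line and region sizes and the `region_is_shared` predicate standing in for the coarse-grain tracking lookup are assumptions.

```python
# Hedged sketch of stealth prefetching: prefetch the remaining cache lines of
# a region only when the region is known to be non-shared, so no broadcast is
# required. If the region is shared, nothing is prefetched, which is the
# limitation noted above. Sizes and names are assumptions.

LINE_SIZE, REGION_SIZE = 64, 256

def stealth_prefetch_candidates(address, region_is_shared):
    """Return line addresses to prefetch, or [] if the region is shared."""
    region_base = (address // REGION_SIZE) * REGION_SIZE
    if region_is_shared(region_base):
        return []                          # shared region: no stealth prefetch
    demanded = (address // LINE_SIZE) * LINE_SIZE
    return [line
            for line in range(region_base, region_base + REGION_SIZE, LINE_SIZE)
            if line != demanded]           # prefetch the region's other lines

private = lambda base: False   # no other microprocessor caches the region
shared = lambda base: True     # some other microprocessor caches the region
print(stealth_prefetch_candidates(0x140, private))  # sibling lines prefetched
print(stealth_prefetch_candidates(0x140, shared))   # nothing prefetched
```

The empty result for the shared region makes the limitation concrete: the moment cache lines of a region are shared by more than one microprocessor, this policy forgoes prefetching entirely.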
Consequently, there is a need in the art for reducing the impact of cache misses, reducing the impact of microprocessor competition, and improving microprocessor communications in a shared memory computing system.