Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both microprocessors--the "brains" of a computer--and the memory that stores the information processed by a computer.
In general, a microprocessor operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a "memory address space," representing the addressable range of memory addresses that can be accessed by a microprocessor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the microprocessor when executing the computer program. The speed of microprocessors, however, has increased relative to that of memory devices to the extent that retrieving instructions and data from a memory can often become a significant bottleneck on performance. To decrease this bottleneck, it is desirable to use the fastest memory devices available. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity with cost.
A predominant manner of obtaining such a balance is to use multiple "levels" of memories in a memory architecture to attempt to decrease costs with minimal impact on system performance. Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory devices (DRAMs) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory devices (SRAMs) or the like. In some instances, instructions and data are stored in separate instruction and data cache memories to permit instructions and data to be accessed in parallel. One or more memory controllers are then used to swap the information from segments of memory addresses, often known as "cache lines," between the various memory levels to attempt to maximize the frequency with which requested memory addresses are found in the fastest cache memory accessible by the microprocessor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a "cache miss" occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant degradation in performance.
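The relationship between memory addresses, cache lines and cache misses described above can be illustrated with a minimal sketch. The direct-mapped organization, the 64-byte line size, the four-line capacity and the name `DirectMappedCache` are all assumptions chosen for illustration; none of them is specified by the text above.

```python
LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_LINES = 4    # number of lines in the toy cache (assumed)


class DirectMappedCache:
    """Toy direct-mapped cache that tracks only which line occupies each slot."""

    def __init__(self, num_lines=NUM_LINES, line_size=LINE_SIZE):
        self.num_lines = num_lines
        self.line_size = line_size
        self.slots = [None] * num_lines  # line number currently held by each slot
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        line = address // self.line_size  # cache line containing the address
        slot = line % self.num_lines      # the single slot that line may occupy
        if self.slots[slot] == line:
            self.hits += 1
            return True
        self.misses += 1
        self.slots[slot] = line           # models the fetch from a slower level
        return False


cache = DirectMappedCache()
results = [cache.access(a) for a in (0, 8, 64, 0, 256, 0)]
print(results, cache.hits, cache.misses)
# [False, True, False, True, False, False] 2 4
```

Addresses 0 and 8 fall in the same cache line, so the second access hits. Address 256 maps to the same slot as address 0 and displaces it, so the final access to address 0 misses again and would have to be serviced by a lower level memory.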
Another manner of increasing computer performance is to use multiple microprocessors operating in parallel with one another to perform different tasks at the same time. Often, the multiple microprocessors share at least a portion of the same memory system to permit the microprocessors to work together to perform more complex tasks. The multiple microprocessors are typically coupled to one another and to the shared memory by a system bus or similar interconnection network. By sharing the same memory system, however, a concern arises as to maintaining "coherence" between the various memory sources in the shared memory system.
For example, in a typical multi-processor environment, each microprocessor may have one or more dedicated cache memories that are accessible only by that microprocessor, e.g., a level one (L1) data and/or instruction cache, a level two (L2) cache, and/or one or more buffers such as a line fill buffer and/or a transition buffer. Moreover, more than one microprocessor may share certain caches and other memories as well. As a result, any given memory address may be stored from time to time in any number of memory sources in the shared memory system.
Coherence is typically maintained via a central directory or via a distributed mechanism known as "snooping", whereby each memory source maintains local state information about what data is stored in the source and provides such state information to other sources so that the location of valid data in the shared memory system can be ascertained. With either scheme, data may need to be copied into and/or out of different memory sources to maintain coherence, e.g., based upon whether a copy of the data has been modified locally within a particular memory source and/or whether a requester intends to modify the data once the requester has access to the data. Any time data is copied into or out of a particular memory source, however, the memory source is temporarily unavailable and the latency associated with accessing data stored in the source is increased.
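A snoop-based read of the kind described above might be sketched as follows. The two-state scheme ("shared" and "modified"), the write-back on a modified hit, and the names `SnoopingCache` and `read` are hypothetical simplifications introduced for illustration; the text above does not specify a particular protocol.

```python
class SnoopingCache:
    """Memory source that keeps local state for each address it holds."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> (state, value); state is "shared" or "modified"

    def snoop(self, address):
        """Report the local state for an address, or None if it is not held."""
        entry = self.lines.get(address)
        return entry[0] if entry else None


def read(requester, others, memory, address):
    """Service a read by snooping the other sources, then falling back to memory."""
    if address in requester.lines:
        return requester.lines[address][1], requester.name
    for source in others:
        state = source.snoop(address)
        if state is not None:
            value = source.lines[address][1]
            if state == "modified":
                memory[address] = value               # write the modified copy back
            source.lines[address] = ("shared", value)    # supplier keeps a shared copy
            requester.lines[address] = ("shared", value)  # data copied into requester
            return value, source.name
    value = memory[address]
    requester.lines[address] = ("shared", value)
    return value, "memory"


memory = {0x100: 1}                    # stale value in main memory
a, b = SnoopingCache("A"), SnoopingCache("B")
b.lines[0x100] = ("modified", 7)       # B holds the only valid copy
value, supplier = read(a, [b], memory, 0x100)
print(value, supplier, memory[0x100])  # 7 B 7
```

In this sketch a single read touches three memory sources: source B is snooped and downgraded, main memory receives the written-back data, and source A is filled. Each of those transfers is the kind of activity that makes a source temporarily unavailable.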
As a result, it is often desirable for performance reasons to minimize the volume of data transfers, and thus the bandwidth consumed, between memory sources in a shared memory system. Minimizing data transfers with a particular memory source increases its availability, and thus reduces the latency of accessing the source.
Many shared memory systems also support the concept of "inclusion", where copies of cached memory addresses in higher levels of memory are also cached in associated caches in lower levels of memory. For example, in the multi-processor environment described above, all memory addresses cached in the L1 cache for a microprocessor are also typically cached in the L2 cache for the same microprocessor, as well as within any shared caches that service the microprocessor. Consequently, whenever a processor requests data stored in the shared memory system, the data is typically written into each level of cache that services the processor.
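The effect of inclusion on fills can be shown with a minimal sketch in which a requested line is written into every level that services a processor. The level names, the unbounded capacity of each level, and the class name `InclusiveHierarchy` are assumptions made to keep the example short; eviction is deliberately not modeled.

```python
class InclusiveHierarchy:
    """Toy cache stack, ordered from the highest level (L1) downward."""

    def __init__(self, level_names=("L1", "L2", "shared")):
        self.levels = {name: set() for name in level_names}
        self.writes = {name: 0 for name in level_names}

    def request(self, line):
        """Fill the requested line into every level that does not yet hold it."""
        for name, contents in self.levels.items():
            if line not in contents:
                contents.add(line)
                self.writes[name] += 1

    def is_inclusive(self):
        """Check that each level's contents are a subset of the next lower level."""
        names = list(self.levels)
        return all(self.levels[hi] <= self.levels[lo]
                   for hi, lo in zip(names, names[1:]))


h = InclusiveHierarchy()
for line in (0x40, 0x80, 0x40):
    h.request(line)
print(h.writes, h.is_inclusive())
# {'L1': 2, 'L2': 2, 'shared': 2} True
```

Two distinct lines cost six writes in total, one per level per line, which is the additional write traffic that inclusion places on each memory source.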
Inclusion is beneficial in that the number of snoops to higher level caches can often be reduced, given that a lower level cache includes directory entries for any associated higher level caches. However, having to write data into multiple memory sources occupies additional bandwidth in each memory source, which further increases memory access latency and decreases performance. Furthermore, storing multiple copies of data in multiple memory sources such as caches reduces the effective storage capacity of each memory source. With a reduced storage capacity, hit rates decrease, thus further reducing the overall performance of a shared memory system. Moreover, particularly with a snoop-based coherence mechanism, as the number of memory sources that contain a copy of the same data increases, the amount of bandwidth occupied by checking and updating state information and maintaining coherence increases as well.
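The capacity cost of duplicated copies can be made concrete with a small idealized calculation. The cache sizes below (a 32 KiB L1 and a 256 KiB L2) and the function name `effective_capacity` are illustrative assumptions, and the comparison case with no duplicated lines is a hypothetical arrangement used only as a baseline.

```python
KIB = 1024


def effective_capacity(level_sizes, inclusive):
    """Distinct bytes a private cache stack can hold, in an idealized model.

    With inclusion, every line in a smaller level is duplicated in the largest
    level, so only the largest level contributes distinct data. With no
    duplication (hypothetical baseline), the sizes simply add.
    """
    return max(level_sizes) if inclusive else sum(level_sizes)


sizes = [32 * KIB, 256 * KIB]  # assumed L1 and L2 sizes
with_inclusion = effective_capacity(sizes, inclusive=True)
without_duplicates = effective_capacity(sizes, inclusive=False)
print(with_inclusion // KIB, without_duplicates // KIB)  # 256 288
```

Under these assumptions the 32 KiB of L1 storage holds nothing that is not also in L2, so the stack stores 256 KiB of distinct data rather than 288 KiB.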
Therefore, a significant need continues to exist for a manner of increasing the performance of a shared memory system, particularly to reduce the bandwidth associated with each memory source and thereby decrease memory access latency throughout the system.