1. Field of the Invention
This invention is related to computer systems and, more particularly, to maintaining cache coherence in computer systems.
2. Description of the Related Art
Computer systems have generally implemented one or more levels of cache to reduce memory latency. The caches are smaller, higher speed memories than the memory in the main memory system. Typically, caches store recently-used data. For example, caches are often implemented for processor access, and store data recently read/written by the processors in the computer systems. Caches are sometimes implemented for other high speed devices in the computer system as well. In addition to storing recently-used data, caches can be used to store prefetched data that is expected to be used by the processor (or other device).
Caches store copies of data that is also stored in main memory. In multiprocessor systems, and even in single processor systems in which other devices access main memory but do not access a given cache, the issue of cache coherence arises. That is, a given data producer can write a copy of data in the cache, and the update to main memory's copy is delayed. In write-through caches, a write operation is dispatched to memory in response to the write to the cache line, but the write reaches memory only after some delay. In the more common writeback cache, writes are made in the cache and not reflected in memory until the updated cache block is replaced in the cache (and written back to main memory by the cache). Writeback caches generally reduce the memory bandwidth consumed by writes, and thus are more popular.
Because the updates have not been made to main memory at the time the updates are made in cache, a given data consumer can read the copy of data in main memory and obtain “stale” data (data that has not yet been updated). Additionally, if multiple data producers are writing the same memory locations, different data consumers could observe the writes in different orders.
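The stale-data problem described above can be illustrated with a minimal sketch. The class name `WriteBackCache`, its methods, and the address and data values are hypothetical and chosen only for illustration; the model tracks one value per address and ignores block size, replacement policy, and timing.

```python
class WriteBackCache:
    """Toy writeback cache (hypothetical model, one value per address)."""

    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for main memory
        self.lines = {}       # address -> (value, dirty flag)

    def write(self, addr, value):
        # Only the cached copy is updated; main memory is left unchanged.
        self.lines[addr] = (value, True)

    def read(self, addr):
        # Fill from main memory on a miss, then return the cached copy.
        if addr not in self.lines:
            self.lines[addr] = (self.memory[addr], False)
        return self.lines[addr][0]

    def evict(self, addr):
        # Replacement: a dirty copy is written back to main memory.
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value


memory = {0x100: 1}
producer = WriteBackCache(memory)
producer.write(0x100, 2)

# A consumer that reads main memory directly observes the old value.
stale = memory[0x100]

# Only after the block is replaced does main memory reflect the write.
producer.evict(0x100)
fresh = memory[0x100]
```

In this sketch, `stale` holds the value from before the producer's write, and `fresh` holds the updated value once the writeback has occurred.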
Cache coherence solves these problems by ensuring that various copies of the same data (from the same memory location) can be maintained while avoiding “stale data”, and by establishing a “global” order of reads/writes to the memory locations by different producers/consumers. If a read follows a write in the global order, the data read reflects the write.
Cache coherence schemes create an overhead on memory read/write operations. Typically, caches will track a state of their copies according to the coherence scheme. For example, the popular Modified, Exclusive, Shared, Invalid (MESI) scheme includes a modified state (the copy is modified with respect to main memory and other copies); an exclusive state (the copy is the only copy other than main memory); a shared state (there may be one or more other copies besides the main memory copy); and an invalid state (the copy is not valid). The MOESI scheme adds an Owned state in which the cache is responsible for providing the data for a request (either by writing back to main memory before the data is provided to the requester, or by directly providing the data to the requester), but there may be other copies in other caches. Thus, the overhead of the cache coherence scheme includes communications among the caches to maintain/update the coherence state. These communications can increase the latency of the memory read/write operations.
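The four MESI states can be sketched as a small state machine. The transition rules below are a simplified, commonly taught version of MESI and are an assumption for illustration; the text above defines only the states, not the transitions, and real implementations differ in detail. The function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_local_read(state, others_have_copy):
    # A miss fills the line; it is exclusive only if no other cache has it.
    if state is State.INVALID:
        return State.SHARED if others_have_copy else State.EXCLUSIVE
    return state  # read hits leave the state unchanged


def on_local_write(state):
    # After a local write the copy differs from main memory. Leaving
    # SHARED or INVALID also requires invalidating other copies, which
    # is the inter-cache communication counted as overhead.
    return State.MODIFIED


def on_remote_read(state):
    # Another cache reads the block: a sole copy becomes one of several.
    # (A MODIFIED copy would also supply or write back its data here.)
    if state in (State.MODIFIED, State.EXCLUSIVE):
        return State.SHARED
    return state


def on_remote_write(state):
    # Another cache writes the block: the local copy is no longer valid.
    return State.INVALID
```

For example, a line filled by a read miss with no other sharers starts in `State.EXCLUSIVE`, moves to `State.MODIFIED` on a local write, and drops to `State.SHARED` when another cache reads the same block.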
The overhead is dependent on the structure of the computer system. More specifically, the overhead depends on the form of interconnect between the various caches and data producers/consumers. In a shared bus system, snooping is often implemented to maintain coherence. A given memory request transmitted on the bus is captured by other caches, which check if a copy of the requested data is stored in the cache. The caches can update the state of their copies (and provide data, if the cache has the most up to date copy).
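A minimal sketch of snooping on a shared bus follows. The classes `SnoopingCache` and `SharedBus`, the write-invalidate policy, and the choice to write dirty data back to memory on a snooped read are assumptions made for illustration; they are one simple way to realize the behavior described above, not a specific implementation.

```python
class SnoopingCache:
    """Toy cache that observes bus requests (hypothetical model)."""

    def __init__(self):
        self.lines = {}  # address -> [value, dirty flag]

    def snoop_read(self, addr):
        # Supply data only if this cache holds the most up to date copy.
        line = self.lines.get(addr)
        if line is not None and line[1]:
            line[1] = False  # the bus writes the data back, so it is clean
            return line[0]
        return None

    def snoop_write(self, addr):
        # Another cache is writing the block: drop the local copy.
        self.lines.pop(addr, None)


class SharedBus:
    """Every request is visible to all attached caches."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)
        return cache

    def read(self, requester, addr):
        if addr in requester.lines:
            return requester.lines[addr][0]  # local hit, no bus traffic
        for cache in self.caches:
            if cache is not requester:
                data = cache.snoop_read(addr)
                if data is not None:
                    self.memory[addr] = data  # dirty copy written back
        value = self.memory[addr]
        requester.lines[addr] = [value, False]
        return value

    def write(self, requester, addr, value):
        for cache in self.caches:
            if cache is not requester:
                cache.snoop_write(addr)
        requester.lines[addr] = [value, True]


memory = {0x40: 10}
bus = SharedBus(memory)
cache_a = bus.attach(SnoopingCache())
cache_b = bus.attach(SnoopingCache())

bus.write(cache_a, 0x40, 99)       # main memory still holds the old value
seen_by_b = bus.read(cache_b, 0x40)  # cache_a supplies its dirty copy
bus.write(cache_b, 0x40, 7)        # cache_a's copy is invalidated
```

Because every request passes over the same bus, each cache sees every other cache's reads and writes, which is what allows the state updates to happen without a separate coordination point.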
As the number of processors included in computer systems has grown, point to point interconnects have become more common. Point to point interconnects can typically be operated at higher frequencies than shared buses, since the electrical load on a given line is lower and often the line lengths are shorter. The aggregate bandwidth of the interconnect is generally higher. However, latency typically increases as well since communications may need to be routed through one or more intermediate nodes from source to destination. Additionally, since there is no common point of communication (like the shared bus), other mechanisms for providing a coherence point are implemented. For example, a home node is often assigned to each address in the memory address range, and coherence of the data corresponding to the address is coordinated by the home node. Typically, the home node also includes the memory locations that form the main memory for the addresses assigned to the home node. Communications with the home node can further increase the latency.
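The home node assignment and the added routing latency can be sketched as follows. The block-interleaved mapping, the ring topology, and the constants `NUM_NODES` and `BLOCK_BYTES` are assumptions chosen for illustration; the text above does not specify how addresses are assigned to home nodes or how the nodes are connected.

```python
NUM_NODES = 4      # hypothetical number of nodes in the system
BLOCK_BYTES = 64   # hypothetical cache block size in bytes


def home_node(addr):
    # Assumed mapping: consecutive blocks are interleaved across nodes.
    return (addr // BLOCK_BYTES) % NUM_NODES


def ring_hops(src, dst):
    # Assumed topology: a bidirectional ring, so a message takes the
    # shorter of the two directions around the ring.
    distance = abs(src - dst)
    return min(distance, NUM_NODES - distance)


def hops_to_home(requester, addr):
    # Number of links a request crosses to reach the coherence point.
    return ring_hops(requester, home_node(addr))
```

Under these assumptions, a request from node 0 for address 0x80 is coordinated by node 2 and crosses two links to reach it, whereas a request for address 0x0 is handled locally with no hops.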