1. Field of the Invention
The present invention relates to handling read and write operations and associated hazards within a memory cache.
2. Background of the Related Art
A memory cache (alternately referred to as a “cache”) is a computer system component for temporarily storing a selected portion of instructions and/or data from a primary storage device, such as main memory (RAM) or a hard disk drive. For example, nearly every modern microprocessor employs a Level 1 (L1) and Level 2 (L2) cache for storing data and instructions from main memory for access by the processor. A memory cache has less storage capacity than, but can be accessed more quickly than, the storage device whose contents are being cached. A memory cache is therefore used to store the portion of data and instructions from the storage device that is most likely to be accessed, such as the most relevant or most frequently accessed data, to reduce the amount of time spent accessing main memory. L1 cache can be built directly into the processor and can run at the same speed as the processor, providing the fastest possible access time. L2 cache is also used to store a portion of main memory and may be included within a chip package, but is usually separate from the processor. L2 cache is slower than L1 cache, but typically has a greater storage capacity than L1 cache and is still much faster than main memory.
L1 cache typically includes an instruction cache and a data cache. An L1 instruction cache contains a copy of a portion of the instructions in main memory. An L1 data cache contains a copy of a portion of the data in main memory, but some designs allow the data cache to contain a version of the data that is newer than the data in main memory. This is referred to as a store-in or write-back cache because the newest copy of the data is stored in the data cache and because it must be written back out to memory when that cache location is needed to hold a different piece of data. An L2 cache typically contains both instructions and data.
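The store-in (write-back) behavior described above may be sketched as follows. This is an illustrative model only, assuming a hypothetical one-line cache; the class and function names are not part of any actual design. The key point is that a write updates only the cached copy, and the stale memory copy is corrected only when the dirty line is evicted:

```python
# Minimal sketch of store-in (write-back) behavior using a
# hypothetical one-line cache; all names are illustrative.

class WriteBackLine:
    def __init__(self):
        self.tag = None      # memory address currently cached
        self.data = None     # cached copy (may be newer than memory)
        self.dirty = False   # True once the cached copy is modified

def access(line, memory, addr, write_value=None):
    """Read or write addr through a one-line write-back cache."""
    if line.tag != addr:
        # Evicting a dirty line: the newest copy lives in the cache,
        # so it must be written back to memory first.
        if line.dirty and line.tag is not None:
            memory[line.tag] = line.data
        line.tag, line.data, line.dirty = addr, memory[addr], False
    if write_value is not None:
        line.data = write_value   # memory copy is now stale
        line.dirty = True
    return line.data

memory = {0x100: 1, 0x200: 2}
line = WriteBackLine()
access(line, memory, 0x100, write_value=42)  # write hits the cache only
assert memory[0x100] == 1                    # memory copy still stale
access(line, memory, 0x200)                  # eviction forces write-back
assert memory[0x100] == 42                   # memory now holds newest copy
```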
Some systems having multiple processors (or processor cores) include a separate L1 cache for each processor, but share a common L2 cache. This is referred to as a shared L2 cache. Because such an L2 cache may have to handle several read and/or write operations simultaneously from multiple processors, and even from multiple threads within the same physical processor, a shared L2 cache is usually more complex than an L2 cache dedicated to a single processor. A shared L2 cache typically has a number of Read-claim (RC) machines to handle the read/write operations that originate from the multiple processors and threads. The RC machines are often responsible for such tasks as searching the L2 cache, returning data/instructions for the sought-after address, updating the L2 cache, and requesting data from memory or from the next level of cache if the sought-after address does not exist in the L2 cache.
The memory cache used with main memory may be mapped to the main memory in a variety of ways. Examples of cache mapping known in the art include direct-mapped cache, fully associative cache, and N-way set-associative cache. Direct mapping involves dividing main memory according to the number of cache lines provided, so that each division of main memory shares a particular cache line. At the other end of the spectrum, a fully associative cache allows any cache line to store the contents of any memory location in main memory. N-way set-associative cache involves a compromise between direct mapping and fully associative mapping, wherein the cache is divided into multiple “sets,” each set containing some number of cache lines (alternately referred to as “ways”). Typically, set-associative cache structures contain 2, 4, or 8 ways per set. Each memory address is placed into one and only one set, but can be held in any one of the ways within that set. The collection of memory addresses that can be placed into any one set is called a congruence class.
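The set-selection arithmetic behind a congruence class can be sketched as follows. The line size and set count here are assumptions chosen for illustration (64-byte lines and 1024 sets), not parameters taken from any particular design:

```python
# Sketch of N-way set-associative address mapping; the geometry
# below (64-byte lines, 1024 sets) is assumed for illustration.
LINE_SIZE = 64
NUM_SETS = 1024

def congruence_class(addr):
    """All addresses with the same set index form one congruence class."""
    return (addr // LINE_SIZE) % NUM_SETS

# Two addresses exactly NUM_SETS lines apart map to the same set,
# so they compete for the same N ways within that set.
a = 0x0001_0000
b = a + LINE_SIZE * NUM_SETS
assert congruence_class(a) == congruence_class(b)
assert congruence_class(a) != congruence_class(a + LINE_SIZE)
```

Direct mapping corresponds to one way per set, and a fully associative cache corresponds to a single set containing every way.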
A “hazard” occurs when two different RC machines seek to make potentially conflicting changes to a cache. When a hazard is detected, the RC machine seeking to perform a read or write request must wait for the conflicting RC machine to complete the read or write request it is currently performing, to prevent errors. Hazards include architectural hazards caused by architectural constraints and design implementation hazards caused by design constraints. For example, one architectural constraint may be a read-after-write requirement, which leads to an architectural hazard because an RC machine handling the read request must wait for the RC machine handling the write request. One example of a design constraint is congruence class matching, which is intended to ensure that at most one RC machine will be active at a time for each congruence class. This constraint simplifies, for example, the tracking of which cache way (within the set assigned to that congruence class) is to be replaced when a new line is brought in from memory, because the Least Recently Used (LRU) array that holds this information can be read once with the assurance that it will not be updated by another RC machine in midstream. A design hazard thus results when two RC machines seek to operate on the same congruence class simultaneously. For a particular architecture and L2 cache implementation, any number of different hazards may be possible and multiple hazards may exist simultaneously. When a hazard is detected between a read or write operation currently being performed and a requested read or write operation, the requested read or write operation is rejected and the RC machine handling the requested read or write operation must re-request the read or write operation at a later time.
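The congruence class matching constraint can be sketched as a dispatch check. This is a hypothetical illustration of the rejection decision only, not an actual implementation; the function and variable names are assumptions:

```python
# Illustrative sketch of the congruence-class-matching design
# constraint: a new request is rejected when any active RC machine
# is already working on the same congruence class.

def dispatch(request_set, active_rcs):
    """Return the ID of a conflicting RC machine, or None to accept.

    active_rcs maps each busy RC machine's ID to the congruence
    class (set index) it is currently operating on.
    """
    for rc_id, busy_set in active_rcs.items():
        if busy_set == request_set:
            return rc_id   # hazard: the requester must retry later
    return None            # no conflict: at most one RC per class holds

active = {0: 17, 1: 5}            # RC 0 busy on set 17, RC 1 on set 5
assert dispatch(17, active) == 0  # conflicts with RC 0: rejected
assert dispatch(9, active) is None  # no match: accepted
```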
A number of methods have been proposed in the art for generating re-requests for rejected RC machines. A first example method known in the art is for rejected RC machines to re-request immediately. This approach is fairly simple, and the micro-circuitry required to implement this method requires very little silicon area on the substrate on which the circuitry is implemented. However, each request consumes power, and this method results in rapid requests from all suspended RC machines, leading to increased power consumption. Also, the high rate of re-requests from suspended RC machines delays requests from other RC machines that would have been accepted, causing relatively poor performance.
A second example method known in the art is to blindly re-request at fixed or random time intervals to reduce the frequency of retry requests. The circuitry required to implement this method occupies a greater silicon area and consumes more power than in the first method, without increasing the certainty that a particular hazard will have cleared by the time of the re-request. Thus, this second method still has relatively poor performance because a re-request may come too soon, or may wait too long after the hazard clears.
A third example method known in the art is for each RC machine to signal when it goes idle. The logical OR of these signals is used by all suspended RC machines as the indication to re-request. This method also requires very little silicon area and reduces the frequency of retry requests relative to the first method. However, because all suspended RC machines re-request whenever any RC machine goes idle, there are still frequent requests, causing relatively high power consumption and low performance. Note that because all suspended RC machines re-request as soon as any RC machine goes idle, in a heavily utilized system with many RC machines there stands a good chance that the conflicting RC machine is still active; hence, the re-requesting RC machine will be rejected again.
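The imprecision of this third method can be sketched as follows; the function below is a hypothetical illustration of the OR-based wake-up, not an actual circuit description:

```python
# Illustrative sketch of the third method: the logical OR of
# per-RC "went idle" pulses wakes every suspended RC machine at
# once, even though at most the hazards tied to the one idle
# machine can have cleared.

def retry_mask(suspended, went_idle):
    """Return which suspended RC machines re-request this cycle."""
    any_idle = any(went_idle)          # logical OR of all idle signals
    return [s and any_idle for s in suspended]

suspended = [True, False, True, True]
# No machine goes idle: no suspended machine re-requests.
assert retry_mask(suspended, [False, False, False, False]) == [False, False, False, False]
# One machine goes idle: ALL suspended machines re-request, and any
# whose conflicting RC is still active will simply be rejected again.
assert retry_mask(suspended, [False, True, False, False]) == [True, False, True, True]
```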
A fourth example method known in the art is to return a unique RC identifier (ID) of the conflicting RC machine to the requesting RC machine. The requesting RC machine is suspended and holds this ID. When each active RC machine goes idle, it broadcasts its identifier, and all suspended RC machines compare against this identifier to determine when to re-request. This method generates precise re-requests for a particular hazard. However, holding and comparing the hazard's machine ID in each RC machine requires a relatively large silicon area that increases with an increasing number of RC machines. Also, because only one hazard ID is held, if one or more additional hazards exist, the re-request may be imprecise. Additionally, this approach may result in a “window” condition wherein the rejected RC machine is suspended around the time that the conflicting RC machine goes idle, causing the suspended RC machine to “miss” the ID broadcast of the conflicting RC machine. Additional complexity is required to close this window condition.
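The hold-and-compare behavior of this fourth method, including where the window condition arises, can be sketched as follows. The class and method names are hypothetical illustrations:

```python
# Illustrative sketch of the fourth method: each suspended RC
# machine holds the single ID of the RC machine that rejected it
# and re-requests only when that ID is broadcast at idle time.

class SuspendedRC:
    def __init__(self, hazard_id):
        self.hazard_id = hazard_id   # conflicting RC's ID, saved at rejection
        self.waiting = True

    def on_idle_broadcast(self, idle_id):
        """Compare a broadcast idle ID against the held hazard ID."""
        if self.waiting and idle_id == self.hazard_id:
            self.waiting = False     # precise wake-up: re-request now
            return True
        # Note: a broadcast that occurs before this machine finishes
        # suspending is simply missed (the "window" condition), and
        # only one hazard ID can be held at a time.
        return False

rc = SuspendedRC(hazard_id=3)
assert rc.on_idle_broadcast(7) is False  # some other RC went idle
assert rc.on_idle_broadcast(3) is True   # the conflicting RC went idle
```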
A fifth example method known in the art that precisely handles multiple hazards is to generate a full ID-based dependency vector, wherein one bit per RC machine is saved by each requesting RC machine to indicate whether a hazard exists or not. Multiple hazards from multiple RC machines can be saved in this method. Each active RC machine that goes idle clears its assigned bit in the dependency vector of all suspended RC machines. When the dependency vector of a suspended RC machine is fully cleared, that RC machine can re-request. This results in a precise re-request for multiple hazards, at the expense of a relatively large silicon area that grows as the square of the number of RC machines, since each of N RC machines must hold an N-bit vector.
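The dependency vector of this fifth method can be sketched as a bit mask. This is an illustrative model only; an actual implementation would be a hardware register per RC machine, and the names below are assumptions:

```python
# Illustrative sketch of the fifth method: an N-bit dependency
# vector per suspended RC machine (N = number of RC machines),
# modeled here as a Python integer bit mask.

NUM_RC = 8

class DependencyVector:
    def __init__(self, conflicting_ids):
        # Set one bit for each RC machine that caused a hazard,
        # so multiple simultaneous hazards are recorded precisely.
        self.bits = 0
        for rc_id in conflicting_ids:
            self.bits |= 1 << rc_id

    def on_idle(self, rc_id):
        """An active RC going idle clears its bit in every vector."""
        self.bits &= ~(1 << rc_id)
        return self.bits == 0     # all hazards cleared: re-request now

vec = DependencyVector(conflicting_ids=[2, 5])  # two simultaneous hazards
assert vec.on_idle(2) is False   # one hazard still outstanding
assert vec.on_idle(5) is True    # vector empty: precise re-request
```

Each of the N RC machines carries such an N-bit vector, which is the source of the N-squared area cost noted above.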