There is an ever-growing need to increase the speed with which computers process information. One element of increasing overall processing speed is improving memory access time. As one skilled in the art recognizes, memory latency is a major limitation on processing speed, an issue that has been addressed using a multitude of techniques and approaches.
A common manner by which to improve memory access time is to provide a cache memory along with a main memory. A cache memory is typically associated with a processor, and requires less access time than the main memory. Copies of data from processor reads and writes are retained in the cache. Some cache systems retain recent reads and writes, while others may have more complex algorithms to determine which data is retained in the cache memory. When a processor requests data that is currently resident in the cache, only the cache memory is accessed. Since the cache memory requires less access time than the main memory, processing speed is improved. Today, memory accesses from the main memory may take as long as 250 nanoseconds (or more) while cache access may take as little as two or three nanoseconds.
Additionally, a cache system may be used to increase the effective speed of a data write. For example, if a processor is to write to a storage location, the processor may perform a data write to the cache memory. The cache memory and associated control logic may then write the data to the main memory while the processor proceeds with other tasks.
Computer systems may also extend the use of cache and may employ a multilevel hierarchy of cache memory, with a small amount of relatively fast primary or first level cache memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost, higher-capacity memory at the lowest level of the hierarchy. Typically, the hierarchy includes a small but fast memory system called a primary cache, either physically integrated within a processor integrated circuit or mounted physically near the processor. Primary cache incorporated on the same chip as the Central Processing Unit (CPU) may have a clock frequency (and therefore an access time) equal to the cycle frequency of the CPU. There may be separate instruction primary cache and data primary cache. Primary caches typically maximize performance over a relatively small amount of memory so as to minimize data and/or instruction latency. In addition, primary cache typically supports high bandwidth transfers. Secondary cache or tertiary cache may also be used and is typically located further from the processor. These secondary and tertiary caches provide a “backstop” to the primary cache and generally have larger capacity, higher latency, and lower bandwidth than primary cache.
If a processor requests data or an instruction from a primary cache and the item is present in the primary cache, a cache “hit” results. Conversely, if the item is not present, there is a primary cache “miss.” In the event of a primary cache miss, the requested item is retrieved from the next level of the cache memory or, if the requested item is not contained in cache memory, from the main memory.
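By way of illustration only, the effect of hits and misses on the time to retrieve an item can be estimated with a short calculation. The following is a minimal sketch rather than a description of any particular system: the function name `average_access_time`, the 20 nanosecond secondary cache latency, and both hit rates are assumptions introduced for the example, while the 3 nanosecond and 250 nanosecond figures are the cache and main memory access times mentioned above.

```python
def average_access_time(levels, memory_ns):
    """Estimate the average access time of a cache hierarchy.

    levels: list of (latency_ns, hit_rate) pairs, primary cache first.
    memory_ns: main memory latency, paid when every cache level misses.
    """
    total = 0.0
    reach = 1.0  # fraction of requests that reach the current level
    for latency_ns, hit_rate in levels:
        total += reach * latency_ns   # each request reaching this level pays its latency
        reach *= (1.0 - hit_rate)     # only misses continue to the next level
    return total + reach * memory_ns  # remaining misses go to main memory


# Assumed figures: 90% primary hit rate, 80% secondary hit rate, 20 ns secondary cache.
one_level = average_access_time([(3.0, 0.90)], memory_ns=250.0)
two_level = average_access_time([(3.0, 0.90), (20.0, 0.80)], memory_ns=250.0)
print(f"one level: {one_level:.1f} ns, two levels: {two_level:.1f} ns")
```

Under these assumed figures, a single cache level brings the average access time to roughly 28 nanoseconds, and adding the secondary cache brings it to roughly 10 nanoseconds, compared with 250 nanoseconds when every access goes to main memory.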
Typically, memories are organized into groupings of bits called words (for example, 32 bits or 64 bits per word). The minimum amount of memory that can be transferred between a cache and a next lower level of the memory hierarchy is called a cache line, or sometimes a block. A cache line is typically multiple words (for example, 16 words per line). Memory may also be divided into pages (also called segments), with many lines per page. In some systems, page size may be variable.
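As a non-limiting example of this organization, the arithmetic relating an address to a cache line and to a word within that line can be sketched as follows. The sketch assumes byte addressing, 32-bit words, and 16 words per line (the example sizes given above); the constant and function names are hypothetical and chosen only for illustration.

```python
WORD_BYTES = 4                             # 32-bit words (example size from the text)
WORDS_PER_LINE = 16                        # example line size from the text
LINE_BYTES = WORD_BYTES * WORDS_PER_LINE   # 64 bytes per cache line


def line_address(byte_address):
    """Return the number of the cache line that contains byte_address."""
    return byte_address // LINE_BYTES


def word_in_line(byte_address):
    """Return the position (0 to 15) of the addressed word within its line."""
    return (byte_address % LINE_BYTES) // WORD_BYTES


# Byte address 0x1234 falls in line 72, at word 13 of that line.
print(line_address(0x1234), word_in_line(0x1234))
```

Because a whole line is the unit of transfer, a miss on any one of the 16 words in this example brings all 64 bytes of the line into the cache.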
Caches have been constructed using three principal architectures: direct-mapped, set-associative, and fully-associative. Details of the three cache types are described in the following prior art references, the contents of which are hereby incorporated by reference in their entirety: De Blasi, “Computer Architecture,” ISBN 0-201-41603-4 (Addison-Wesley, 1990), pp. 273-291; Stone, “High Performance Computer Architecture,” ISBN 0-201-51377-3 (Addison-Wesley, 2d Ed. 1990), pp. 29-39; Tabak, “Advanced Microprocessors,” ISBN 0-07-062807-6 (McGraw-Hill, 1991) pp. 244-248.
With direct mapping, when a line of memory is requested, only one line in the cache has matching index bits. Therefore, the data can be retrieved immediately and driven onto a data bus before the system determines whether the rest of the address matches. The data may or may not be valid, but in the usual case where it is valid, the data bits are available on a data bus before the system confirms validity of the data.
With set-associative caches, it is not known which line corresponds to an address until the index address is computed and the tag address is read and compared. That is, in set-associative caches, the result of a tag comparison is used to select which line of data bits within a set of lines is presented to the processor.
A cache is said to be fully associative when a cache stores an entire line address along with the data and any line can be placed anywhere in the cache. However, for a large cache in which any line can be placed anywhere, substantial hardware is required to rapidly determine if and where an entry is stored in the cache. For large caches, a faster, space saving alternative is to use a subset of an address (called an index) to designate a line position within the cache, and then store the remaining set of more significant bits of each physical address (called a tag) along with the data. In a cache with indexing, an item with a particular address can be placed only within a set of cache lines designated by the index. If the cache is arranged so that the index for a given address maps to exactly one line in the subset, the cache is said to be direct mapped. If the index maps to more than one line in the subset, the cache is said to be set-associative. All or part of an address is hashed to provide a set index which partitions the address space into sets.
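The relationship between the index, the tag, and the three cache architectures may be illustrated by the following sketch. It is offered under stated assumptions only: the index is taken to be the low-order bits of the line address (a simple modulo in place of a more general hash), and the function names `split_address` and `candidate_lines` are hypothetical.

```python
def split_address(line_addr, num_sets):
    """Split a line address into (tag, index) for a cache with num_sets sets."""
    index = line_addr % num_sets   # low-order bits select the set (assumed hash)
    tag = line_addr // num_sets    # remaining high-order bits are stored as the tag
    return tag, index


def candidate_lines(line_addr, num_lines, ways):
    """List the cache line positions in which line_addr is allowed to reside.

    ways == 1          -> direct mapped (exactly one candidate)
    1 < ways < lines   -> set-associative (one set of `ways` candidates)
    ways == num_lines  -> fully associative (every line is a candidate)
    Assumes num_lines is a multiple of ways.
    """
    num_sets = num_lines // ways
    _, index = split_address(line_addr, num_sets)
    return [index * ways + way for way in range(ways)]


print(candidate_lines(13, num_lines=8, ways=1))  # direct mapped
print(candidate_lines(13, num_lines=8, ways=2))  # two-way set-associative
print(candidate_lines(13, num_lines=8, ways=8))  # fully associative
```

In this sketch, the same eight-line cache offers one, two, or eight possible positions for line address 13 depending solely on how many lines share each index.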
In all three types of caches, an input address is applied to comparison logic. Typically, a subset of the address, called tag bits, is extracted from the input address and compared to tag bits of each cache entry. If the tag bits match, then corresponding data is extracted from the cache.
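The tag comparison described above may be sketched, by way of example only, as a small software model. The class name `SetAssociativeCache` is hypothetical, and the replacement of the least recently used entry when a set is full is an assumed policy that the foregoing description does not specify.

```python
class SetAssociativeCache:
    """Toy model that stores tags only, to show hit and miss decisions."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set holds up to `ways` tags, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, line_addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        index = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        tags = self.sets[index]
        if tag in tags:            # compare the input tag with each entry in the set
            tags.remove(tag)
            tags.append(tag)       # mark as most recently used
            return True
        if len(tags) == self.ways:
            tags.pop(0)            # evict the least recently used tag (assumed policy)
        tags.append(tag)
        return False


# Line addresses 0 and 4 collide in a four-line direct-mapped cache,
# but coexist in a two-way set-associative cache of the same capacity.
direct = SetAssociativeCache(num_sets=4, ways=1)
two_way = SetAssociativeCache(num_sets=2, ways=2)
print([direct.access(a) for a in (0, 4, 0)])
print([two_way.access(a) for a in (0, 4, 0)])
```

With `ways` set to 1 the model behaves as a direct-mapped cache, and with `num_sets` set to 1 it behaves as a fully associative cache, so the same comparison logic covers all three architectures.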
In general, direct-mapped caches provide the fastest access but require the most time for comparing tag bits. Fully-associative caches have greater access time, consume more power, and require more complex circuitry.
When multiple processors with their own respective caches are included in a system, cache coherency protocols are used to maintain coherency between and among the caches. This is because the same data may be stored in or requested by more than one cache. There are two classes of cache coherency protocols:
1. Directory based: The information about one block of physical memory is maintained in a single, common location. This information usually includes which cache(s) has a copy of the block and whether that copy is marked exclusive for future modification. An access to a particular block first queries the directory to see if the memory data is stale and the current data resides in some other cache (if at all). If it is, then the cache containing the modified block is forced to return its data to memory. Then the memory forwards the data to the new requester, updating the directory with the new location of that block. This protocol minimizes interbus module (or inter-cache) disturbance, but typically suffers from high latency and is expensive to build due to the large directory size required.
2. Snooping: Every cache that has a copy of the data from a block of physical memory also has a copy of the information about the data block. Each cache is typically located on a shared memory bus, and all cache controllers monitor or “snoop” on the bus to determine whether or not they have a copy of the shared block.
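For purposes of illustration, the directory-based class described in item 1 above may be modeled as follows. This is a simplified sketch under stated assumptions, not a definitive implementation: the class name `Directory`, its fields, and the choice to invalidate other copies on a write are all hypothetical, and main memory and the caches are modeled as dictionaries.

```python
class Directory:
    """Single common location recording which caches hold each block."""

    def __init__(self, num_caches):
        self.memory = {}                                    # block -> value in main memory
        self.caches = [dict() for _ in range(num_caches)]   # per-cache block -> value
        self.sharers = {}                                   # block -> ids of caches with a copy
        self.owner = {}                                     # block -> id of cache with a modified copy

    def _write_back(self, block):
        """If memory is stale, force the owning cache to return its data."""
        owner = self.owner.pop(block, None)
        if owner is not None:
            self.memory[block] = self.caches[owner][block]

    def read(self, cid, block):
        if block in self.caches[cid]:
            return self.caches[cid][block]    # local hit, directory not consulted
        self._write_back(block)               # query the directory for a modified copy
        value = self.memory.get(block, 0)
        self.caches[cid][block] = value       # memory forwards data to the requester
        self.sharers.setdefault(block, set()).add(cid)
        return value

    def write(self, cid, block, value):
        self._write_back(block)
        for other in self.sharers.get(block, set()) - {cid}:
            self.caches[other].pop(block, None)   # invalidate other copies (assumed)
        self.caches[cid][block] = value
        self.sharers[block] = {cid}
        self.owner[block] = cid                   # copy is now exclusive and modified


d = Directory(num_caches=2)
d.memory[7] = 1
d.read(0, 7)          # cache 0 obtains a copy
d.write(1, 7, 5)      # cache 1 modifies the block; memory is now stale
print(d.read(0, 7))   # directory forces a write-back, then forwards the data
```

Only the caches recorded in the directory are disturbed by a given access, which reflects the reduced inter-cache disturbance noted above, at the cost of one directory entry per block of memory.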
Snooping protocols are well suited for multiprocessor system architectures that use caches and shared memory because they operate in the context of the preexisting physical connection usually provided between the bus and the memory. Snooping is often preferred over directory protocols because the amount of coherency information is proportional to the number of blocks in a cache, rather than the number of blocks in main memory.
The coherency problem arises in a multiprocessor architecture when a processor must have exclusive access to write a block of memory or an object into memory, and/or must have the most recent copy when reading an object. A snooping protocol must locate all caches that share the object to be written. The consequences of a write to shared data are either to invalidate all other copies of the data, or to broadcast the write to all of the shared copies. Because of the use of write-back caches, coherency protocols must also cause checks on all caches during memory reads to determine which processor has the most up to date copy of the information.
Data concerning information that is shared among the processors is added to status bits that are provided in a cache block to implement snooping protocols. This information is used when monitoring bus activities. On a read miss, all caches check to see if they have a copy of the requested block of information and take the appropriate action, such as supplying the information to the cache that missed. Similarly, on a write, all caches check to see if they have a copy of the data, and then act, for example by invalidating their copy of the data, or by changing their copy of the data to reflect the most recent value.
Snooping protocols are of two types:
1. Write invalidate: The writing processor causes all copies in other caches to be invalidated before changing its local copy. The processor is then free to update the data until such time as another processor asks for the data. The writing processor issues an invalidation signal over the bus, and all caches check to see if they have a copy of the data. If so, they must invalidate the block containing the data. This scheme allows multiple readers but only a single writer.
2. Write broadcast: Rather than invalidate every block that is shared, the writing processor broadcasts the new data over the bus. All copies are then updated with the new value. This scheme continuously broadcasts writes to shared data, while the write invalidate scheme discussed above deletes all other copies so that there is only one local copy for subsequent writes. Write broadcast protocols usually allow data to be tagged as shared (broadcast), or the data may be tagged as private (local). For further information on coherency, see J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc. (1990), the disclosure of which is incorporated herein by reference in its entirety.
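The difference between the two types of snooping protocol may be illustrated by the following sketch, in which a write on a shared bus is observed by every other cache. The class name `SnoopingBus` and the policy strings are hypothetical, and each cache is modeled simply as a dictionary mapping a block to its value.

```python
class SnoopingBus:
    """Toy shared bus on which every cache snoops each write."""

    def __init__(self, num_caches, policy):
        if policy not in ("invalidate", "broadcast"):
            raise ValueError("policy must be 'invalidate' or 'broadcast'")
        self.policy = policy
        self.caches = [dict() for _ in range(num_caches)]   # per-cache block -> value

    def write(self, writer, block, value):
        for cid, cache in enumerate(self.caches):
            if cid == writer or block not in cache:
                continue                  # snoop miss: this cache holds no copy
            if self.policy == "invalidate":
                del cache[block]          # write invalidate: discard the stale copy
            else:
                cache[block] = value      # write broadcast: update the copy in place
        self.caches[writer][block] = value


for policy in ("invalidate", "broadcast"):
    bus = SnoopingBus(num_caches=3, policy=policy)
    bus.caches[1][42] = "old"             # cache 1 shares block 42
    bus.write(0, 42, "new")               # cache 0 writes the block
    print(policy, bus.caches[1].get(42))
```

Under the invalidate policy the sharing cache loses its copy and must miss on its next access, whereas under the broadcast policy it retains an up-to-date copy at the cost of bus traffic on every write to shared data.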
In a snooping coherence multiprocessor system architecture, each coherent transaction on the system bus is forwarded to each processor's cache subsystem to perform a coherency check. This check usually disturbs and/or disrupts the processor's pipeline because the cache cannot be accessed by the processor while the coherency check is taking place.
In a traditional, single-ported cache without duplicate cache tags, the processor pipeline is stalled on cache access instructions when the cache controller is busy processing cache coherency checks for other processors. For each snoop, the cache controller must first check the cache tags for the snoop address, and then modify the cache state if there is a hit. Allocating cache bandwidth for an atomic (inseparable) tag read and write (for possible modification) locks the cache from the processor longer than needed if the snoop does not require a tag write. For example, 80% to 90% of the cache queries are misses, i.e., a tag write is not required. In a multi-level cache hierarchy, many of these misses may be filtered if the inclusion property is obeyed. An inclusion property allows information to be stored in the highest level of cache concerning the contents of the lower cache levels.
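The cost of reserving the cache for an atomic tag read and write on every snoop can be estimated with simple arithmetic. The following sketch is illustrative only: the one-cycle cost assumed for each tag read and each tag write, the function name `tag_cycles_per_snoop`, and the simplifying assumption that every snoop hit requires a tag write are not taken from the foregoing description. Only the 80% to 90% miss rate is.

```python
def tag_cycles_per_snoop(hit_rate, read_cycles=1, write_cycles=1, atomic=True):
    """Expected cycles for which one snoop keeps the cache tags busy.

    atomic=True  : read and write slots are always reserved together.
    atomic=False : the write slot is used only when the snoop hits
                   (assumes every hit needs a tag write, an upper bound).
    """
    if atomic:
        return read_cycles + write_cycles
    return read_cycles + hit_rate * write_cycles


for miss_rate in (0.80, 0.90):
    hit_rate = 1.0 - miss_rate
    atomic = tag_cycles_per_snoop(hit_rate, atomic=True)
    split = tag_cycles_per_snoop(hit_rate, atomic=False)
    print(f"miss rate {miss_rate:.0%}: atomic {atomic:.2f}, separate {split:.2f} cycles")
```

Under these assumptions the atomic approach occupies the tags for two cycles per snoop, while a separate read and conditional write would occupy them for only about 1.1 to 1.2 cycles on average, which indicates how much of the reserved bandwidth goes unused when most snoops miss.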
The speed at which computers process information for many applications can also be increased by increasing the size of the caches, especially the primary cache. As the size of the primary cache increases, main memory accesses are reduced and the overall processing speed increases. Similarly, as the size of the secondary cache increases, the main memory accesses are reduced and the overall processing speed is increased, though not as effectively as increasing the size of the primary cache.
Typically, in computer systems, primary caches, secondary caches and tertiary caches are implemented using Static Random Access Memory (SRAM). The use of SRAM allows reduced access time which increases the speed at which information can be processed. Dynamic Random Access Memory (DRAM) is typically used for the main memory as it is less expensive, requires less power, and provides greater storage densities.
Typically, prior art computer systems also limited the number of outstanding transactions to the cache at a given time. If more than one transaction were received by a cache, the cache would process the requests serially. For instance, if two transactions were received by a cache, the first transaction request received would be processed first with the second transaction held until the first transaction was completed. Once the first transaction was completed the cache would process the second transaction request.
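The consequence of such serial processing may be illustrated by a brief calculation. In the sketch below, the function name `serial_completion_times` is hypothetical, all transactions are assumed to arrive at time zero, and the service times reuse the illustrative 250 nanosecond main memory and 3 nanosecond cache access times mentioned earlier.

```python
def serial_completion_times(service_times_ns):
    """Return the time at which each transaction completes when the cache
    handles one transaction at a time, in the order received."""
    finish = []
    clock = 0
    for service_ns in service_times_ns:
        clock += service_ns      # the next transaction waits for this one to finish
        finish.append(clock)
    return finish


# A miss serviced from main memory (250 ns) followed by a hit (3 ns):
# the hit cannot complete until the miss has been fully processed.
print(serial_completion_times([250, 3]))
```

In this example the second transaction, which by itself would take only 3 nanoseconds, does not complete until 253 nanoseconds have elapsed because it is held behind the first.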
Numerous protocols exist that maintain cache coherency across multiple caches and main memory. One such protocol is called MESI, which is described in detail in M. Papamarcos and J. Patel, “A Low Overhead Coherent Solution for Multiprocessors with Private Cache Memories,” in Proceedings of the 11th International Symposium on Computer Architecture, IEEE, New York (1984), pp. 348-354, incorporated herein by reference in its entirety. MESI stands for Modified, Exclusive, Shared, Invalid, the four status conditions for data. Under the MESI protocol, a cache line is categorized according to its use. A modified cache line indicates that the particular line has been written to by the cache that is the current “owner” of the line. (As used herein, the term “owner” and the like refer to a designation representing authority to exercise control over the data.) An exclusive cache line indicates that a cache has exclusive ownership of the cache line, which will allow the cache controller to modify the cache line. A shared cache line indicates that one or more caches have ownership of the line. A shared cache line is considered read only, and any device under the cache may read the line but is not permitted to write to the cache line. An invalid cache line, or a cache line with no owner, identifies a cache line whose data may not be valid since the cache no longer owns the cache line.
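The four MESI status conditions and typical transitions among them may be sketched as follows. This is a simplified illustration of commonly described MESI behavior and is not a reproduction of the protocol as set out in the cited reference: the function names are hypothetical, and bus signaling, write-back of modified data, and data transfer are omitted.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_local_read(state, others_have_copy):
    """The processor attached to this cache reads the line."""
    if state is State.INVALID:
        # Read miss: the line arrives shared if any other cache holds it.
        return State.SHARED if others_have_copy else State.EXCLUSIVE
    return state                      # M, E, and S all satisfy a read locally


def on_local_write(state):
    """The processor attached to this cache writes the line.

    From SHARED or INVALID the cache must first gain exclusive ownership
    (other copies are invalidated); that bus activity is not modeled here.
    """
    return State.MODIFIED


def on_snooped_read(state):
    """Another cache reads the line, as observed on the bus."""
    if state in (State.MODIFIED, State.EXCLUSIVE):
        return State.SHARED           # ownership is no longer exclusive
    return state


def on_snooped_write(state):
    """Another cache writes the line, as observed on the bus."""
    return State.INVALID              # this cache no longer owns the line


line = on_local_read(State.INVALID, others_have_copy=False)   # EXCLUSIVE
line = on_local_write(line)                                   # MODIFIED
line = on_snooped_read(line)                                  # SHARED
line = on_snooped_write(line)                                 # INVALID
print(line)
```

The sequence at the end of the sketch walks a single line through all four states, ending in the invalid state once another cache has written the line.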