1. Field of the Invention
The invention relates to cache coherency mechanisms in a multiple processor environment, and more particularly, to a mechanism for reducing the number of snoops required of a processor structure which includes a cache memory.
2. Description of Related Art
Many computer systems include at least one level of cache memory. A cache memory is a high-speed memory that is positioned between a central processing unit (CPU) and main memory in a computer system in order to improve system performance. Cache memories (or caches) store copies of portions of main memory data that are actively being used by the CPU while a program is running. Since the access time of a cache can be faster than that of main memory, the overall access time for accesses by the CPU can be reduced. Descriptions of various uses of and methods of employing caches appear in the following articles: Kaplan, "Cache-based Computer Systems," Computer, 3/73 at 30-36; Rhodes, "Caches Keep Main Memories From Slowing Down Fast CPUs," Electronic Design, Jan. 21, 1982, at 179; Strecker, "Cache Memories for PDP-11 Family Computers," in Bell, "Computer Engineering" (Digital Press), at 263-67, and Intel, "i486 Processor Hardware Reference Manual" (1990) at 6-1 through 6-11, all incorporated herein by reference.
Many microprocessor-based systems implement a "direct mapped" cache memory. In general, a direct mapped cache memory comprises a high-speed data Random Access Memory (RAM) and a parallel high-speed tag RAM. The RAM address of each line in the data cache is the same as the low-order portion of the main memory line address to which the entry corresponds, the high-order portion of the main memory address being stored in the tag RAM. Thus, if main memory is thought of as 2^m blocks of 2^n "lines" of one or more bytes each, the i'th line in the cache data RAM will be a copy of the i'th line of one of the 2^m blocks in main memory. The identity of the main memory block that the line came from is stored in the i'th location in the tag RAM.
When a CPU requests data from memory, the low-order portion of the line address is supplied as an address to both the cache data and cache tag RAMs. The tag for the selected cache entry is compared with the high-order portion of the CPU's address and, if it matches, then a "cache hit" is indicated and the data from the cache data RAM is enabled onto a data bus of the system. If the tag does not match the high-order portion of the CPU's address, or the tag data is invalid, then a "cache miss" is indicated and the data is fetched from main memory. It is also placed in the cache for potential future use, overwriting the previous entry. Typically, an entire line is read from main memory and placed in the cache on a cache miss, even if only a byte is requested. On a data write from the CPU, either the cache RAM or main memory or both may be updated, it being understood that flags may be necessary to indicate to one that a write has occurred in the other.
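The direct mapped lookup described above can be sketched as follows. The geometry (16-byte lines, eight cache lines) and all names are illustrative assumptions, not taken from any particular processor:

```python
# Sketch of a direct-mapped cache lookup (illustrative geometry:
# 16-byte lines, 2^n = 8 lines; these parameters are assumptions).
LINE_BYTES = 16
NUM_LINES = 8

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES   # tag RAM; None means invalid
        self.data = [None] * NUM_LINES   # data RAM

    def lookup(self, addr, memory):
        line_addr = addr // LINE_BYTES   # strip the byte offset
        index = line_addr % NUM_LINES    # low-order line-address bits
        tag = line_addr // NUM_LINES     # high-order line-address bits
        if self.tags[index] == tag:
            return "hit", self.data[index]
        # Cache miss: read the entire line from main memory and
        # overwrite the previous entry, as described above.
        line = memory[line_addr]
        self.tags[index] = tag
        self.data[index] = line
        return "miss", line
```

Note that two main memory lines whose addresses differ only in the high-order (tag) portion map to the same cache index, so they evict one another.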
Accordingly, in a direct mapped cache, each "line" of secondary memory can be mapped to one and only one line in the cache. In a "fully associative" cache, a particular line of secondary memory may be mapped to any of the lines in the cache; in this case, in a cacheable access, all of the tags must be compared to the address in order to determine whether a cache hit or miss has occurred. "k-way set associative" cache architectures also exist which represent a compromise between direct mapped caches and fully associative caches. In a k-way set associative cache architecture, each line of secondary memory may be mapped to any of k lines in the cache. In this case, k tags must be compared to the address during a cacheable secondary memory access in order to determine whether a cache hit or miss has occurred. Caches may also be "sector buffered" or "sub-block" type caches, in which several portions of a cache data line, each with its own valid bit, correspond to a single cache tag RAM entry.
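A k-way set associative lookup, by contrast, compares k tags per access. The sketch below assumes k=2, four sets, 16-byte lines, and a simple first-in-first-out replacement policy; all of these are illustrative choices, not requirements of the architecture:

```python
# Sketch of a k-way set-associative lookup: each secondary-memory line
# maps to one set, and all k tags in that set are compared.
K = 2
NUM_SETS = 4
LINE_BYTES = 16

class SetAssociativeCache:
    def __init__(self):
        # Each set holds up to K (tag, data) entries.
        self.sets = [[] for _ in range(NUM_SETS)]

    def lookup(self, addr, memory):
        line_addr = addr // LINE_BYTES
        set_idx = line_addr % NUM_SETS
        tag = line_addr // NUM_SETS
        ways = self.sets[set_idx]
        for way_tag, data in ways:       # compare all k tags
            if way_tag == tag:
                return "hit", data
        line = memory[line_addr]
        if len(ways) == K:               # set full: evict oldest (FIFO)
            ways.pop(0)
        ways.append((tag, line))
        return "miss", line
```

With K equal to the total number of lines (and one set) this degenerates into a fully associative cache; with K=1 it is a direct mapped cache.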
When the CPU executes instructions that modify the contents of the cache, these modifications must also be made in the main memory or the data in main memory will become "stale." There are two primary techniques for keeping the contents of the main memory consistent with that of the cache: (1) the write-through method and (2) the write-back or copy-back method. In the write-through method, on a cache write hit, data is written to the main memory immediately after or while data is written into the cache. This enables the contents of the main memory always to be valid and consistent with that of the cache. In the write-back method, on a cache write hit, the system writes data into the cache only and sets a "dirty bit" (or enters a "modified" state) which indicates that a data word has been written into the cache but not into the main memory. On a subsequent cache read miss, which requires a cache line to be replaced (filled) with new data from memory, a cache controller checks for a dirty bit before overwriting any line of data in the cache. If the dirty bit for the cache line is set, the cache controller writes the line of data out to main memory before loading the cache with new data.
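The two write policies, and the dirty-bit check on line replacement, can be sketched as follows; the data structures and names are illustrative assumptions:

```python
# Sketch of the two write policies on a cache write hit, plus the
# dirty-bit check a write-back controller performs on replacement.

def write_hit_write_through(line, memory, line_addr, value):
    # Write the cache and main memory together, so memory is
    # always consistent with the cache.
    line["data"] = value
    memory[line_addr] = value

def write_hit_write_back(line, value):
    # Write the cache only and set the dirty bit; main memory is
    # stale until the line is written back.
    line["data"] = value
    line["dirty"] = True

def fill_line(line, memory, old_addr, new_addr):
    # On a read miss the controller checks the dirty bit before
    # overwriting the line, flushing it to memory first if set.
    if line["dirty"]:
        memory[old_addr] = line["data"]
    line["data"] = memory[new_addr]
    line["dirty"] = False
```

The write-back flush on replacement is what makes the technique safe despite main memory being transiently stale.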
A computer system can have more than one level of cache memory for a given address space. For example, in a two-level cache system, the "level one" (L1) cache is logically adjacent to the host processor. The second level (L2) cache is logically behind the first level cache, and other memory (which in this case can be referred to as tertiary memory), typically DRAM or SDRAM, is located logically behind the second level cache. When the host processor performs an access to an address in the memory address space, the first level cache responds if possible. If the first level cache cannot respond (for example, because of an L1 cache miss), then the second level cache responds if possible. If the second level cache also cannot respond, then the access is made to the tertiary memory. The host processor does not need to know how many levels of caching are present in the system or indeed that any caching exists at all. Similarly, the first level cache does not need to know whether a second level of caching exists prior to the tertiary memory. Thus, to the CPU, the combination of both caches and tertiary memory is considered merely as a single main memory "structure". Similarly, to the L1 cache, the combination of the L2 cache and tertiary memory is considered simply as a single main memory structure. In fact, a third level (L3) of caching could be included behind the L2 cache, and the L2 cache would still consider the combination of L3 and subsequent memory as a single main memory structure.
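The level-by-level fallback described above can be sketched as follows; each level responds if it holds the address, the access otherwise falls through to the next level, and levels that missed are filled on the way back. The representation of each level as a simple mapping is an illustrative assumption:

```python
# Sketch of a multi-level memory access: the requester never learns
# how many cache levels exist behind the one it talks to.
def access(addr, levels, tertiary):
    for depth, cache in enumerate(levels):
        if addr in cache:
            data = cache[addr]
            break
    else:
        depth = len(levels)
        data = tertiary[addr]      # all caches missed
    for cache in levels[:depth]:   # fill every level that missed
        cache[addr] = data
    return data
```

Because the fill happens transparently, adding an L3 behind the L2 changes nothing from the CPU's or the L1's point of view.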
The PowerPC™ 603 microprocessor, available from IBM and Motorola, is an example of a microprocessor which has an on-chip, two-way set associative cache memory. This cache is divided into a data cache and a separate instruction cache. The data cache on a PowerPC 603 is a write-back cache. The cache is actually programmable, based on the address specified, to follow either a write-through or a write-back policy, but special precautions must be taken externally to the chip as long as even one line is able to follow a write-back policy, as further explained below. Thus, as used herein, a "write-back cache" is a cache memory, any part of which can hold data which is inconsistent with that in the external memory subsystem.
In systems having multiple devices which share a common address space, a cache coherency protocol is implemented in order to provide the same image of memory to all such devices. Such a protocol allows synchronization and cooperative use of shared resources. Otherwise, multiple copies of a memory location, some containing stale values, could exist in a system and errors could result. One popular write-back cache coherency protocol is known as the MESI (modified/exclusive/shared/invalid) protocol. The MESI protocol is described in Intel, "Pentium Processor User's Manual", Vol. 1: "Pentium Processor Databook" (1993), incorporated herein by reference, especially at pp. 3-20 through 3-21. A superset of the MESI protocol, known as MOESI, is described in Thorson, "Multiprocessor Cache Coherency", Microprocessor Report, pp. 12-15 (Jun. 20, 1990), also incorporated by reference. In the MESI protocol, each cache data line is accompanied by a pair of bits which indicate the status of the line. Specifically, if a line is in state M, then it is "modified" (has been written to since it was retrieved from main memory). An M-state line can be accessed (read or written) by the CPU without sending a cycle out on an external bus to higher levels of the memory subsystem.
If a cache line is in state E ("exclusive"), then it is not "modified" (i.e. it contains the same data as subsequent levels of the memory subsystem). In shared cache systems, state E also indicates that the cache line is available in only one of the caches. The CPU can access (read or write) an E-state line without generating a bus cycle to higher levels of the memory subsystem, but when the CPU performs a write access to an E-state line, the line then becomes "modified" (state M).
A line in state S ("shared") may exist in more than one cache. A read access by the CPU to an S-state line will not generate bus activity, but a write access to an S-state line will cause a write-through cycle to higher levels of the memory subsystem in order to permit the sharing cache to potentially invalidate its own corresponding line. The write will also update the data in the data cache line.
A line in state I is invalid. It is not available in the cache. A read access by the CPU to an I-state line will generate a "cache miss" and may cause the cache to execute a line fill (fetch the entire line into the cache from higher levels of the memory subsystem). A write access by the CPU to an I-state line will cause the cache to execute a write-through cycle to higher levels of the memory subsystem.
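The four state descriptions above can be collected into a single transition sketch for CPU accesses. Where the text leaves the next state open (the state reached by a line fill, or the state of a shared line after its write-through), the choices below are assumptions noted in comments, not behavior taken from the cited manuals:

```python
# Sketch of per-line MESI transitions for CPU read/write accesses.
# Returns (new_state, bus_activity); bus_activity None means the
# access is serviced without an external bus cycle.
def mesi_cpu_access(state, access):
    if state == "M":                  # modified: serviced internally
        return "M", None
    if state == "E":                  # exclusive and clean
        if access == "write":
            return "M", None          # becomes modified, no bus cycle
        return "E", None
    if state == "S":                  # possibly shared
        if access == "read":
            return "S", None
        # Write-through so sharing caches can invalidate their copies;
        # ASSUMPTION: the line stays S (real CPUs may promote it to E).
        return "S", "write-through"
    if state == "I":                  # invalid: not in the cache
        if access == "read":
            # Line fill; ASSUMPTION: an exclusive fill (state E),
            # i.e. no other cache holds the line in this sketch.
            return "E", "line-fill"
        return "I", "write-through"
```

The key property the protocol buys is visible here: M- and E-state lines are serviced with no bus traffic at all, while S- and I-state accesses generate the cycles other caches need to stay coherent.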
The PowerPC 603 implements a cache coherency protocol which is a coherent subset of the MESI protocol omitting the shared (S) state. Since data cannot be shared, the PowerPC signals all cache line fills as if they were cache write misses (reads with intent to modify), thereby flushing the corresponding copies of the data in all caches external to the PowerPC prior to the PowerPC's cache line fill operation. Following the cache line fill, the PowerPC is the exclusive owner of the data and may write to it without a bus broadcast transaction (state E).
Computer system cache memories typically cache main memory data for the CPU. If the cache uses a write-back protocol, then frequently the cache memory will contain more current data than the corresponding lines in main memory. This poses a problem for other devices which share the same address space in the memory, because these devices do not know whether the main memory version is the most current version of the data. Similarly, for both write-back and write-through caches, even if the data in the cache is not modified with respect to that in memory, the CPU must be kept informed of write accesses to memory by external devices. Otherwise, the CPU would not know whether the cached version is the most current copy of the data. Cache controllers, therefore, typically support inquire cycles (also known as snoop cycles), in which a device essentially asks the cache memory to indicate whether it has a more current copy of the data.
In PowerPC-based systems, a device issues a snoop cycle by driving the snoop address onto the CPU bus and asserting the processor's TS and GBL control signals. The processor responds by asserting its ARTRY output if the specified data line is present in the internal cache and the specified cache line is in the M (modified) state. (If the specified data line is present in the internal cache but it is unmodified (state E), then the processor merely invalidates the line in the cache. Similarly, if the specified data line is present in the internal cache but the snoop cycle is for a write access to the entire line, then the processor merely invalidates the line in the cache. In either case, ARTRY is not asserted.) Thus, ARTRY, when asserted, indicates that the internal cache contains a more current copy of the data than is in main memory. The processor then automatically conducts a write-back cycle while the external device waits. By this process, therefore, the external device will be able to access the desired line in main memory without any further concern that the processor's internal cache contains a more current copy of the data.
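The processor's snoop response logic described above can be sketched as follows. The cache is modeled as a mapping from line address to state, and the assumption that a modified line is invalidated after its write-back is an illustrative simplification:

```python
# Sketch of a PowerPC 603-style snoop response: ARTRY is asserted only
# when the snooped line is present and modified AND the snoop is not a
# write to the entire line; any other matching line is just invalidated.
def snoop(cache, snoop_addr, is_full_line_write):
    """cache maps line address -> state ("M" or "E").
    Returns True if ARTRY is asserted (write-back needed)."""
    state = cache.get(snoop_addr)
    if state is None:
        return False                  # line not cached: no action
    if state == "M" and not is_full_line_write:
        # More current copy in the cache: assert ARTRY; the processor
        # writes the line back while the external device waits.
        # ASSUMPTION: the line is invalidated after the write-back.
        del cache[snoop_addr]
        return True
    # Unmodified line, or the snoop will overwrite the whole line:
    # merely invalidate; ARTRY is not asserted.
    del cache[snoop_addr]
    return False
```

After a snoop returns False, the external device may access main memory directly; after True, it must first wait for the processor's write-back cycle to complete.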
The time required to perform the snoop cycle, however, is significant. This is a problem not only because of the CPU bus bandwidth occupied by snoop cycles, but also because of the delays they impose on memory accesses by the external device. In systems in which the external devices are performance-critical, such as in graphics coprocessor arrangements, the need to snoop every memory access can substantially impact performance.
One technique that has been used in the past to minimize the number of snoops required by an external device is simply to designate parts of the memory address space as being dedicated to the external device. For example, in systems having a graphics coprocessor, an area of the memory address space may be designated the frame buffer and dedicated to the coprocessor. The coprocessor never needs to snoop the CPU's cache because only the coprocessor, and not the CPU, can read or write the frame buffer. But this solution greatly limits the flexibility of the system: it may be most desirable, for example, for the CPU to render some parts of an image while the coprocessor renders other parts of the same image. Dedicating the frame buffer to the coprocessor precludes such flexibility. Moreover, this solution sidesteps the question of how to minimize snoops when an external device accesses shared regions of the memory address space; dedicating an area of memory exclusively to the external device renders it no longer shared.
Another technique to minimize snoops of a processor's internal cache has been used only on high-end systems which include a second-level (L2) cache external to the processor. Specifically, the system enforces a rule that data cannot be cached in the processor's internal cache unless it is also cached in the L2 cache. In such a system, the external device first snoops the L2 cache, and then snoops the processor's internal cache only if there is an L2 cache hit. The device does not need to snoop the processor's internal cache if there is an L2 cache miss. But this solution is expensive in that it requires a second-level cache external to the processor.
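The inclusion rule enforced by this technique can be sketched as a simple snoop filter; the function name and the modeling of the L2 as a set of cached line addresses are illustrative assumptions:

```python
# Sketch of an inclusion-based snoop filter: because every line in the
# processor's internal (L1) cache is also in the external L2 cache, an
# L2 miss proves the L1 cannot hold the line, so its snoop is skipped.
def snoop_sequence(l2_lines, snoop_addr):
    if snoop_addr in l2_lines:
        # L2 hit: the internal cache may also hold the line,
        # so it must be snooped as well.
        return ["L2", "L1"]
    return ["L2"]                     # L2 miss: no L1 snoop required
```

The saving comes from the common case: most external accesses miss the L2 and therefore never disturb the processor bus at all.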
Accordingly, a definite need continues to exist for an alternative mechanism for reducing the number of snoop cycles required of a processor structure having an internal cache memory.