1. Technical Field
The present invention relates to data processing systems in general and, in particular, to improved cache operations within a data-processing system. Still more particularly, the present invention relates to an improved method, system, and processor cache topology that more efficiently supports cache coherency operations within a data-processing system.
2. Description of the Prior Art
A data-processing system typically includes a processor coupled to a variety of storage devices arranged in a hierarchical manner. In addition to a main memory, a commonly employed storage device in the hierarchy includes a high-speed memory known as a cache memory (or cache). A cache speeds up the apparent access times of the relatively slower main memory by retaining the data or instructions that the processor is most likely to access again, and making the data or instructions available to the processor at a much lower latency. As such, caches enable relatively fast access to a subset of data and/or instructions that were recently transferred from the main memory to the processor, and thus improve the overall speed of the data-processing system.
Most contemporary high-performance data processing system architectures include multiple levels of cache memory within the memory hierarchy. Cache levels are typically arranged in order of progressively longer access latency. Smaller, faster caches are employed at levels within the storage hierarchy closer to the processor (or processors), while larger, slower caches are employed at levels closer to system memory.
In a conventional symmetric multiprocessor (SMP) data processing system, all of the processors are generally identical, insofar as the processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system may comprise a system memory, a plurality of processing elements that each include a processor and one or more levels of cache memory, and a system bus coupling the processing elements to each other and to the system memory. Many such systems include at least one level of cache memory shared between two or more processors. To obtain valid execution results in an SMP data processing system, it is important to maintain a coherent memory hierarchy, that is, to provide a single view of the contents of memory to all of the processors.
A coherent memory hierarchy is maintained through the use of a selected memory coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of a coherency state is stored in association with each coherency granule (i.e., cache line) of at least all upper level (cache) memories. Each coherency granule can have one of four states, modified (M), exclusive (E), shared (S), or invalid (I), which can be encoded by two bits in the cache directory. Those skilled in the art are familiar with the MESI protocol and its use to ensure coherency among memory structures.
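By way of illustration only, the two-bit encoding of the four MESI states can be sketched as follows. The particular bit assignments and the helper names `pack_states` and `unpack_state` are assumptions made for this sketch; the text above specifies only that four states fit in two bits of the cache directory.

```python
from enum import IntEnum


class MESI(IntEnum):
    # Bit patterns are an arbitrary choice for this sketch; any
    # one-to-one mapping of four states onto two bits would do.
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


def pack_states(states):
    """Pack a sequence of MESI states into one integer, two bits per line."""
    word = 0
    for i, state in enumerate(states):
        word |= int(state) << (2 * i)
    return word


def unpack_state(word, index):
    """Recover the state of the line at position `index` from a packed word."""
    return MESI((word >> (2 * index)) & 0b11)


packed = pack_states([MESI.MODIFIED, MESI.INVALID, MESI.SHARED])
```

Here three coherency granules occupy six bits of `packed`, and `unpack_state(packed, 2)` recovers the shared state of the third granule.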
Each cache line (block) of data in an SMP system typically includes an address tag field, a state bit field, an inclusivity bit field, and a value/data field for storing the actual instruction or data. In current processing systems, both the address tag field and the state bit field are contained in a cache directory. This cache directory may be organized under any caching scheme available, such as fully associative, direct mapped, or set-associative, as are well-known in the art. A compare match of an incoming address with one of the tags within the address tag field indicates a cache "hit."
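The tag compare described above can be sketched, under stated assumptions, for a small set-associative directory. The line size, set count, associativity, and the names `DirEntry`, `split_address`, `make_directory`, and `lookup` are hypothetical values chosen for illustration and do not come from the text.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed cache line size
NUM_SETS = 4     # assumed number of sets
WAYS = 2         # assumed associativity


@dataclass
class DirEntry:
    tag: int = 0
    state: str = "I"  # one of "M", "E", "S", "I"


def split_address(addr):
    """Split a byte address into (tag, set index)."""
    line = addr // LINE_BYTES
    return line // NUM_SETS, line % NUM_SETS


def make_directory():
    """Create an empty directory: NUM_SETS sets of WAYS invalid entries."""
    return [[DirEntry() for _ in range(WAYS)] for _ in range(NUM_SETS)]


def lookup(directory, addr):
    """Return the matching valid entry (a hit), or None (a miss)."""
    tag, index = split_address(addr)
    for entry in directory[index]:
        if entry.state != "I" and entry.tag == tag:
            return entry
    return None


directory = make_directory()
tag, index = split_address(0x1240)
directory[index][0] = DirEntry(tag, "S")
```

With this setup, `lookup(directory, 0x1240)` returns the shared entry, while an address in the next cache line maps to a different, empty set and misses.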
Current implementations of a coherent symmetric multiprocessor (SMP) require a coherence bus on which all memory transactions that change the state of any of the lines in the caches can be "snooped" (i.e., observed) by all processors. In response to a snoop operation by a processor, all the processors must interrogate their cache directories to determine whether the line involved in the transaction is cached in that processor's cache. This process may also involve broadcasting the snoop out to the coherency buses. If a matching directory entry is found, indicating that the cache line is present, the cache line may have to be written back to the next level of cache, written back to main memory, or invalidated, depending on the transaction observed. This coherency scheme has the disadvantage that either the processors must arbitrate for a coherence bus, and thus incur delays, or a separate coherence bus must be provided for each processor as illustrated in FIG. 1A, requiring the implementation of a large number of external connections (pins) on the limited real estate of the processor.
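How a snooping cache decides between writing back and invalidating a line can be sketched with the conventional MESI responses to an observed transaction. The function name `snoop_response` and the two transaction labels are assumptions for this sketch; actual protocols distinguish more transaction types and may transfer data cache to cache.

```python
def snoop_response(state, transaction):
    """Return (next_state, write_back) for a line held in `state`.

    `transaction` is "read" (another processor reads the line) or
    "write" (another processor intends to modify the line).
    """
    if state not in ("M", "E", "S", "I"):
        raise ValueError(f"unknown state: {state!r}")
    if transaction not in ("read", "write"):
        raise ValueError(f"unknown transaction: {transaction!r}")
    if state == "I":
        # Snoop miss: the line is not cached here, nothing to do.
        return "I", False
    # Only a modified line holds data newer than the lower levels.
    write_back = state == "M"
    if transaction == "read":
        return "S", write_back  # keep a shared copy
    return "I", write_back      # another processor takes ownership
```

For example, a modified line observed in a remote read is written back and kept as shared, whereas a shared line observed in a remote write is simply invalidated.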
As shown by FIG. 1A, the SMP comprises four processing modules 101A-101D, each having a respective central processing unit (CPU) 103A-103D and level 1 (L1) cache 105A-105D. L1 caches 105A-105D each have an associated L1 directory 107A-107D; the L1 directories are interconnected to each other via a series of cache coherency buses 111. Cache coherency buses 111 extend from pins (connectors) of processing modules 101A-101D to pins of the other processing modules and to the L2 cache 109. The number of pins required for the connections and the real estate required for the coherency buses depend on the number of processors within the multi-processing system that support coherency operations. Thus, with current 32-way, 64-way, and larger SMPs, the number of required pins and the complexity of the coherency buses may be prohibitive to further development of large SMPs on progressively smaller real estate.
An alternative coherency scheme currently being utilized provides a "directory-based coherence," by which the state information of the L1 directories is included in the L2 directory. FIG. 1B illustrates this coherency scheme. As shown, L2 directory 156 contains the directory entries of L1 directories 155A-155D. However, since there are typically many more lines in the L2 cache 159 than in the combined L1 caches of processors 151A-151D, the directory-based scheme utilizes more chip area, requires a large amount of storage devoted to coherence, and takes a longer time to interrogate (snoop) the L1 directory information because the entire L2 directory 156 has to be viewed.
In light of the foregoing, the present invention recognizes that it would be desirable to provide a processor-cache configuration that supports more efficient coherency operations without requiring additional hardware. A processor-cache configuration that reduces the number of cache coherency buses and associated coherency bus transactions required to support coherency would be a welcome improvement. These and other benefits are provided by the invention described herein.
Disclosed is a processor-cache configuration and operational scheme within a multi-processor data processing system having a shared lower level cache (or memory) by which the number of coherency buses is reduced and more efficient snoop resolution and coherency operations with the processor caches are provided. A copy of the processor's internal (L1) cache directory is provided within the lower level (L2) cache or memory. Lower level snoop operations and coherency operations directed to the L1 cache are evaluated and completed utilizing the copy of the L1 directory in the L2 cache. Updates to the coherency states of the copy of the L1 directory are mirrored in the L1 directory and L1 cache. The configuration and operational scheme eliminates the need for the individual coherency buses interconnecting each processor that is coupled to the L2 cache and speeds up coherency operations because the snoops do not have to be transmitted to the L1 caches for initial resolution.
In the preferred embodiment, the L1 directory and L1 directory copy are initialized during system boot. A processor request for update is received and a check is made for a snoop hit in the L2 cache and in the copy of the L1 directory. That is, the snooped address is compared against the address tags of the copy of the L1 directory. If a snoop hit occurs in the L2 cache and in the copy of the L1 directory, then the snoop resolution utility of the L2 cache communicates with the processor to resolve the snoop, and the L1 directory and the copy of the L1 directory are updated accordingly.
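The snoop check against the copy of the L1 directory can be sketched as a simplified software model. The class names, the dictionary representation of the directories, the `forwarded` counter, and the choice of an invalidating snoop are all assumptions made for illustration; the sketch is not the disclosed hardware implementation.

```python
class L1Directory:
    """Processor-side L1 directory: address tag -> MESI state letter."""

    def __init__(self):
        self.lines = {}


class L2WithL1Copy:
    """L2 cache model holding a copy of the L1 directory."""

    def __init__(self, l1_dir):
        self.l1_dir = l1_dir
        self.l1_copy = dict(l1_dir.lines)  # initialized together at boot
        self.l2_lines = {}
        self.forwarded = 0  # snoops that had to involve the processor

    def fill(self, tag, state):
        """Record a line brought into both L2 and L1 (and the L1 copy)."""
        self.l2_lines[tag] = state
        self.l1_copy[tag] = state
        self.l1_dir.lines[tag] = state

    def snoop_invalidate(self, tag):
        """Resolve an invalidating snoop; return (hit_l2, hit_l1_copy)."""
        hit_l2 = self.l2_lines.get(tag, "I") != "I"
        hit_l1 = self.l1_copy.get(tag, "I") != "I"
        if hit_l2 and hit_l1:
            # Only now is the processor contacted; both the copy and
            # the L1 directory itself are updated to stay mirrored.
            self.forwarded += 1
            self.l1_copy[tag] = "I"
            self.l1_dir.lines[tag] = "I"
        if hit_l2:
            self.l2_lines[tag] = "I"
        return hit_l2, hit_l1


l1 = L1Directory()
l2 = L2WithL1Copy(l1)
l2.fill(0x12, "S")
l2.l2_lines[0x34] = "S"  # line present in L2 only, not in L1
```

In this model a snoop for tag `0x34` is resolved entirely within the L2 cache, and only a snoop for tag `0x12` is forwarded to the processor, which reflects the reduction in L1 snoop traffic described above.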
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.