1. Technical Field
The present invention relates to data processing systems and, in particular, to memory systems of a multiprocessor data processing system. Still more particularly, the present invention relates to a method and system for providing more efficient operation of caches in a multiprocessor data processing system.
2. Description of the Related Art
A data-processing system typically includes a processor coupled to a variety of storage devices arranged in a hierarchical manner. In addition to a main memory, a commonly employed storage device in the hierarchy includes a high-speed memory known as a cache memory. A cache memory speeds up the apparent access times of the relatively slower main memory by retaining the data or instructions that the processor is most likely to access again, and making the data or instructions available to the processor at a much lower latency. As such, cache memory enables relatively fast access to a subset of data and/or instructions that were recently transferred from the main memory to the processor, and thus improves the overall speed of the data-processing system.
In a conventional symmetric multiprocessor (SMP) data processing system, all of the processors are generally identical, insofar as the processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system may comprise a system memory, a plurality of processing elements that each include a processor and one or more levels of cache memory, and a system bus coupling the processing elements to each other and to the system memory. Many such systems include at least one level of cache memory that is shared between two or more processors, and support direct processor cache to processor cache transfer of data (or intervention). To obtain valid execution results in an SMP data processing system, it is important to maintain a coherent memory hierarchy, that is, to provide a single view of the contents of memory to all of the processors.
During typical operation of a cache hierarchy that supports intervention among processor caches, a cache line that is sought to be modified is requested via an address broadcast mechanism that utilizes the system bus/interconnect (i.e., the address of the cache line is sent out to all the caches). As the number of processors that make up the multiprocessor system increased, a switch-based configuration was utilized in place of the traditional bus configuration to connect the processors to each other. Utilization of a switch enables inter-processor (or processor group) operations (e.g., requests, commands, etc.) to be sent directly (i.e., without a broadcast to the entire system).
The size of multiprocessor systems, particularly the number of processors and/or processor groups, is continually increasing. For example, an 8-way processing system may be interconnected to seven other similar 8-way processing systems to create a 64-way processing system with 8 independent processing nodes. In addition to the increase in the number of processors and processor speeds, increases in the size of caches and resulting longer latency for coherency operations transacted on the cache led to the creation and utilization of cache directories and the implementation of directory-based cache coherency. Accordingly, each memory/cache component comprises a memory/cache directory, which is primarily utilized for reducing snoop response times and maintaining cache coherency more efficiently.
A coherent memory hierarchy is maintained through the use of a selected memory coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of a coherency state is stored in association with each coherency granule (i.e., cache line) of at least all upper level (cache) memories. Each coherency granule can have one of four states, modified (M), exclusive (E), shared (S), or invalid (I), which can be encoded by two bits in the cache directory. Those skilled in the art are familiar with the MESI protocol and its use to ensure coherency in memory operations.
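By way of illustration only, the two-bit encoding of the four MESI states might be sketched as follows. The specific bit patterns and the names `MESI`, `encode_state`, `decode_state`, and `is_valid` are assumptions made for this sketch; the text above states only that two bits suffice, not which pattern represents which state.

```python
from enum import IntEnum


class MESI(IntEnum):
    """Four coherency states; the bit values are an assumed assignment."""
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


def encode_state(state: MESI) -> int:
    """Return the two-bit directory encoding of a coherency state."""
    return int(state) & 0b11


def decode_state(bits: int) -> MESI:
    """Recover the coherency state from the low two bits of a directory field."""
    return MESI(bits & 0b11)


def is_valid(state: MESI) -> bool:
    """A line can satisfy a local request only if it is not invalid."""
    return state is not MESI.INVALID
```

Because four states fit exactly in two bits, every bit pattern decodes to a defined state and no pattern is left unused.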
Each cache line (block) of data in an SMP system typically includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. In current processing systems, both the address tag field and the state bit field are contained in the cache directory. This cache directory may be organized under any caching scheme available, such as fully associative, direct mapped, or set-associative, as is well known in the art. The tag within the address tag field may be a full address for a fully associative directory, or a partial address for a direct-mapped directory or a set-associative directory. The bits within the state bit field are utilized to maintain cache coherency for the data-processing system.
FIG. 2 illustrates a cache with associated cache directory according to current processor designs. Cache 201 comprises 64 cache lines consecutively numbered 0-63. As illustrated in FIG. 2, cache 201 has associated cache directory 203, which consists of address tag and coherency state bits. The address tag is a subset of the full address of the corresponding memory block. During operation, a compare match of an incoming address with one of the tags within the address tag field indicates a cache “hit” if the entry is in a valid state. If no compare match occurs or the entry is in the invalid (I) state, then a cache “miss” occurs.
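The hit/miss determination described above may be sketched as follows for a 64-line directory. This is a minimal illustration that assumes a direct-mapped organization and a 128-byte line; neither is specified for cache 201, and the names `DirectoryEntry`, `split_address`, and `lookup` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_LINES = 64    # cache 201 holds 64 lines, numbered 0-63
LINE_BYTES = 128  # assumed line size; not given in the text


@dataclass
class DirectoryEntry:
    tag: int = 0
    state: str = "I"  # one of "M", "E", "S", "I"


def split_address(addr: int) -> Tuple[int, int]:
    """Split a byte address into (tag, index) for a direct-mapped cache."""
    line_number = addr // LINE_BYTES
    return line_number // NUM_LINES, line_number % NUM_LINES


def lookup(directory: List[DirectoryEntry], addr: int) -> bool:
    """Return True on a hit: the tag matches and the entry is not invalid."""
    tag, index = split_address(addr)
    entry = directory[index]
    return entry.tag == tag and entry.state != "I"


# Install one shared line, then look it up.
directory = [DirectoryEntry() for _ in range(NUM_LINES)]
tag, index = split_address(0x12345)
directory[index] = DirectoryEntry(tag=tag, state="S")
```

In this sketch a miss arises in either of the two ways named above: the tag fails to match, or the tag matches but the entry is in the I state.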
Improvements in silicon technology and related areas have resulted in an increase in cache sizes and thus in the amount of data each cache is able to hold. Consequently, very few cache misses occur because the requested data is not present in the local processor cache. Rather, those misses which occur today are primarily due to invalidates, i.e., the local cache line exists in the I coherency state. Local cache misses are thus more likely to occur due to snooped “invalidation” operations than due to the cache not having the data.
Typically, a bus “snooping” technique is utilized to invalidate cache lines during cache coherency operation. Each cache performs a snooping operation by which changes to cache lines that are sent on the system bus are reflected within the local cache in order to maintain coherency amongst the caches. For example, whenever a read or write is performed, the address of the data is broadcast from the originating processor core to all other caches sharing a common bus (or connected via a switch). Each cache snoops the address from the bus and compares the address with an address tag array in the cache directory. If a hit occurs, a snoop response is returned which triggers a coherency operation, such as invalidating the hit cache line, in order to maintain cache coherency.
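The snoop-and-invalidate behavior described above may be illustrated with the following simplified sketch. It models only a write broadcast that invalidates other copies; the classes `Cache` and `Bus`, the response strings, and the use of the full address as the tag are assumptions made for brevity, not details taken from any particular bus protocol.

```python
class Cache:
    """A toy cache whose directory maps address tags to MESI state letters."""

    def __init__(self, name):
        self.name = name
        self.directory = {}

    def snoop_write(self, addr):
        """Compare a snooped address with the local tags; invalidate on a hit."""
        if self.directory.get(addr, "I") == "I":
            return "miss"
        self.directory[addr] = "I"
        return "hit-invalidated"


class Bus:
    """Broadcasts each write address to every cache except the originator."""

    def __init__(self, caches):
        self.caches = caches

    def broadcast_write(self, origin, addr):
        responses = {c.name: c.snoop_write(addr)
                     for c in self.caches if c is not origin}
        origin.directory[addr] = "M"  # the writer now holds the only valid copy
        return responses


# Caches B and C share a line; cache A then writes it.
a, b, c = Cache("A"), Cache("B"), Cache("C")
b.directory[0x40] = "S"
c.directory[0x40] = "S"
bus = Bus([a, b, c])
responses = bus.broadcast_write(a, 0x40)
```

After the broadcast, caches B and C retain the tag for the line but hold it in the I state, which is the condition that later produces an invalidate miss.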
When a local cache miss occurs, the requesting processor typically broadcasts the request by sending the address out to the system bus (or switch). A snoop response of “retry” is issued from a cache with the valid data when the cache has a modified copy of the data that must first be pushed out of the cache or when there was a problem that prevented appropriate snooping. In the case of a retry response, the processor from which the request originated will retry the read or write operation until the data is received. The processor is forced to broadcast the retry because no information is available as to which processor/cache has a valid copy of the requested data. This often leads to a large number of retry operations that utilize significant bus resources and degrade overall processor speed and performance (i.e., long latencies/coherency resolution and high retry penalties).
The present invention recognizes that, in light of technological improvements (i.e., larger caches and increased processor speeds) and the subsequent increased occurrence of cache misses due primarily to invalidates, it would be desirable to provide a method and system that allows a processor to quickly retrieve correct data when an invalidate is encountered for a desired cache line. A system, method, and processor cache configuration that reduces the incidence of retries from a processor node in response to a cache miss caused by an invalidated cache line would be a welcome improvement. These and other benefits are provided by the present invention described herein.
Disclosed is a method, system, and processor cache configuration that enables efficient retrieval of valid data in response to an invalidate cache miss at a local processor cache. A cache directory is enhanced by appending a set of directional bits in addition to the coherency state bits and the address tag. The directional bits provide information that includes the processor cache identification (ID) and routing method. The processor cache ID indicates which processor operation resulted in the cache line of the local processor changing to the invalid (I) coherency state. The processor operation may be issued by a local processor or by a processor from another group or node of processors if the multiprocessor system comprises multiple nodes of processors. The routing method indicates what transmission method to utilize to forward a request for the cache line. The request may be forwarded to a local system bus or directly to another processor group via a switch or broadcast mechanism. Processor/cache directory logic is provided to set and interpret the values of the directional bits and provide responses depending on the values of the bits.
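One possible layout of such an enhanced directory entry is sketched below. The field widths (6 ID bits, 2 routing bits), the field ordering, the routing codes, and the function names `pack_entry` and `unpack_entry` are all assumptions chosen for illustration; the text above does not specify them.

```python
STATE_BITS = 2  # MESI coherency state
ID_BITS = 6     # assumed width: enough to name 64 processor caches
ROUTE_BITS = 2  # assumed width: room for the routing codes below

# Hypothetical routing codes.
ROUTE_LOCAL_BUS, ROUTE_SWITCH, ROUTE_BROADCAST = 0, 1, 2


def pack_entry(tag, state, proc_id, route):
    """Pack tag | state | processor cache ID | routing method into one integer."""
    assert 0 <= state < (1 << STATE_BITS)
    assert 0 <= proc_id < (1 << ID_BITS)
    assert 0 <= route < (1 << ROUTE_BITS)
    word = tag
    word = (word << STATE_BITS) | state
    word = (word << ID_BITS) | proc_id
    word = (word << ROUTE_BITS) | route
    return word


def unpack_entry(word):
    """Inverse of pack_entry; returns (tag, state, proc_id, route)."""
    route = word & ((1 << ROUTE_BITS) - 1)
    word >>= ROUTE_BITS
    proc_id = word & ((1 << ID_BITS) - 1)
    word >>= ID_BITS
    state = word & ((1 << STATE_BITS) - 1)
    tag = word >> STATE_BITS
    return tag, state, proc_id, route
```

For example, `pack_entry(0x1A2B, 0, 37, ROUTE_SWITCH)` yields an entry that `unpack_entry` splits back into the same four fields.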
During operation, a snooping processor causes the cache state of the snooped cache line to be set to invalid. When a local processor, i.e., a processor associated with the snooped cache, issues a request for the cache line, the local processor reads the invalid coherency state of the cache line from the cache directory. The cache directory logic then reads the directional bits and forwards the request to the specific processor (or cache) indicated by the identification bits via the routing mechanism indicated by the routing bits.
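The sequence just described may be illustrated by the following sketch, in which a snooped invalidation records the directional bits and a later local request is forwarded accordingly. The class and method names, the string routing values, and the fallback to a broadcast when no directional information exists are assumptions of this sketch rather than features stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    state: str                       # "M", "E", "S", or "I"
    holder_id: Optional[int] = None  # processor cache ID bits (set on invalidation)
    route: Optional[str] = None      # routing bits (set on invalidation)


class DirectedCache:
    def __init__(self, cache_id: int):
        self.cache_id = cache_id
        self.directory: Dict[int, Entry] = {}

    def snoop_invalidate(self, addr: int, source_id: int, route: str) -> None:
        """Invalidate a valid local copy and record the source of the invalidation."""
        entry = self.directory.get(addr)
        if entry is not None and entry.state != "I":
            entry.state = "I"
            entry.holder_id = source_id
            entry.route = route

    def request(self, addr: int):
        """Decide how to satisfy a local processor request for addr."""
        entry = self.directory.get(addr)
        if entry is None:
            return ("broadcast", None, None)  # assumed fallback: no entry at all
        if entry.state != "I":
            return ("local_hit", self.cache_id, None)
        if entry.holder_id is None:
            return ("broadcast", None, None)  # invalid, but no directional bits
        return ("forward", entry.holder_id, entry.route)


# A shared line is invalidated by processor cache 5, reachable via the switch.
cache = DirectedCache(cache_id=0)
cache.directory[0x100] = Entry(state="S")
before = cache.request(0x100)
cache.snoop_invalidate(0x100, source_id=5, route="switch")
after = cache.request(0x100)
```

Here `before` is a local hit, while `after` directs the request to processor cache 5 over the switch, so that no system-wide broadcast or retry is needed to locate the valid copy.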
The above, as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.