1. Technical Field
The present invention relates generally to data processing systems and in particular to cache coherency operations within a multiprocessor data processing system (MP). Still more particularly, the present invention relates to chained intermediate coherency states for successive non-homogenous operations involving sequential accesses of a single cache line by multiple processors in an MP.
2. Description of the Related Art
A conventional multiprocessor data processing system (referred to hereinafter as an MP) typically comprises a system memory, input/output (I/O) devices, a plurality of processing elements that each include a processor and one or more levels of high-speed cache memory, and a system interconnect coupling the processing elements to each other and to the system memory and I/O devices. The processors may utilize common instruction sets and communication protocols, may have similar hardware architectures, and may generally be provided with similar memory hierarchies.
Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory. Each cache comprises a cache array, cache directory and an associated cache controller that manages the transfer of data and instructions between the processor core or system memory and the cache. Typically, the cache directory also contains a series of bits utilized to track the coherency states of the data in the cache. In addition, during certain operations, a controlling “intermediate coherency state” that overrides the directory state for the cache line may be maintained by the cache controller logic during the completion of the operation.
With multiple caches within the memory hierarchy, a coherent structure is required for valid execution results in the MP. This coherent structure provides a single view of the contents of memory to all of the processors and other memory access devices, e.g., I/O devices. A coherent memory hierarchy is maintained through the use of a coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of a coherency state is stored in association with each coherency granule (e.g., cache line or sector) of at least all upper level (cache) memories. Each coherency granule can have one of the four MESI states, which is indicated by bits in the cache directory's SRAM or by intermediate coherency states within the cache controller.
In the MESI protocol, a cache line of data may be tagged with one of four states: “M” (Modified), “E” (Exclusive), “S” (Shared) or “I” (Invalid). The modified state indicates that a coherency granule is valid only in the cache storing the modified coherency granule and that the value of the modified coherency granule has not been written to system memory. When a coherency granule is indicated as exclusive, then only that cache has the coherency granule. The data in the exclusive state is consistent with system memory, however. If a coherency granule is marked as shared in a cache directory, the coherency granule is resident in the associated cache and potentially one or more other caches within the memory hierarchy, and all of the copies of the coherency granule are consistent with system memory and one another. Finally, the invalid state indicates that the data and address tag associated with a coherency granule are both invalid.
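By way of illustration only, the four states described above and the properties attributed to each may be summarized in a short Python sketch. The names `MESI`, `is_valid`, `consistent_with_memory` and `may_exist_in_other_caches` are hypothetical and are introduced solely for this example; they are not part of any particular implementation.

```python
from enum import Enum


class MESI(Enum):
    """The four coherency states of the MESI protocol."""
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_valid(state: MESI) -> bool:
    # Only the invalid state marks the data and address tag as invalid.
    return state is not MESI.INVALID


def consistent_with_memory(state: MESI) -> bool:
    # Exclusive and shared copies match system memory; a modified copy has
    # not been written back. An invalid line holds no usable copy, so this
    # sketch reports False for it as well.
    return state in (MESI.EXCLUSIVE, MESI.SHARED)


def may_exist_in_other_caches(state: MESI) -> bool:
    # Only a shared granule may also be resident in other caches.
    return state is MESI.SHARED
```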
It is important to note that the present application makes a distinction between “instructions” that a processing element may execute, for example, to load data from a memory location or to store new data into a memory location, and the “operations” these instructions may cause on the system interconnect linking the various processing elements within an MP. For example, a load instruction may, in the event of a cache miss, cause a READ operation on the system interconnect to be issued from the processing element executing the load instruction. The READ operation on the system interconnect causes a current copy of the data to be delivered to the issuing processing element and informs the other participants in the MP that the data is merely being read, but not modified. If a load instruction hits in a cache, typically no operation is generated on the system interconnect and the data is returned to the processing element from the cache directly.
As another example, when a store instruction is executed and misses the cache, a RWITM (Read With Intent to Modify) operation is typically generated on the system interconnect. A RWITM operation on the system interconnect causes a current copy of the data to be delivered to the issuing processing element and informs any other participants in the MP to invalidate their copies, as those copies are about to become stale. If, however, the store instruction hits the line in the cache in a shared state, it typically issues a DCLAIM operation. The DCLAIM operation informs the other participants that the issuing cache wishes to gain ownership of the cache line in order to update it and that they should invalidate their copies. The DCLAIM operation does not return a copy of the cache line to the issuing cache since the issuing cache has a current copy of the line already. If the store instruction hits an M or E line in the cache, the line is owned by, and only present in, the current cache. The cache controller logic updates the line immediately and sets the cache state to M if the line was in the E state (the cache line is no longer consistent with memory and therefore cannot be left in the E state).
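The relationship between instructions and interconnect operations described above may be sketched as follows, purely as an illustration. The function names `operation_for` and `state_after_local_store_hit` are hypothetical, a cache miss is assumed to be represented by the "I" state, and the resulting state after a READ or RWITM completes is deliberately not modeled.

```python
from typing import Optional


def operation_for(instruction: str, state: str) -> Optional[str]:
    """Return the interconnect operation typically issued, or None.

    `instruction` is "load" or "store"; `state` is the line's MESI state
    in the local cache ("M", "E", "S" or "I", where "I" denotes a miss).
    """
    if state not in ("M", "E", "S", "I"):
        raise ValueError(f"unknown state: {state}")
    if instruction == "load":
        # A load hit is served from the cache; a miss issues a READ.
        return "READ" if state == "I" else None
    if instruction == "store":
        if state == "I":
            return "RWITM"   # miss: fetch the data and claim ownership
        if state == "S":
            return "DCLAIM"  # hit on a shared line: claim ownership only
        return None          # hit on M or E: line is already owned
    raise ValueError(f"unknown instruction: {instruction}")


def state_after_local_store_hit(state: str) -> str:
    """State after a store hits a line that is already owned (M or E)."""
    if state not in ("M", "E"):
        raise ValueError("line is not owned; an interconnect operation is needed")
    # An E line becomes M because it no longer matches system memory.
    return "M"
```

In this sketch, only a store that hits an M or E line completes without any traffic on the system interconnect.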
The state to which each coherency granule (e.g., cache line) is set is dependent upon both a previous coherency state of the data within the cache line and the type of memory access request received from a requesting device (e.g., the processor). Accordingly, maintaining memory coherency in the MP requires that the processors communicate messages across the system bus indicating their intention to read or to update a memory location. For example, when a processor desires to write a memory location, the processor must first inform all other processing elements of its intention to update the data in the memory location and receive permission from all other processing elements to carry out the update operation. The permission messages received by the requesting processor indicate that all other cached copies of the contents of the memory location have been invalidated, thereby guaranteeing that the other processors will not access a now stale local copy of the data.
Typical system interconnects comprise two distinct parts: an address portion utilized to transmit operations and the individual and combined responses for those operations, and a data portion utilized to transfer data between participants in the system. An operation is first broadcast on the address portion of the system interconnect. As the operation is broadcast, each participant generates an individual partial response to the operation, and these partial responses are combined into a “combined response” that is then broadcast to all the participants in the MP. The combined response indicates the overall success or failure of the requested operation. The time from the broadcast of the operation onto the address portion of the system interconnect to the receipt of the combined response by a participant is referred to as the “address tenure” for the operation.
Typical operations that affect the coherency state of cache lines include READs, RWITMs, DCLAIMs, and CASTOUTs (CO). A castout operation is used to evict a modified cache line from a cache back to main memory when a new line is being brought into the cache and displaces the modified line.
Some operations, such as the DCLAIM operation described above, only require an address tenure to complete because no data is transferred. However, other operations, such as READ and RWITM, also require a subsequent data tenure on the data portion of the system interconnect after successful completion of the address tenure, in order to transfer data from one participant to another within the system. The data tenure for an operation commences when the data is placed on the data portion of the system interconnect from the sourcing participant and concludes when all the data is received and processed at the requesting participant.
Address operations on the address portion of the system interconnect are often allowed to proceed independently from data tenures in a pipelined fashion. In other words, subsequent address tenures can occur on the address portion of the system interconnect concurrently with a data tenure occurring on the data portion of the system interconnect that is associated with a previously successful address tenure. Such an interconnect is commonly referred to as a “split-transaction” interconnect and is well known to those skilled in the art.
A data transfer operation usually consists of an address tenure and a data tenure between two participants: a sourcing participant and a requesting participant. To effect the data transfer, the requesting participant places a bus operation such as a READ or RWITM on the system interconnect requesting a copy of the line and, in the case of a RWITM, ownership permission to update a memory location within the cache line. During the address tenure of the request, other participants snoop the operation, produce a partial response, and, if possible, perform steps necessary to honor the request. The other participants utilize the partial response to indicate their ability to honor the request. In particular, for a cache-to-cache transfer, a cache with a current copy of the cache line activates cache controller logic necessary to deliver the data requested if appropriate and becomes the tentative sourcing participant. Other participant caches indicate their ability to remove the cache line if necessary (for example, for a RWITM).
If a participant cannot honor a request, the participant generates a “Retry” response. This response indicates that the participant cannot honor the request, for whatever reason, and that the request should be retried at a later time. At the conclusion of the request address tenure, the combined response is generated from the individual partial responses and broadcast to the participants to indicate whether or not the request can be fulfilled. If the request cannot be fulfilled for some reason, the requesting master re-attempts the request at a later time and the tentative sourcing participant, if any, is released with no transfer occurring.
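The formation of a combined response from the partial responses may be sketched as follows. This is a simplified illustration only: the response names "RETRY", "NULL" and "INTERVENE" and the function `combine_responses` are hypothetical, and the rule that a single Retry partial response causes the whole request to be retried is an assumption made for this example rather than a description of any particular interconnect.

```python
from typing import Iterable


def combine_responses(partial_responses: Iterable[str]) -> str:
    """Combine per-participant partial responses into one combined response.

    Assumed (hypothetical) partial responses:
      "RETRY"     - the participant cannot honor the request at this time
      "INTERVENE" - the participant can source the requested line
      "NULL"      - the participant has no stake in the line
    """
    responses = list(partial_responses)
    # Assumption: any Retry means the request cannot be fulfilled, so the
    # requesting master must re-attempt it at a later time.
    if any(r == "RETRY" for r in responses):
        return "RETRY"
    return "SUCCESS"


def tentative_source_released(combined_response: str) -> bool:
    # On a failed request, the tentative sourcing participant is released
    # and no transfer occurs.
    return combined_response == "RETRY"
```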
However, if the cache-to-cache data request can be fulfilled, the cache controller logic in the sourcing participant initiates a data tenure on the data portion of the system interconnect and transfers the data from the sourcing cache to the destination cache. Such a cache-to-cache transfer is referred to as an “intervention”. The data tenure completes when the data is received and processed by cache controller logic in the receiving participant. During the data transfer operation (address and data tenure), the cache directories for both the source and destination cache are updated to the proper coherency state based on the current states of the caches and the type of operation involved (i.e., READ or RWITM).
Typically, during the address and data tenure for a data transfer operation, subsequent address tenures from other participants that target the same cache line are retried. This is because the line is currently being transferred from one cache to another and is in a state of transition while the cache directory states are being updated. The data and address tenures for a given data transfer operation between a given sourcing and destination participant must typically be completed before subsequent data transfer operations for the given cache line may be processed.
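The conventional behavior described above, in which requests for a line that is in transit are retried, may be illustrated with the following minimal sketch. The class `LineTransferTracker`, its methods, and the response strings are hypothetical names chosen for this example.

```python
class LineTransferTracker:
    """Tracks cache lines whose address or data tenure is still in progress."""

    def __init__(self) -> None:
        self._in_transfer: set[int] = set()

    def begin_transfer(self, line_address: int) -> None:
        # Called when a data transfer operation for the line starts.
        self._in_transfer.add(line_address)

    def complete_transfer(self, line_address: int) -> None:
        # Called once the data tenure for the line has completed.
        self._in_transfer.discard(line_address)

    def snoop(self, line_address: int) -> str:
        # A request for a line that is still in transition is retried;
        # otherwise the request may be processed normally.
        return "RETRY" if line_address in self._in_transfer else "PROCEED"
```

For example, after `begin_transfer(0x100)`, a call to `snoop(0x100)` yields a retry until `complete_transfer(0x100)` is called, while requests for other lines are unaffected.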
While the above has described data transfers from one cache to another, those skilled in the art will appreciate that the memory controller can also source data to a requesting cache in a manner similar to that used when a sourcing cache intervenes to a destination cache. Transfers sourced from a memory controller proceed in the same manner as cache-to-cache transfers, except that control logic in the memory controller is responsible for snooping the operation and for initiating the data tenure, and no state update is performed in the memory controller because coherency state information is not maintained within the memory controller.
As more processors are added on a bus, and depending on the application being run, there may be contention among processors for certain cache lines, such as those containing synchronization objects. Each requesting processor continues to put the same request on the bus until access to the cache line data is provided to the requesting processor. In such cases, a substantial amount of bus bandwidth is wasted on requests that have to be continually retried. The system bus becomes bogged down with this cycle of repeated access requests and associated retry responses.
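The bandwidth cost of such contention may be illustrated with a deliberately simplified toy model. The model and the function name `simulate_contention` are hypothetical: it is assumed that every waiting processor reissues its request in each address slot, that at most one request is granted when the line is free, and that each granted transfer keeps the line busy for a fixed number of subsequent slots during which all requests are retried.

```python
def simulate_contention(num_requesters: int, transfer_slots: int) -> tuple[int, int]:
    """Return (total_requests, retried_requests) for a toy contention model.

    num_requesters: processors that each want the same cache line once.
    transfer_slots: address slots during which the line stays busy after
                    a request is granted (all requests are retried then).
    """
    waiting = list(range(num_requesters))
    busy = 0
    requests = 0
    retries = 0
    while waiting:
        # Every waiting processor puts its request on the bus in this slot.
        requests += len(waiting)
        if busy > 0:
            # The line is in transit: every request is retried.
            retries += len(waiting)
            busy -= 1
        else:
            # One request is granted; all remaining requests are retried.
            waiting.pop(0)
            retries += len(waiting)
            busy = transfer_slots
    return requests, retries
```

Under these assumptions, four requesters and a transfer that occupies two slots produce 22 requests on the bus, of which 18 are retried and only 4 are successful.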
Also, there is currently no way for the cache with current ownership of the cache line data to track which request from the multiple requesting processors was snooped first. The processor that is sent the cache line following the completion of the data tenure may not be the processor that first requested the line. Inefficiencies are thus built into MPs that utilize the currently available MESI coherency protocol to track and coordinate data access operations within the memory hierarchy.
The present invention recognizes that it would be desirable to provide a method and system by which the latency of coherency response for subsequent, successive accesses to a cache line is hidden or substantially reduced. A cache coherency protocol that allows for continued coherency operations while the data is still being transferred to a previous master's cache would be a welcome improvement. The invention further recognizes the desirability of reducing cyclical requests and retries on the system bus between a device requesting the cache line data and the master device when the master device does not yet have the data within its cache. These and other features are provided by the invention described herein.