1. Technical Field
The present invention relates generally to data processing systems and, in particular, to processor-cache operations within a multiprocessor data-processing system. Still more particularly, the present invention relates to SMP system optimization via efficient cache coherency operations.
2. Description of the Prior Art
A data-processing system typically includes a processor coupled to a variety of storage devices arranged in a hierarchical manner. In addition to a main memory, a commonly employed storage device in the hierarchy includes a high-speed memory known as a cache memory (or cache). A cache speeds up the apparent access times of the relatively slower main memory by retaining the data or instructions that the processor is most likely to access again, and making the data or instructions available to the processor at a much lower latency. As such, caches enable relatively fast access to a subset of data and/or instructions that were recently transferred from the main memory to the processor, and thus improve the overall speed of the data-processing system.
Most contemporary high-performance data processing system architectures include multiple levels of cache memory within the memory hierarchy. Cache levels are typically arranged in order of progressively longer access latencies. Smaller, faster caches are employed at levels within the storage hierarchy closer to the processor (or processors) while larger, slower caches are employed at levels closer to system memory.
In a conventional symmetric multiprocessor (SMP) data processing system, all of the processors are generally identical, insofar as the processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system, as illustrated in FIG. 1A, may comprise a system memory 107, a plurality of processing elements 101A-101D that each include a processor and one (or more) level(s) of cache memory 103A-103D, and a system bus 105 coupling the processing elements (processors) 101A-101D to each other and to the system memory 107. Many such systems include at least one level of cache memory shared between two or more processors. Additionally, a "shared" cache line 109 may exist in each cache memory 103A-103D. To obtain valid execution results in an SMP data processing system, it is important to maintain a coherent memory hierarchy, that is, to provide a single view of the contents of memory to all of the processors.
A coherent memory hierarchy is maintained through the use of a selected memory coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of a coherency state is stored in association with each cache line of at least all upper level (cache) memories. Each coherency cache line can have one of four states, "M" (Modified), "E" (Exclusive), "S" (Shared) or "I" (Invalid), which can be encoded by two bits in the cache directory.
FIG. 2 illustrates the MESI protocol and its state transition features. Under the MESI protocol, each cache entry (e.g., a 32-byte sector) has two additional bits which indicate the state of the entry, out of the four possible states. Depending upon the initial state of the entry and the type of access sought by the requesting processor, the state may be changed, and a particular state is set for the entry in the requesting processor's cache. For example, when data in a cache line is in the Modified (M) state, the addressed data is valid only in the cache having the modified cache line, and the modified value has not been written back to system memory. When a cache line is in the Exclusive state, the corresponding data is present only in the noted cache, and is consistent with system memory. If a cache line is in the Shared state, the data is valid in that cache and in at least one other cache, with all of the shared data being consistent with system memory. Finally, when a cache line is in the Invalid state, the addressed data is not resident in the cache. As seen in FIG. 2 and known in the art, the state of the cache line transitions between the various MESI states depending upon particular bus or processor transactions.
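By way of illustration only, the four states and the properties attributed to them above may be sketched as follows. The particular two-bit encodings and the helper names (`Mesi`, `holds_valid_data`, `consistent_with_memory`) are assumptions chosen for this sketch and are not taken from FIG. 2.

```python
from enum import Enum


class Mesi(Enum):
    """The four MESI states; the two-bit values are an arbitrary, assumed encoding."""
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


def holds_valid_data(state):
    # In the Invalid state the addressed data is not resident in the cache.
    return state is not Mesi.INVALID


def consistent_with_memory(state):
    # Shared and Exclusive lines match system memory; a Modified line has
    # not yet been written back, and an Invalid line holds no data at all.
    return state in (Mesi.SHARED, Mesi.EXCLUSIVE)


def only_copy_in_system(state):
    # Modified and Exclusive lines are valid in this cache only.
    return state in (Mesi.MODIFIED, Mesi.EXCLUSIVE)
```

Because every encoding fits in two bits, the sketch is consistent with the statement that two directory bits per entry suffice to record the state.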
There are a number of protocols and techniques for achieving cache coherence that are known to those skilled in the art. At the heart of all these mechanisms for maintaining coherency is the requirement that the protocols allow only one processor to have a "permission" (or lock) that allows a write to a given memory location (cache block) at any given point in time. As a consequence of this requirement, whenever a processor (or processing component) attempts to write to a memory location, the processor must first inform all other processing components of the processor's desire to write into a cache line and invalidate all other processing components' copies of the cache line (for the same address).
To implement cache coherency in a system, the processors communicate over a common generalized interconnect (i.e., system bus 105). The processors pass messages over the interconnect indicating their desire to read or write memory locations. When an operation is placed on the interconnect, all of the other processors "snoop" (monitor) this operation and decide if the state of their caches can allow the requested operation to proceed and, if so, under what conditions. There are several bus transactions that require snooping and follow-up action to honor the bus transactions and maintain memory coherency. The snooping operation is triggered by the receipt of a qualified snoop request, generated by the assertion of certain bus signals. Instruction processing is interrupted only when a snoop hit occurs and the snoop state machine determines that an additional cache snoop is required to resolve the coherency of the offended sector.
This communication is necessary because, in systems with caches, the most recent valid copy of a given block of memory may have moved from the system memory to one or more of the caches in the system (as mentioned above). If a processor attempts to access a memory location not present within its cache hierarchy, the correct version of the block, which contains the actual (current) value for the memory location, may either be in the system memory or in one or more of the caches in another processing unit. If the correct version is in one or more of the other caches in the system, it is necessary to obtain the correct value from the cache(s) in the system instead of system memory.
For example, with reference to FIG. 1A, a read transaction that is issued against cache line 109 by P0 (processor 101A) and subsequent coherency operations would evolve as follows. P0 first searches its own L1 cache 103A. If the cache line is not present in the L1 cache 103A, the request is forwarded to the L2 cache, then the L3 cache and so on until the request is presented on the generalized interconnect (system bus 105) to be serviced by one of the other processors or the system memory. Once an operation has been placed on the generalized interconnect, all other processing units P1-P3 snoop the operation and determine if the block is present in their caches. If a given processing unit has the block of data requested by P0 in its L1 cache, and that data is modified, by the principle of inclusion the L2 cache and any lower level caches also have copies of the block (however, their copies are stale, since the copy in the processor's cache is modified). Therefore, when the lowest level cache (e.g., L3) of the processing unit snoops the read instruction, it will determine that the block requested is present and modified in a higher level cache. When this occurs, the L3 cache places a message on the generalized interconnect informing the initiating processing unit that it must "retry" its operation at a later time, because the actual value of the memory location is in the L1 cache at the top of the memory hierarchy and must be retrieved to make it available to service the read request of the initiating processing unit, P0. (In some systems, the "retry" bus operation may be replaced by a data interaction operation.)
Once the request from an initiating processing unit has been retried, the lower level cache begins a process to retrieve the modified data from the L1 cache and make it available. P0 eventually presents the read request on the generalized interconnect again. At this point, however, the modified data has been retrieved from the L1 cache of a processing unit and the read request from the initiating processor will be satisfied.
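The retry sequence described above can be sketched, under simplifying assumptions, as a snooping unit whose lowest level cache answers "retry" to the first snooped read and supplies the data once the modified copy has been retrieved from its L1 cache. The names `SnoopingUnit` and `read_until_served` are hypothetical, and the assumption that retrieval completes before the second read is made only to keep the example short.

```python
class SnoopingUnit:
    """A processing unit holding a modified block in its L1 cache (toy model)."""

    def __init__(self, l1_modified_value):
        self.l1_value = l1_modified_value
        self.available_in_lowest_level = False  # lower-level copy is stale at first

    def snoop_read(self):
        if not self.available_in_lowest_level:
            # The lowest level cache sees the block is modified above it,
            # answers "retry", and begins retrieving the data from L1.
            # Assumption: the retrieval finishes before the next snoop.
            self.available_in_lowest_level = True
            return "retry", None
        return "data", self.l1_value


def read_until_served(snooper, max_attempts=4):
    """Re-present the read on the interconnect until it is served."""
    for attempt in range(1, max_attempts + 1):
        response, value = snooper.snoop_read()
        if response == "data":
            return value, attempt
    raise RuntimeError("read was not served within max_attempts")
```

In this sketch the initiating unit is served on its second presentation of the read, mirroring the single retry in the narrative.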
The essential point is that, when a processor wishes to read or write a block, it must communicate that desire to the other processing units in the system in order to maintain cache coherence. To achieve this, the cache coherence protocol associates with each block in each level of the cache hierarchy a status indicator indicating the current "state" of the block. The state information is used to allow certain optimizations in the coherency protocol that reduce message traffic on the generalized interconnect and the inter-cache connections.
As one example of this mechanism, when a processing unit executes a read, the processing unit receives a message indicating whether or not the read must be retried later. If the read operation is not retried, the message usually includes information allowing the processing unit to determine if any other processing unit also has a still active copy of the block (this is accomplished by having the other lowest level caches give a "shared" or "not shared" indication for any read that is not retried). Therefore, a processing unit can determine whether any other processor in the system has a copy of the block. If no other processing unit has an active copy of the block, the reading processing unit marks the state of the block as "exclusive". If a block is marked exclusive it is permissible to allow the processing unit to later write the block without first communicating with other processing units in the system because no other processing unit has a copy of the block. Therefore, it is possible for a processor to read or write a location without first communicating this intention on the interconnect, but only where the coherency protocol rules are met.
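As a minimal sketch of this optimization, the state assigned to a newly read block can be derived from the combined snoop responses. The function names and the string encodings of the responses are assumptions introduced for illustration.

```python
def state_after_read(snoop_responses):
    """Return the state the reader assigns to the block, or None if the
    read must be retried. Responses are "retry", "shared" or "not shared"."""
    if "retry" in snoop_responses:
        return None  # the read is re-presented later
    if "shared" in snoop_responses:
        return "S"   # at least one other unit holds an active copy
    return "E"       # no other active copy exists


def must_announce_write(state):
    # Only a block held exclusively (or already modified) may be written
    # without first communicating on the interconnect.
    return state not in ("E", "M")
```

For example, three "not shared" responses yield the exclusive state, after which a later write requires no bus operation in this model.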
The foregoing cache coherency technique is implemented in the prior art MESI protocol and illustrated in FIG. 2 and described above. A cache line can become Invalid (e.g., from the Shared state) if the cache snoops an operation from a different processor indicating that the value held in the cache block is to be modified by the other processor, such as by snooping a Read-With-Intent-To-Modify (RWITM) operation.
Some processor architectures, including the PowerPC™ processor, allow the execution of one or more special operations, other than the RWITM operation, when a processor wants to claim a memory block for a future store instruction (modifying the block). The "DClaim" operation is one example. The DClaim operation is used in lieu of the RWITM bus transaction when a valid value for the subject block is already held in the same processor's cache, e.g., in a Shared state (if the value were currently held in a Modified or Exclusive state, there would be no need to broadcast either a RWITM or DClaim request since the processor would already have exclusive control of the block). The processor may be adapted to execute a DClaim operation after examining its on-board (L1) cache to see whether the valid value is resident. If not, the processor can issue a RWITM request, and any lower level cache having the valid value will, upon receiving the RWITM request, convert it into a DClaim operation to be passed to the system bus. The DClaim operation accordingly is an address-only operation since the value does not need to be read (from system memory or any intervening cache). Because of this attribute, the DClaim operation is more efficient than a RWITM operation, which would force the read operation across the system bus. When another cache has the same addressed block in a valid (Shared) state and snoops a DClaim transaction for the block, that other cache switches its corresponding block to an Invalid state, releasing the block so that the requesting processor can proceed to modify the value. In other words, a DClaim transaction appears just like a RWITM operation from a non-intervening snooper.
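The choice between the two claim operations, and the effect on a snooper holding the block in the Shared state, may be sketched as follows. The function names are hypothetical and the sketch covers only the cases discussed above.

```python
def claim_operation(local_state):
    """Select the bus operation needed before a store, given the local state."""
    if local_state in ("M", "E"):
        return None        # exclusive control already held; nothing to broadcast
    if local_state == "S":
        return "DClaim"    # address-only: a valid value is already cached
    return "RWITM"         # the value must also be read across the bus


def snooper_state_after(snooper_state, operation):
    """State of another cache's copy after it snoops a claim operation."""
    if snooper_state == "S" and operation in ("DClaim", "RWITM"):
        return "I"         # both operations look alike to a non-intervening snooper
    return snooper_state
```

The second function reflects the observation that a DClaim appears just like a RWITM to a snooper that does not supply data.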
One problem with DClaim-type coherency operations is that they occasionally (sometimes frequently) suffer significant performance degradation, since completion of the operation can be delayed by coherency responses from other devices in the memory hierarchy. For example, if several caches of different processing units are holding a value in Shared states and they snoop a DClaim operation, their respective processors may repeatedly issue retry messages in response to the DClaim snoop (if these processors are currently busy or otherwise unable to handle the snoop, for whatever reason).
With reference again to FIG. 1A, an example of the coherency response to a modification of a shared cache line is provided. FIG. 1A depicts a 4-way symmetric multiprocessor system (SMP) 100 in which each processor's cache contains a particular cache line 109 in a shared (S) state. In the illustrated SMP 100 of FIG. 1A, processors P0-P3 are depicted, each having an exemplary cache line 109 that is initially in the shared (S) state of the MESI protocol. During operation, P0 issues a store/write operation for cache line 109 (e.g., ST A). Then, P0 acquires a "lock" on the cache line 109. After P0 acquires the lock, the store operation is snooped by the other processors, P1-P3, and each processor changes the coherency state of its local cache line to I and issues a read request for the cache line in P0's cache 103A per the MESI protocol. The store operation causes a DClaim of shared cache line 109, and the DClaim is issued to the system bus. Meanwhile, the read requests are issued on the system bus 105 to acquire the modified cache line. Each of the issuing processors P1-P3 waits for a flag to be set, which indicates that the processor has an opportunity to acquire the lock on the cache line 109 and can get the modified data from P0. All the processors P1-P3 are therefore contending for the same lock on the bus, i.e., all are polling for the same flag. Meanwhile, P0 waits until a "null" response is received in response to the DClaim. If the null response is not received, then the DClaim operation is retried.
When a null response is received, P0's coherency state is changed from S to modified (M). According to current architecture and operational procedures, once the store/write operation is snooped, all the other processors commence issuing reads out to the system bus. Thus the reads are issued in parallel and generally overlap on the system bus. With very large SMPs, e.g., 32-way or 64-way SMPs, the automatic issuance of reads and retries results in the near simultaneous issuing of 32 or 64 read requests to the system bus requiring substantial amounts of system bus bandwidth and utilization of processor resources. Further, because of the large number of requests, significant hardware and software development is required to ensure decent performance and maintain proper cache coherency in these larger systems.
With larger multiprocessing systems, the processors may operate asynchronously, i.e., independent of other processors, in order to achieve higher performance. This adds another level of complexity to the problems of bus utilization for finite amounts of system bus bandwidth to maintain coherency among processor caches.
Returning now to the above-described process, once P0 completes the store operation, P0 releases the lock and P1 acquires the lock from P0 (i.e., P1's flag is set). Read requests from P2 and P3 continue to be retried while P0 intervenes the data to the P1 cache. Then, P0's cache state changes from M to S, and P1's cache state goes from I to S. P1 may then DClaim cache line 109. P1's cache coherency state goes from S to M. Meanwhile, P2 and P3 are still retrying their read A requests until data is intervened to P2. The process then continues with P3 retrying the read A request until data is intervened from P2. P2's cache coherency state goes from S to I then back to S. Likewise, P3's coherency state also goes from S to I then back to S. With large processing groups, the continuing retries of reads on the system bus until lock acquisition occurs and associated coherency state changes in such a serial manner ties up a large amount of processor resources. As described above, with a 32-way SMP, for example, thirty one different lock acquisition processes may be required along with substantial amounts of coherency operations and arbitration for the bus due to multiple retries from each processor attempting to acquire the lock.
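The growth of this serialized traffic with system size can be illustrated by a deliberately simplified count. The model below is an assumption made for illustration and is not a measurement: it supposes that the lock passes to one waiting processor per round and that every processor still waiting has its read retried a fixed number of times (`retries_per_round`, a hypothetical parameter) in each round.

```python
def serial_handoff_cost(n_processors, retries_per_round=1):
    """Count lock acquisitions and retried reads under the toy model above."""
    waiting = n_processors - 1   # every processor except the initial writer
    lock_acquisitions = 0
    retried_reads = 0
    while waiting > 0:
        waiting -= 1             # one waiting processor acquires the lock
        lock_acquisitions += 1
        # Processors that are still waiting are retried during this round.
        retried_reads += waiting * retries_per_round
    return lock_acquisitions, retried_reads
```

Under these assumptions a 4-way system needs three lock acquisitions, while a 32-way system needs thirty one, with the number of retried reads growing roughly with the square of the processor count.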
Also, the amount of time required to complete the process in such a serial manner may result in the earlier processors, e.g., P0, restarting another store operation before all later processors acquire a lock in response to the previous store operation. Thus, processors are held up by the bottleneck of the system bus and the serialized processing, and typically have only the previously coherent data for most of the time. Subsequent requests for the cache line by the other processors must wait until the modified data is provided to that processor's cache in the serial manner described. This has the effect of significantly reducing system performance.
The present invention recognizes that it would be desirable to provide a method and system for implementing dynamic microprocessor system optimizations for data bus operations. A method and system that enables hardware and/or software optimization of processor operations involving super-coherent states for greater data coherency would be a welcomed improvement. These and other benefits are provided by the invention described herein.
Disclosed is a cache coherency protocol and operational characteristics of a multiprocessor data processing system that: (1) reduces the number of coherency operations on the system bus of a multiprocessor data processing system in response to the modification of a cache line; and (2) enables utilization of "super-coherent" cached data by a cache coherent microprocessor. Super-coherent cache data is data that had previously been cache coherent and is no longer coherent, but that the processors are allowed to use in a cache coherent programming manner. The invention permits processors to continue utilizing super-coherent data while another processor is actively modifying the data.
The coherency protocol provides two additional coherency states that indicate specific status of super-coherent cached data following a modification of a corresponding cache line in another processor's cache. The first coherency state, Z1, indicates that the corresponding cache line has been modified in another cache and forces the processor to issue a Z1 read of the cache line to the system bus to determine whether or not data in the modified cache line may be immediately acquired. The second coherency state, Z2, indicates that the data in the cache line is super-coherent with respect to the modified data in the next cache, and informs the processor that the processor should utilize the super-coherent data to complete its processing operations.
Additionally, a set of new snoop responses and supporting logic are provided on the system bus for coherency operations (i.e., Z1 reads snooped on the system bus). The new responses are "use super-coherent (previous) data" and "use coherent (new) data", and are issued in response to a Z1 read request from a processor attempting to acquire the modified cache line but being able to use the previous data if the modified cache line cannot be "quickly" acquired (i.e., without retries).
When one of several possible modifications to the cache line in the first processor is snooped, all other processors sharing that cache line change the coherency state of their cache line to Z1. Subsequently, when the other processors are loading data to the same cache line, a Z1 read is issued on the system bus. When the first processor snoops the Z1 read requests on the system bus, the first processor issues a lock to one of the requesting processors (in the preferred embodiment, the first request that is received is selected), and then signals a "use super-coherent data" response to all the other requests. The selected processor receives a "use new data" (or lock acquired) response and is later given the lock on the modified cache line. The cache line is then acquired by the selected processor when the first processor issues a DClaim of the cache line on the system bus.
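The handling of Z1 reads described above may be sketched as follows, assuming the preferred embodiment in which the first request received is selected. The function names, the dictionary representation of per-processor states, and the response strings are assumptions made for this sketch.

```python
USE_NEW_DATA = "use new data"
USE_SUPER_COHERENT_DATA = "use super-coherent data"


def on_snooped_modification(states):
    """Processors sharing the line move from S to Z1; others are unchanged."""
    return {cpu: ("Z1" if state == "S" else state) for cpu, state in states.items()}


def respond_to_z1_reads(requesters_in_arrival_order):
    """The modifying processor selects the first requester for the lock and
    tells every other requester to keep using the previous data."""
    responses = {}
    for position, cpu in enumerate(requesters_in_arrival_order):
        if position == 0:
            responses[cpu] = USE_NEW_DATA
        else:
            responses[cpu] = USE_SUPER_COHERENT_DATA
    return responses
```

Only one requester per modification is directed to the new data in this sketch, so the remaining requesters have no reason to keep retrying reads on the system bus.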
In one embodiment, the Z1 and Z2 states are maintained within a separate Z1/Z2 directory associated with the main cache directory. The Z1/Z2 directory stores a copy of cache line addresses/address tags for cache lines that are in the Z1 or Z2 state and tracks which of the two states the cache line is in. Processor requests are sent to the Z1/Z2 directory simultaneously with the main directory. Although this only allows a small amount of the main directory to be in the Z1/Z2 states, it provides an easy mechanism to quickly clear the Z1/Z2 cache states. One could implement the Z1/Z2 states in the main cache directory, but whenever all of the Z1/Z2 cache states need to be cleared, significant directory bandwidth may be consumed.
The coherency state of all the other processors that receive the "use super-coherent data" response is set to Z2, and the other processors with the cache line in the Z2 state operate with the super-coherent data until the Z2 state changes. In the preferred embodiment, the other processors continue to utilize the super-coherent data until the processor goes to the system bus to complete an operation and then issues a barrier instruction. When this sequence of events occurs, the coherency state of all the cache lines within the cache that were in the Z1 or Z2 state is automatically changed to reflect the I state. Where the Z1 and Z2 states are stored in a Z1/Z2 directory, this operation is completed as a flush (or invalidate) of all contents of the Z1/Z2 directory.
Monitoring the occurrence of the above sequence of events is made easier by providing a clear_on_barrier_flag (COBF) associated with the Z1/Z2 directory which is set whenever a processor operation is issued to the system bus. Thus, if a barrier instruction is encountered while the COBF is set, the entire Z1/Z2 directory is immediately flushed (or invalidated), and the coherency state of the corresponding cache lines is read from the main directory where they are stored with an I state.
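A minimal sketch of a separate Z1/Z2 directory with a clear_on_barrier_flag follows. The class and method names are hypothetical, the main directory is reduced to a dictionary, and the resetting of the flag after a flush is an assumption of this sketch rather than something stated above.

```python
class Z1Z2Directory:
    """Toy model of a Z1/Z2 directory kept alongside the main cache directory."""

    def __init__(self):
        self.entries = {}          # address tag -> "Z1" or "Z2"
        self.main_directory = {}   # address tag -> state held in the main directory
        self.cobf = False          # clear_on_barrier_flag

    def mark(self, tag, state):
        if state not in ("Z1", "Z2"):
            raise ValueError("only Z1 and Z2 are tracked here")
        self.entries[tag] = state
        self.main_directory[tag] = "I"   # the main directory records the line as I

    def on_bus_operation(self):
        self.cobf = True                 # set whenever an operation goes to the bus

    def on_barrier(self):
        if self.cobf:
            self.entries.clear()         # flush every Z1/Z2 entry at once
            self.cobf = False            # assumption: the flag is reset after a flush

    def state_of(self, tag):
        # A Z1/Z2 entry takes precedence; otherwise the main directory decides.
        return self.entries.get(tag, self.main_directory.get(tag, "I"))
```

A barrier encountered before any bus operation leaves the entries intact, while a barrier following a bus operation makes every tracked line read as I, which is the sequence of events the flag is meant to detect.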
In another embodiment, additional system optimizations are provided, including, for example, read operations with specific directives. Enhanced reads (or read requests) are provided with added bit(s). The bit(s) indicate whether the read may be completed with super-coherent data or only with coherent data if the data is in an I, Z1, or Z2 state. The enhanced read may also be utilized in embodiments without the new cache states, but is preferably utilized with embodiments in which the new cache states are provided. Additionally, a specialized store instruction with additional bits is provided for utilization by a processor with a cache line in the modified state that wishes to release the lock on the cache line to a next processor whose cache line may be in the Z2 state. When the bits are set, issuing the store instruction to the system bus triggers the next processor(s) to change its coherency state from Z2 to Z1. A Z1 read is issued and the Z1 read is provided a lock on the cache line. Notably, the coherency state of the cache line of the issuing processor following the release of the lock store operation goes from M to I (and not S), while the coherency state of the cache line of the requesting processor changes from Z2 to Z1 to M.
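The state sequences that follow the specialized store can be written out as a short trace. The function name and the list representation of the traces are assumptions made for illustration, and the sketch covers only the case in which the release bits are set.

```python
def release_lock_store(issuer_state, next_state):
    """Return (issuer trace, next-processor trace) for a lock-releasing store."""
    if issuer_state != "M":
        raise ValueError("the issuing processor must hold the line in the M state")
    issuer_trace = ["M", "I"]             # M to I, and not M to S
    if next_state == "Z2":
        next_trace = ["Z2", "Z1", "M"]    # Z1 read issued, lock provided
    else:
        next_trace = [next_state]         # other cases are not modeled here
    return issuer_trace, next_trace
```

The issuer ends in the I state rather than S, so in this sketch it retains no copy after handing the line to the next processor.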
In a data processing system having shared caches among processor groups, additional coherency states are utilized to reflect processor specific Z1/Z2 states within the Z1/Z2 directory. Each Z2 state then signals a specific processor to utilize previously coherent data while other processors within the group may still issue Z1 reads out to the system bus. When a next processor sharing the cache desires to access the cache line, the next processor issues a system bus read for that cache line, and if a "use super-coherent data" response is received, then that processor will also be provided a Z2 designation for cache line access and thereafter utilize the super-coherent data. Also, if a lock is acquired by any one of the processors, the subsequent modification of the cache line for that processor forces a group change of the Z1/Z2 cache states to reflect the new state (e.g., M).
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.