1. Technical Field
The present invention relates in general to load operations by processors in a multiprocessor system and in particular to load operations which utilize data received prior to a coherency response window.
2. Description of the Related Art
Most contemporary high-performance data processing system architectures include multiple levels of cache memory within the storage hierarchy. Caches are employed in data processing systems to provide faster access to frequently used data than the access times associated with system memory allow, thereby improving overall performance. Cache levels are typically employed in progressively larger sizes, with a trade-off of progressively longer access latencies. Smaller, faster caches are employed at levels within the storage hierarchy closer to the processor or processors, while larger, slower caches are employed at levels closer to system memory.
In multiprocessor systems, bus operations initiated by one processing segment—a processor and any in-line caches between the processor and the shared bus—are typically snooped by other processing segments for coherency purposes, to preserve data integrity within cache segments shared among the various processing segments. In such systems, a processor initiating a load operation may be required to wait for a coherency response window—for responses from other devices snooping the load operation—to validate data received in response to the load request.
In known processors, such as the PowerPC™ 620 and 630FP available from International Business Machines Corporation of Armonk, N.Y., the coherency response window is programmable from two bus cycles to as many as sixteen bus cycles. A timing diagram for a processor employing an eight-cycle coherency response window, or Time Latency to Address Response (TLAR), is depicted in FIG. 4. Such larger TLARs may be required for slow bus devices or bridges which need to obtain information from downstream.
Processors utilizing a snoopy bus may receive data before the coherency response window and hold the data, for example, in a buffered read queue or a bus interface unit, until the coherency response window. However, the processor cannot use the buffered data before then, because the data may be invalidated in the coherency response window. Thus, the processor load operation is limited by the latency associated with the coherency response window. Processors receiving data concurrently with the coherency response window, on the other hand, eliminate the buffering but still incur the latency associated with the coherency response window.
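The latency limit described above can be illustrated with a minimal sketch. It is a simplified timing model, not a description of any actual processor: the function names are hypothetical, cycles are counted from the start of the load's address tenure, and the data is assumed to become usable in the same cycle as the coherency response window. Only the two-to-sixteen-cycle TLAR range is taken from the text.

```python
def earliest_use_cycle(data_arrival_cycle: int, tlar_cycles: int) -> int:
    """Return the earliest bus cycle at which loaded data may be used.

    Assumed model: data arriving before the coherency response window is
    held in a buffer and released at the window; data arriving at or after
    the window is usable on arrival.
    """
    if not 2 <= tlar_cycles <= 16:
        # Programmable range quoted in the text for the cited processors.
        raise ValueError("TLAR must be between 2 and 16 bus cycles")
    if data_arrival_cycle < 0:
        raise ValueError("data arrival cycle must be non-negative")
    return max(data_arrival_cycle, tlar_cycles)


def buffered_cycles(data_arrival_cycle: int, tlar_cycles: int) -> int:
    """Return how many cycles the data waits in the buffer before use."""
    return earliest_use_cycle(data_arrival_cycle, tlar_cycles) - data_arrival_cycle


if __name__ == "__main__":
    # Illustrative values only: an eight-cycle TLAR with early, concurrent,
    # and late data arrival.
    for arrival in (3, 8, 20):
        print(arrival, earliest_use_cycle(arrival, 8), buffered_cycles(arrival, 8))
```

In this model, early data gains nothing from arriving early: whether the data arrives before or concurrently with the window, the load completes no sooner than the TLAR.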
Where only one or two cache levels are implemented in a data processing system, the latency associated with a coherency response window for a load operation may be acceptable, since a longer latency may be required to source the requested data from system memory or a bridge device. The frequency of occasions when an L2 cache hits but the processor must wait for the coherency response window may, as a result of the L2 cache's small size, be too low to be a significant performance concern. Where additional cache levels, such as an L3 cache, are implemented, however, circumstances may change. A larger L3 cache should result in more cache hits, in which case requested data could be sent to the processor prior to the coherency response window. However, current architectures do not permit the data to be utilized by the processor prior to the TLAR.
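The effect of hit frequency on the cost of waiting can be sketched as follows. This is a rough illustration under stated assumptions, not a measurement: the function name, the hit fractions, and the arrival cycle are hypothetical values chosen only to show the trend, and loads that miss the cache are assumed to receive data no earlier than the coherency response window.

```python
def average_stall_cycles(hit_fraction: float,
                         hit_arrival_cycle: int,
                         tlar_cycles: int) -> float:
    """Average cycles per load spent waiting on the response window.

    Assumed model: only cache hits deliver data before the window, and each
    such hit stalls for the gap between data arrival and the window.
    """
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must lie in [0, 1]")
    stall_per_hit = max(0, tlar_cycles - hit_arrival_cycle)
    return hit_fraction * stall_per_hit


if __name__ == "__main__":
    # Hypothetical numbers: data from a hit arrives at cycle 3, TLAR is 8.
    small_cache = average_stall_cycles(0.10, 3, 8)  # infrequent hits
    large_cache = average_stall_cycles(0.60, 3, 8)  # frequent hits
    print(small_cache, large_cache)
```

Under these assumptions the stall per hit is unchanged, but the average cost per load grows in proportion to the hit fraction, which is why the wait matters more once a larger cache level is added.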
It would be desirable, therefore, to provide a mechanism allowing data received by a processor to be used by the requesting processor prior to the coherency response window.