The present invention relates generally to data processing and, in particular, to expedited servicing of store operations in a data processing system.
A conventional multiprocessor (MP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Cache memories are commonly utilized to temporarily buffer memory blocks that might be accessed by a processor in order to speed up processing by reducing access latency introduced by having to load needed data and instructions from system memory. In some MP systems, the cache hierarchy includes at least two levels. The level one (L1) or upper-level cache is usually a private cache associated with a particular processor core and cannot be accessed by other cores in an MP system. Lower-level caches (e.g., level two (L2) or level three (L3) caches) may be private to a particular processor core or shared by multiple processor cores.
In conventional MP computer systems, processor-issued store operations typically target only a small portion (e.g., 1 to 16 bytes) of a cache line rather than the entire cache line (e.g., 128 bytes). Consequently, an update to a cache line may include multiple individual store operations to sequential or non-sequential addresses within the cache line. In order to increase the efficiency of store operations, processing units may include a coalescing store queue interposed between a processor core and the cache at which systemwide coherency is determined (e.g., the L2 cache), where the store queue provides byte-addressable storage for a number of cache lines (e.g., 8 to 16 cache lines). To reduce the number of store operations that must be performed in the cache (and potentially broadcast to other processing units), the store queue often implements “store gathering,” which is the combination of multiple store operations into a single store queue entry prior to making an update to the corresponding cache line in the cache.
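By way of illustration only, the store gathering described above can be sketched in software as follows. This is a minimal behavioral model and not a description of any particular hardware implementation. The names StoreQueue, Entry, store, drain_oldest and drain_all are hypothetical, as are the choices to drain the oldest entry when the queue is full and to model the cache as a simple mapping from cache line address to contents. The 128-byte line size and 8-entry capacity are taken from the example values given above.

```python
from dataclasses import dataclass, field

LINE_SIZE = 128  # bytes per cache line (example value from the text)


@dataclass
class Entry:
    """One store queue entry: byte-addressable storage for a single cache line."""
    data: bytearray = field(default_factory=lambda: bytearray(LINE_SIZE))
    valid: list = field(default_factory=lambda: [False] * LINE_SIZE)  # per-byte mask


class StoreQueue:
    """Hypothetical coalescing store queue placed in front of a cache."""

    def __init__(self, capacity=8):
        self.capacity = capacity  # number of cache lines the queue can hold
        self.entries = {}         # line address -> Entry (insertion-ordered)
        self.cache = {}           # line address -> bytearray (stand-in for the cache)
        self.cache_updates = 0    # number of store accesses made to the cache

    def store(self, addr, payload):
        """Gather a small store into the entry for its cache line."""
        offset = addr % LINE_SIZE
        line = addr - offset
        if offset + len(payload) > LINE_SIZE:
            raise ValueError("store crosses a cache line boundary")
        entry = self.entries.get(line)
        if entry is None:
            if len(self.entries) >= self.capacity:
                self.drain_oldest()  # assumed policy: make room by draining
            entry = self.entries[line] = Entry()
        # Merge the new bytes; a later store to the same byte overwrites it.
        entry.data[offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            entry.valid[i] = True

    def drain_oldest(self):
        """Apply the oldest entry to the cache as a single update."""
        line = next(iter(self.entries))
        entry = self.entries.pop(line)
        cached = self.cache.setdefault(line, bytearray(LINE_SIZE))
        for i, is_valid in enumerate(entry.valid):
            if is_valid:
                cached[i] = entry.data[i]
        self.cache_updates += 1

    def drain_all(self):
        """Drain every pending entry, one cache update per entry."""
        while self.entries:
            self.drain_oldest()


if __name__ == "__main__":
    sq = StoreQueue()
    sq.store(0x1000, b"\x01\x02\x03\x04")  # 4-byte store
    sq.store(0x1010, b"\xff" * 8)          # non-sequential store, same line
    sq.store(0x1002, b"\xaa")              # overwrites one gathered byte
    sq.drain_all()
    print(sq.cache_updates)                # three stores, one cache update
```

In this sketch, the three stores to the same cache line are combined into one store queue entry and reach the cache as a single update. Note that none of the gathered stores becomes visible in the modeled cache until the entry is drained.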
While store gathering is generally beneficial in reducing the overall number of store accesses to the cache, the present disclosure recognizes that conventional store gathering within the store queue necessarily delays the store accesses requested by some store operations until store gathering of the corresponding store queue entries completes. The present disclosure further recognizes that, in some cases, this delay in servicing store accesses can negatively impact the performance of other instructions and/or threads.