1. Technical Field
The present invention relates in general to data processing and, in particular, to processors, methods and data processing systems having improved data access. Still more particularly, the present invention is related to processors, methods and data processing systems having improved store performance through implementation of a variable store gather window.
2. Description of the Related Art
Modern data processing systems typically employ multi-level volatile memory hierarchies to provide data storage. Such memory hierarchies often include one or more levels of low latency cache memory integrated within an integrated circuit together with one or more processor cores. The memory hierarchy may also contain one or more lower levels of external cache memory or system memory. For example, in some designs, one or more processor cores containing private level one (L1) instruction and data caches may share an on-chip L2 cache and be further supported by an off-chip L3 cache, as well as system memory (e.g., Dynamic Random Access Memory (DRAM)).
In data processing systems with on-chip caches, individual processor-issued store operations typically target only a small portion of a line of off-chip cache or system memory (e.g., 1 to 16 bytes of a 128-byte cache line). Updates to lines of lower level memory are therefore typically completed by a series of these individual store operations, which may occur sequentially.
In order to increase store performance, conventional processor chips are often equipped with a store queue containing byte-addressable storage for a line of lower level memory. Many store queues support so-called “store gathering” in which multiple store operations are collected within a particular queue entry before the line is transmitted to lower level cache or memory for storage. The gathering of multiple store operations in this manner is generally believed to advantageously reduce the number of store queue entries required to handle a given number of store operations, and to improve store performance by reducing the number of higher latency accesses to lower level memory.
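By way of illustration only, the store gathering described above can be sketched in software. The following is a minimal model, not a description of any particular hardware: the names StoreQueueEntry, gather and count_line_writes are hypothetical, the 128-byte line size is taken from the example above, and the model assumes that no store crosses a line boundary and that only one queue entry is open at a time.

```python
LINE_SIZE = 128  # bytes per line, matching the 128-byte example in the text


class StoreQueueEntry:
    """Toy model of one store queue entry holding a single line (hypothetical)."""

    def __init__(self, line_addr):
        self.line_addr = line_addr        # line-aligned base address
        self.data = bytearray(LINE_SIZE)  # byte-addressable storage for the line
        self.valid = [False] * LINE_SIZE  # marks which bytes have been written

    def gather(self, addr, payload):
        """Merge a store into this entry; return False if it targets another line."""
        offset = addr - self.line_addr
        if offset < 0 or offset + len(payload) > LINE_SIZE:
            return False
        self.data[offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            self.valid[i] = True
        return True


def count_line_writes(stores, gathering):
    """Count writes sent to lower level memory for a list of (addr, bytes) stores.

    Assumes stores never cross a line boundary and only one entry is open at once.
    """
    if not gathering:
        return len(stores)  # every store is sent down individually
    writes = 0
    entry = None
    for addr, payload in stores:
        if entry is None or not entry.gather(addr, payload):
            if entry is not None:
                writes += 1  # dispatch the previously gathered line
            entry = StoreQueueEntry(addr - addr % LINE_SIZE)
            entry.gather(addr, payload)
    if entry is not None:
        writes += 1  # dispatch the last open entry
    return writes


# Eight sequential 16-byte stores that together fill one 128-byte line.
stores = [(0x1000 + 16 * i, bytes(16)) for i in range(8)]
```

In this sketch, the eight 16-byte stores result in eight separate writes to lower level memory without gathering, but in a single line write when they are gathered within one entry.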
The present invention recognizes that conventional implementations of store gathering do not provide uniform improvement in store performance for all workloads. For example, technical workloads with multiple streams of store operations, exemplified by benchmarks such as TRIAD, provide better performance when the time permitted for store operations to be gathered within a particular store queue entry (defined herein as a store gathering window) is relatively long. By contrast, commercial workloads, exemplified by the TPC-C benchmark, achieve better store performance with shorter store gathering windows. Consequently, conventional data processing systems in which the store gathering window is fixed for the life of the machine cannot offer optimal store performance for different types of workloads.
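The effect of a fixed store gathering window can be illustrated with a simple timing model. The sketch below is an assumption-laden toy, not a model of TRIAD, TPC-C or any real processor: the function name fixed_window_stats is hypothetical, time is counted in abstract cycles, an entry is assumed to open on the first store to a line and to be held for the full window before being dispatched, and queue capacity is ignored.

```python
def fixed_window_stats(store_events, window):
    """Return (line_dispatches, entry_cycles_held) for a fixed gathering window.

    store_events: list of (cycle, line_addr) tuples sorted by cycle.
    Assumption: an entry opens on the first store to a line, gathers stores to
    that line for `window` cycles, and is held for the whole window.
    """
    open_until = {}  # line_addr -> last cycle at which its entry still gathers
    dispatches = 0
    for cycle, line in store_events:
        if line in open_until and cycle <= open_until[line]:
            continue  # store is gathered into the entry that is still open
        open_until[line] = cycle + window  # window expired or no entry: open one
        dispatches += 1  # each opened entry is eventually dispatched once
    return dispatches, dispatches * window


# A single stream: eight stores to the same line, one every 4 cycles.
stream = [(4 * i, 0) for i in range(8)]

# Scattered stores: three stores to three different lines, far apart in time.
scattered = [(0, 0), (100, 1), (200, 2)]
```

Under these assumptions, the stream produces three line dispatches with an 8-cycle window but only one with a 32-cycle window, so the longer window gathers more. For the scattered stores, no gathering occurs with either window, yet the 32-cycle window holds entries for 96 cycles in total instead of 24. The model shows only this trade-off; it does not attempt to reproduce the behavior of the benchmarks named above.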