1. Technical Field
The present invention relates generally to data processing systems and specifically to processor-cache operations within data processing systems. Still more particularly, the present invention relates to improved gathering of processor-cache store operations to an entry of a store queue.
2. Description of the Related Art
Increasing efficiency of data operations at the processor-cache level is an important aspect of processor chip development. Modern microprocessors typically contain entire storage hierarchies (caches) integrated onto a single integrated circuit. For example, one or more processor cores containing L1 instruction and/or data caches are often combined with a shared on-chip L2 cache. In some designs, the directory portion of an L3 cache is also integrated on-chip, with the data portion of the L3 cache residing in a separate external chip.
In systems with on-chip caches, processor-issued data store operations typically target only a small portion (e.g., 1 byte to 16 bytes) of a cache line rather than the entire cache line (typically 128 bytes). For example, a processor-issued store operation may update only a single byte granule of a 128-byte cache line, and cache line updates are completed via a combination of these individual store operations, which may occur sequentially. In order to increase efficiency, processor chips are often designed with a “store queue” that is typically placed between a processor core and the L2 cache. A store queue typically contains byte-addressable storage for a number of cache lines (usually 8 to 16 cache lines).
FIG. 2 illustrates a prior art representation of specific hardware and logic components of a processor chip that are utilized to complete data store operations. As illustrated, processor chip 201 includes a processor core 203, store queue 207 with store queue (STQ) controller 205, and read claim (RC) dispatch logic 219. STQ controller 205 includes arbitration logic 206 utilized for selecting entries from the store queue 207, as described below. RC dispatch logic 219 supports a series of RC machines 221, which complete the actual data store operations at the lower-level cache (not shown).
The store queue 207 provides several rows (entries) for temporarily storing and tracking processor-issued stores. Each row is divided into several columns that provide byte enable register 209, address register 211, data register 213, control bits 215, and valid bit 217. Data register 213 and address register 211 store data issued from the processor core 203 and the corresponding data (memory) address, respectively. Byte enable register 209 includes a number of bookkeeping bits. Conventionally, the number of bits corresponds to the number of individually addressable storage granules within a cache line. Thus, for example, for a 128-byte cache line entry and byte store operations, byte enable register 209 maintains 128 bits for tracking single-byte processor-issued stores. This enables tracking of specific bytes (or groups of bytes) within a 128-byte cache line entry that is being updated by the processor.
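For illustration only, the row layout described above can be modeled in software as follows. This is a minimal Python sketch and not the hardware of FIG. 2: the class name StoreQueueEntry, its field names, and the convention that bit i of the byte enable value tracks byte i of the cache line are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List

CACHE_LINE_BYTES = 128  # line size taken from the example in the text


@dataclass
class StoreQueueEntry:
    """Software model of one store queue row (all names are hypothetical)."""
    address: int = 0      # models address register 211 (cache line address)
    data: bytearray = field(
        default_factory=lambda: bytearray(CACHE_LINE_BYTES))  # data register 213
    byte_enable: int = 0  # models byte enable register 209, one bit per byte
    control: int = 0      # models control bits 215 (contents not specified)
    valid: bool = False   # models valid bit 217

    def mark_bytes(self, offset: int, length: int) -> None:
        """Set the byte enable bits for bytes [offset, offset + length)."""
        if offset < 0 or length < 1 or offset + length > CACHE_LINE_BYTES:
            raise ValueError("store must fall within one cache line")
        self.byte_enable |= ((1 << length) - 1) << offset

    def updated_bytes(self) -> List[int]:
        """Return the offsets of the bytes whose enable bit is set."""
        return [i for i in range(CACHE_LINE_BYTES)
                if (self.byte_enable >> i) & 1]
```

Under these assumptions, a single-byte store to byte 12 sets exactly one of the 128 bookkeeping bits, and a 4-byte store starting at byte 16 sets four adjacent bits.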
Valid bit 217 indicates to STQ controller 205 when data within a particular row of the store queue 207 is valid, and valid bit 217 is checked before arbitration logic 206 selects a row of data (or an entry) to forward to RC dispatch logic 219. Once a valid bit is set, arbitration logic 206 is able to select the entry regardless of whether additional stores to that cache line are being sent by the processor core. Control bits 215 represent an assortment of additional bits that are utilized by STQ controller 205. The functionality of several of the above-described columns is referenced within the description of the data store operations below.
Store operations typically originate at the processor core 203 and are temporarily stored in an entry of the store queue 207. The store operations target a particular cache line (or portion of the cache line) identified by the address within the store operation, and the operation also provides data to be stored within the addressed portion of that cache line (e.g., byte 12).
The store operations update particular bytes within the cache line entry. Concurrent with these data updates, corresponding bits within the byte enable register are set to track which bytes within the cache line entry have been updated by store operations. Typically, a series of store operations writing to the same entry in the store queue are absorbed by the store queue entry before the line is dispatched to the L2 cache. This absorption of multiple store operations into a single entry is referred to as “gathering” stores, since multiple different stores addressing the same cache line are “gathered” into an entry of the store queue buffer before the line is presented to the L2 cache for storage.
FIG. 3A illustrates a process by which store operations issued by a processor are assigned to an entry within the store queue. The process begins at step 301 and proceeds to step 302, at which a determination is made whether there is an available entry within the store queue to assign to a next store operation. When all entries of the store queue have been assigned (i.e., there is no available entry to assign to a new store operation and no gatherable entry exists for that store operation), the processor core suspends issuance of new store operations to the queue until an entry becomes available, as indicated at step 303.
An entry becomes available when the contents of that entry are dispatched to an RC machine. That is, an entry becomes available when an older cache line entry is removed from the store queue and sent to the L2 cache for storage therein. A variety of different policies (some described below) may be utilized to determine when cache lines are moved from the store queue to be stored in the L2 cache. In conventional systems, a tracking mechanism is provided within the core and/or the store queue to track when there are available entries to assign to store operations being issued by the core. The core is thus able to suspend issuance of store operations when those operations cannot be buffered within the store queue.
Returning to decision step 302, when there is an available entry, the processor core issues a store operation to the store queue as shown at step 304. The store operation is received at the store queue, and a determination is made at step 305 whether a previously existing entry (for the same cache line address) is currently available for gathering the store operation. If, at step 305, there is no existing entry available to gather the store operation, a new entry is allocated to the store operation, as shown at step 307. However, when there is an existing entry that is gatherable, the entry is updated with the data of the store operation as shown at step 309.
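The assignment process of FIG. 3A can be sketched in software as follows. This is a simplified Python model under stated assumptions: the class StoreQueue, the method issue_store, the dictionary keys, and the returned strings are hypothetical names, and the sketch checks for a gatherable entry before checking for a free entry so that a store stalls only when neither exists.

```python
LINE = 128  # bytes per cache line (value from the example in the text)


class StoreQueue:
    """Toy model of store assignment; all names are hypothetical."""

    def __init__(self, num_entries=8):
        # None marks a free entry; an occupied entry is a dict.
        self.entries = [None] * num_entries

    def _find_gatherable(self, line_addr):
        for entry in self.entries:
            if (entry is not None and entry["line"] == line_addr
                    and entry["gatherable"]):
                return entry
        return None

    def issue_store(self, addr, payload):
        """Return 'gathered', 'allocated', or 'stalled' for one store."""
        line_addr, offset = divmod(addr, LINE)
        if len(payload) < 1 or offset + len(payload) > LINE:
            raise ValueError("store must fall within one cache line")
        entry = self._find_gatherable(line_addr)      # step 305
        if entry is not None:
            result = "gathered"                       # step 309
        elif None in self.entries:                    # step 302
            entry = {"line": line_addr, "data": bytearray(LINE),
                     "be": 0, "valid": True, "gatherable": True}
            self.entries[self.entries.index(None)] = entry
            result = "allocated"                      # step 307
        else:
            return "stalled"                          # step 303
        # Write the store data and set the matching byte enable bits.
        entry["data"][offset:offset + len(payload)] = payload
        entry["be"] |= ((1 << len(payload)) - 1) << offset
        return result
```

With a one-entry queue, a store to byte 12 allocates the entry, a following store to bytes 13 and 14 of the same line is gathered into it, and a store to a different line stalls until the entry is dispatched.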
An existing entry is usually available for gathering when the entry holding previously issued store operation(s) for the same cache line address has not yet been selected for dispatch to an RC machine. In conventional implementations, once an entry in the store queue has been assigned to a target cache line, subsequent stores targeting that cache line are gathered within that entry until a condition occurs that prevents further gathering of store operations to that entry. The STQ controller 205 controls when stores to a cache line are allowed to gather. For example, the STQ controller may prevent further gathering of stores to an entry when the entry is selected for dispatch. Also, gathering is typically stopped when a barrier operation is encountered, as is known to those skilled in the art.
Gathering stores is more efficient than individually storing single bytes within the L2 cache. This is because the RC machine's updating of a cache line with data from a store queue entry takes more cycles than the number of cycles required for updating the store queue entry with a new store operation. Also, each store operation at the L2 cache requires the RC machine to retrieve the entire cache line even when the store queue entry includes only a single store operation.
When a cache line entry is removed from the store queue to be sent to the L2 cache, the cache line entry is sent to the RC dispatch logic and assigned to an RC state machine, which updates the cache line of the L2 cache with the data from within the entry. Thus, for every RC machine assigned to a store operation, the entire cache line must be read and manipulated regardless of how many bytes of the cache line are actually being updated. It is thus more efficient to absorb multiple stores in the store queue entry before passing the line on to the L2 cache. Gathering stores also reduces the number of RC machine tenures required for store operations to a single cache line and also reduces the time required to update a cache line when multiple processor-issued stores update individual portions of the same cache line.
FIGS. 3B and 3C provide flow charts of the processes involved in completing a store operation from the store queue. FIG. 3B illustrates the general process for selecting an entry at the store queue for dispatch. The process begins at step 321 and proceeds to step 322, at which the STQ controller scans the valid bits of the entries to see which entries are eligible for dispatch. A determination is made at step 323 whether there are valid entries eligible for selection by the arbitration logic. When there are valid entries (i.e., entries with their valid bit 217 set/high), the arbitration logic selects an entry for dispatch from among all eligible entries and forwards the selected entry to the RC machine, as shown at step 324.
In determining which entry to select for dispatch, the arbitration logic looks at all the valid entries in the queue and determines, based on a set of architectural rules, which entries are eligible to be processed by the RC machine. For instance, an entry containing more recent stores cannot be processed before the entry with older stores to the same address, nor can store operations bypass barrier operations. The arbitration logic selects one of the eligible (valid) entries and notifies the RC dispatch logic that the entry is available for dispatch to an RC machine. Conventional selection typically uses a round-robin scheme among eligible entries.
Returning to FIG. 3B, a determination is made at step 325 whether the dispatch was successful. If the RC dispatch logic 219 accepts the request, the gathering of stores to that entry is stopped and the data within the entry is removed from the store queue, as depicted at step 327. The RC dispatch logic 219 assigns the store to one of the RC machines 221 to complete the cache line update. For a successful dispatch, the valid bit, byte enable (BE) register, and other registers of the dispatched entry are reset, and the entry is made available for gathering a new set of store operations. If the RC dispatch logic 219 rejects the request, the process loops back to selecting a valid entry to send to the RC dispatch logic 219: the arbitration logic selects another eligible entry, if one is available, or tries again with the same entry if there are no others.
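The selection and dispatch loop of FIG. 3B can be approximated by the following Python sketch. It is a simplified model under stated assumptions: the function select_and_dispatch, the rc_accepts callback, and the dictionary keys are hypothetical, the scan is a plain round-robin starting at a given index, and the architectural eligibility rules (ordering of stores to the same address, barrier operations) are omitted.

```python
def select_and_dispatch(entries, start, rc_accepts):
    """Scan entries round-robin from `start` and dispatch one valid entry.

    entries    -- list of dicts with at least 'valid' and 'byte_enable' keys
    rc_accepts -- hypothetical callback modeling RC dispatch logic; it
                  returns True when the request is accepted
    Returns (index, copy_of_dispatched_entry, next_start), or
    (None, None, start) when no valid entry was accepted on this pass.
    """
    n = len(entries)
    for step in range(n):
        i = (start + step) % n
        entry = entries[i]
        if not entry["valid"]:          # steps 322/323: skip invalid rows
            continue
        if rc_accepts(entry):           # steps 324/325: request dispatch
            dispatched = dict(entry)    # contents handed to an RC machine
            # Step 327: reset the row so it can gather a new set of stores.
            entry.update(valid=False, byte_enable=0)
            return i, dispatched, (i + 1) % n
        # Rejected: fall through and try the next eligible entry.
    return None, None, start
```

A caller would invoke the function again on a later cycle when every valid entry was rejected, which corresponds to retrying with the same entry when no others are available.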
The RC machine 221 goes through several steps to update the L2 cache with the new store data. These steps are illustrated within the flow chart of FIG. 3C, which begins at step 331. The RC machine first determines at step 333 if a cache hit occurs, i.e., if the cache line is present in the L2 cache. If the line is not present in the cache, the RC machine places an address operation (with data request) on the system interconnect/bus that requests a copy of the cache line and write-permission for the cache line, as shown at step 343.
A determination is made at step 345 whether the request for write permission was successful. If the request was not successful, the request is reissued until the L2 cache is granted the necessary write permission. Notably, obtaining write permission when a miss occurs at the cache (i.e., the cache line is not present within the cache) requires a data operation to obtain a copy of the latest version of the cache line data. Also, in most instances, the coherency status of the other caches is updated/changed to indicate that the present L2 cache has current write permission.
Returning now to step 333, when the cache line is initially present within the L2 cache (i.e., a cache hit), a determination is made at step 335 whether there is permission to write to the cache line within the L2 cache. This check is required since the cache may not have permission to perform updates to the cache line, although the request by the RC machine hits within the cache. When write permission is absent, the RC machine issues an address-only operation on the bus to gain the write permission, as depicted at step 337.
A determination is made at step 339 whether the request for write permission was successful. If the request for write permission was not successful, a next determination is made at step 341 whether the line is still present within the L2 cache. In some instances (e.g., when the cache line request hits in the L2 cache but becomes stale before write permission can be obtained), a fetch of the data is required. When the line is still present in the L2 cache, the address-only write operation is retried. However, if the line is no longer present within the cache, the process shifts to step 343 which issues a request for both a copy of the line as well as write permission to the line.
All of the foregoing operations provide a copy of the targeted cache line within the L2 cache and provide the RC machine 221 with the necessary permissions to update the line. In general, the RC machine 221 only has to retrieve a copy of the data for the line when the line is not initially present within the cache. However, as described above, there are some cases where the cache line's data is updated by some other participant (processor, etc.) before the RC machine obtains write permission, and the RC machine must request a copy of the newly updated data from the other cache (or memory). When the cache line is present in the cache with sufficient write permission to immediately update the cache line data (i.e., a cache line hit with write permission, from step 335), no request for data and write permission is required to be issued to the system bus.
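The permission-acquisition portion of FIG. 3C (steps 333 through 345) can be sketched as a retry loop. The Python model below is illustrative only: the function acquire_line, the state names "shared" and "writable", the operation names "address_only" and "data_request", the bus callback, and the max_attempts guard are all assumptions introduced here and do not correspond to any particular coherency protocol.

```python
def acquire_line(cache, line_addr, bus, max_attempts=16):
    """Issue bus operations until the line is present with write permission.

    cache -- dict mapping line address to "shared" (hit without write
             permission) or "writable"; a missing key models a cache miss
    bus   -- hypothetical callback bus(kind, line_addr, cache) returning
             True when the request succeeds; it may also remove the line
             from `cache` to model another participant claiming it
    Returns the list of operations issued, in order.
    """
    ops = []
    for _ in range(max_attempts):
        state = cache.get(line_addr)
        if state == "writable":          # step 335: hit with permission
            return ops
        if state is None:                # steps 333/341: line not present
            kind = "data_request"        # step 343: copy plus permission
        else:
            kind = "address_only"        # step 337: permission only
        ops.append(kind)
        if bus(kind, line_addr, cache):  # steps 339/345: success check
            cache[line_addr] = "writable"
    if cache.get(line_addr) == "writable":
        return ops
    raise RuntimeError("write permission not obtained")
```

In this model, a line that hits without write permission and is then lost to another participant produces an address-only operation followed by a data request, while a hit with write permission produces no bus operations at all.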
Returning to FIG. 3C, once the write permission is obtained and a current copy of the data is present to complete the updates, the RC machine retrieves the entire cache line from the cache and updates the portions of the line that are identified by the byte enable bits as having been updated within the store queue entry, as indicated at step 347. The process then ends as depicted at step 349.
Conventionally, the byte enable bits are utilized to select individual byte multiplexers (MUXes) for each byte of the cache line. Each MUX selects either the old cache line data or the new data from the store queue entry based on whether the corresponding byte enable bit is set. Finally, the updated cache line data is written back into the L2 cache with the data updates from the store queue and the RC machine is retired.
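The per-byte selection performed by the MUXes can be expressed in software as a byte-wise merge. The following Python sketch is illustrative only: the function name merge_line and the convention that bit i of byte_enable controls byte i of the line are assumptions made for the example.

```python
def merge_line(old_line, new_data, byte_enable):
    """Return the merged line: new byte where its enable bit is set,
    old byte otherwise (one software 'MUX' per byte position)."""
    if len(old_line) != len(new_data):
        raise ValueError("old line and store queue data must be equal length")
    return bytes(
        new if (byte_enable >> i) & 1 else old
        for i, (old, new) in enumerate(zip(old_line, new_data))
    )
```

For example, with an 8-byte line for brevity, a byte enable value of 0b101 replaces bytes 0 and 2 with the store queue data and leaves the remaining bytes unchanged.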
Current implementation of the gathering and entry selection processes within conventional store queues includes a number of built-in inefficiencies. For example, the arbitration logic that selects which entries to forward to the RC dispatch does not presently account for whether or not the entry being selected has gathered a full set of store operations to complete an update of the entire cache line. There is no consideration of whether the valid entries have had sufficient time to gather subsequent stores to the same store queue entry. The present operation of the store queue does not include any mechanism for allowing a sequence of stores targeting a same entry to gather before the entry is selected for dispatch. It is thus common for the arbitration logic to select a “valid” entry for dispatch while there are still incoming stores that are gatherable. Once an entry is dispatched to the RC machine, the STQ controller prevents any further gathering of stores to that entry, and the store queue is forced to allocate a second entry to the subsequent store operations updating the same cache line.
Processing of certain types of code, such as scientific code, typically yields sequential streams of stores which modify an entire cache line. Removing a partially full entry, while there are processor-issued stores still targeting that entry, yields inefficiencies at the gathering and cache-updating stage. Thus, whereas a single entry could complete a gather of all the stores to a cache line, two or more entries may be required to gather stores targeting portions of the same cache line. These multiple entries also require multiple RC machine tenures to complete the update of the L2 cache and result in longer latencies when completing updates to a single cache line.
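The cost of dispatching a partially gathered entry can be illustrated with a toy count of RC machine tenures. The Python sketch below is a simplified model and not a measurement: the function count_rc_tenures and the parameter dispatch_after are hypothetical, and the model assumes that every store in the stream targets the same cache line and that the entry is dispatched after a fixed number of gathered stores.

```python
def count_rc_tenures(num_stores, dispatch_after):
    """Count the entries (and thus RC machine tenures) consumed when an
    entry is dispatched after gathering `dispatch_after` stores, with any
    remaining stores dispatched in a final partially filled entry."""
    if num_stores < 0 or dispatch_after < 1:
        raise ValueError("invalid stream length or dispatch threshold")
    tenures = 0
    gathered = 0
    for _ in range(num_stores):
        gathered += 1
        if gathered == dispatch_after:  # entry selected for dispatch
            tenures += 1
            gathered = 0                # later stores need a new entry
    if gathered:
        tenures += 1                    # flush the final partial entry
    return tenures
```

Under these assumptions, sixteen 8-byte stores that together cover a 128-byte line need one tenure when all sixteen are gathered, four tenures when the entry is dispatched after every fourth store, and sixteen tenures when no gathering occurs.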
The present invention recognizes the benefits of providing a more efficient gathering of stores to a single entry of a store queue. The invention further recognizes that it would be desirable to provide a method and system for gathering stores to an entry that substantially increases the likelihood of gathering a full set of stores for updating an entire cache line before the entry is selected for dispatch. A method and system that reduces the number of RC machine tenures when completing processor updates to the lower-level cache would be a welcome improvement. These and other benefits are provided by the invention described herein.