1. Technical Field
The present invention relates generally to data processing systems and specifically to processor-to-cache updates within data processing systems. Still more particularly, the present invention relates to scheduling dispatch of store conditional operations utilized to complete processor-to-cache updates.
2. Description of the Related Art
Increasing efficiency of data operations at the processor-cache level is an important aspect of processor chip development. Modern microprocessors typically contain entire storage hierarchies (caches) integrated onto a single integrated circuit. For example, one or more processor cores containing level 1 (L1) instruction and/or data caches are often combined with an on-chip L2 cache. The L1 cache is typically a store-through cache and the L2 cache provides a coherent view of the memory hierarchy.
In a multiprocessor computer system (MP), the individual processors often need to write to certain shared memory locations of the MP in a synchronized fashion. Traditionally, this synchronization has been achieved by the processor altering the memory location utilizing an atomic “read-modify-write” operation. These operations read, modify, and then write the specific memory location in an atomic fashion. Examples of such operations are the well known “compare-and-swap” operation and the “test-and-set” operation.
In more recent MP systems, it has become difficult to ensure atomicity within a single operation. Therefore, in some conventional processors, atomicity is instead effected using a pair of instructions, referred to herein as LOAD_LOCKED (LARX) and STORE_CONDITIONAL (STCX) instructions. These instructions are used in sequence.
LARX and STCX instructions, while not atomic primitives in themselves, effect an atomic read-modify-write of memory by monitoring for any possible changes to the location in question between the LARX and STCX instructions. In effect, the STCX operation succeeds only when execution of the LARX and STCX instructions produces an atomic read-modify-write update of memory. Those skilled in the art are familiar with the processing of LARX and STCX operations to effect atomic updates of memory. The following thus provides only a brief overview of the process.
The processing of a LARX/STCX instruction pair begins with the thread of execution issuing a LARX instruction. A LARX instruction is a special form of a load instruction that returns load data for the location requested and further instructs the memory coherence mechanism in the MP to monitor for writes that could potentially alter the read memory locations. In particular, the memory coherence mechanism will typically monitor for any write operations to the cache line containing the memory location or locations returned by the LARX instruction. The monitored region of memory is referred to as the “reservation granule” and typically, but not always, corresponds to the size of a cache line. Furthermore, a LARX instruction also ensures that the data loaded is not stale (i.e., the value loaded is the most recent globally visible value for the location). If the value is stale, the subsequent STCX instruction will fail.
Once data is returned from a LARX instruction, the thread of execution typically, but not always, modifies the returned data within the registers of the processor core utilizing some sequence of arithmetic, test, and branch instructions corresponding to the particular type of atomic update desired (e.g., fetch-and-increment, fetch-and-decrement, compare-and-swap, etc.).
Next, the thread of execution typically issues a STCX instruction to attempt to store the modified value to the location in question. The STCX instruction will succeed only if (1) the coherence mechanism has not detected any write operations to the reservation granule between the LARX operation and the STCX operation and (2) the LARX operation initially returned a non-stale value for the location. If both of these conditions are met, the STCX instruction updates memory and a signal/message is returned to the processor core indicating the STCX was successful. If the STCX is not successful, a signal is returned to the processor core indicating the STCX failed and memory is not updated.
The thread of execution is usually stalled at the STCX instruction until the “pass” or “fail” indication for the STCX instruction is returned. Even in those cores that can execute instructions beyond a STCX that is waiting for its pass or fail indication, it is usually not possible to execute another LARX and STCX sequence because the coherence mechanism usually cannot easily provide tracking for more than one address per thread of execution at a time. Finally, the thread of execution typically examines the pass (or fail) indication for the STCX instruction and repeats the sequence of steps if the STCX operation failed.
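The LARX/STCX sequence described above can be sketched in software. The following is a minimal illustrative model only, not the hardware mechanism: the class ReservationMemory, the function fetch_and_increment, the 128-byte granule size, and the interfere callback are all hypothetical names and assumptions introduced here. The model tracks one reservation per thread, cancels a reservation whenever any write lands in the reserved granule, and omits the stale-value condition for brevity.

```python
class ReservationMemory:
    """Toy model of memory with one reservation per thread (hypothetical)."""

    GRANULE = 128  # assumed reservation granule size in bytes

    def __init__(self):
        self.data = {}          # address -> value
        self.reservations = {}  # thread id -> reserved granule index

    def _granule(self, addr):
        return addr // self.GRANULE

    def larx(self, tid, addr):
        # Load the value and start monitoring the granule for this thread.
        self.reservations[tid] = self._granule(addr)
        return self.data.get(addr, 0)

    def store(self, addr, value):
        # Any write cancels every reservation held on the same granule.
        g = self._granule(addr)
        self.reservations = {
            t: r for t, r in self.reservations.items() if r != g
        }
        self.data[addr] = value

    def stcx(self, tid, addr, value):
        # Pass only if this thread's reservation is still intact.
        if self.reservations.get(tid) != self._granule(addr):
            return False  # fail indication: memory is not updated
        self.store(addr, value)  # also clears the reservation just used
        return True  # pass indication


def fetch_and_increment(mem, tid, addr, interfere=None):
    """Repeat LARX/STCX until the STCX passes; return (old value, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.larx(tid, addr)
        if interfere is not None and attempts == 1:
            interfere()  # another participant writes inside the window
        if mem.stcx(tid, addr, old + 1):
            return old, attempts
```

In this sketch, a write by another participant to any address in the same granule between the LARX and the STCX causes the first STCX to fail, and the loop reissues the pair beginning with the LARX.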
In typical software processing, which includes processing LARX and STCX operations, the STCX is issued from the processor core and processed similarly to other regular store operations. To increase processing efficiency of store operations, conventional processor chips are often designed with a “store queue” that is typically placed between a processor core and the L2 cache and is used to process regular store operations as well as STCX operations. A store queue typically contains byte-addressable storage for a number of cache lines (usually 8 to 16 cache lines).
Store operations originate at the processor core and are temporarily held in an entry of the store queue. The store operations target a particular cache line (or portion of the cache line) identified by the address within the store operation, and the store operation also provides data to be stored within the addressed portion of that cache line (e.g., byte 12).
The store operations update particular bytes within the cache line entry in the store queue. Concurrent with these data updates, corresponding bits within a byte enable register in the store queue are set to track which bytes within the cache line entry have been updated by store operations. Typically, a series of store operations writing to the same entry in the store queue are absorbed by the store queue entry before the line is dispatched to the L2 cache. This absorption of multiple store operations into a single entry is referred to as “gathering” stores, since multiple different stores addressing the same cache line are “gathered” into an entry of the store queue before the line is presented to the L2 cache for storage.
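Gathering and byte enable tracking can be illustrated with a short sketch. This is a simplified model under stated assumptions: a 128-byte cache line, a byte enable register represented as a single integer bit mask, and the hypothetical names StoreQueueEntry, gather, and line_address, none of which come from the described hardware.

```python
LINE_SIZE = 128  # assumed bytes per cache line


def line_address(addr):
    """Return the address of the first byte of the line containing addr."""
    return addr - (addr % LINE_SIZE)


class StoreQueueEntry:
    """One store queue entry holding a cache line's worth of store data."""

    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.data = bytearray(LINE_SIZE)
        self.byte_enable = 0  # bit i set means byte i of the line was written

    def gather(self, addr, payload):
        """Absorb one store into this entry and mark the bytes it wrote."""
        offset = addr - self.line_addr
        if offset < 0 or offset + len(payload) > LINE_SIZE:
            raise ValueError("store does not fall within this cache line")
        self.data[offset:offset + len(payload)] = payload
        self.byte_enable |= ((1 << len(payload)) - 1) << offset
```

For example, a one-byte store to byte 12 of a line followed by a four-byte store to bytes 16 through 19 leaves exactly those five bits set in the mask, so only those bytes need to be merged into the line when the entry is later written to the cache.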
Unlike the normal store operation, however, a STCX is allocated to its own entry, and is one of several operations that are not allowed to gather within an entry. Due to the conditional nature of a STCX, it is impractical to enter the STCX data into an entry in the store queue with other non-STCX operations. Doing so would require significant additional bookkeeping resources to identify those bytes within the store queue entry that have potentially been altered by a STCX and the values these bytes should revert to in the event the STCX operation failed.
FIG. 2 illustrates a prior art representation of specific hardware and logic components of a processor chip that are utilized to complete data store operations. As illustrated, processor chip 201 includes a processor core 203, store queue 207 with store queue (STQ) controller 205, and read claim (RC) dispatch logic 219. STQ controller 205 includes arbitration logic 206 utilized for selecting entries from the store queue 207, as described below. RC dispatch logic 219 supports a series of RC machines 221, which complete the actual data store operations at the lower-level cache (not shown).
The store queue 207 provides several rows (entries) for temporarily storing and tracking processor-issued stores. Each row is divided into several columns that provide byte enable register 209, address register 211, data register 213, control bits 215, and valid bit 217. Data register 213 and address register 211 store data issued from the processor core 203 and the corresponding data (memory) address, respectively. Processor-issued data updates (i.e., store operations) typically target only a small portion (i.e., 1 byte to 16 bytes) of a cache line compared to the entire cache line (typically 128 bytes). For example, it is possible for a processor-issued store operation to target only a single byte granule of a 128-byte cache line to update, and cache line updates are completed via a combination of these individual store operations, which may occur sequentially. Byte enable register 209 includes a number of bookkeeping bits. Conventionally, the number of bits corresponds to the number of individually addressable storage granules within a cache line. Thus, for example, for a 128-byte cache line entry and byte store operations, byte enable register 209 maintains 128 bits for tracking single-byte processor-issued stores. This enables tracking of specific bytes (or groups of bytes) within a 128-byte cache line entry that is being updated by the processor.
The store queue arbitration logic 206 in the store queue controller 205 looks at all the available entries in the queue and determines which entries are eligible to be processed by the RC mechanism 225 based on a set of architectural rules. For instance, a younger store to the same address as an older store cannot be processed before the older store. Neither can stores bypass barrier operations. The arbitration logic 206 then selects one of the eligible stores to present as a request to the RC mechanism 225 for further processing. The selection process is typically a round robin scheme that takes no account of the age of the eligible store operations. If the RC mechanism 225 accepts the request, the store queue entry closes down its ability to gather and is removed from the queue. If the RC mechanism 225 rejects the request, the arbitration logic 206 then selects another eligible store, if one is available, or tries again with the same entry if there are no others.
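The eligibility check and round robin selection described above can be sketched as follows. This is an illustrative model only. The names Entry, is_eligible, and select_round_robin are hypothetical, the age field stands in for allocation order, and the treatment of a barrier entry as blocked by every older entry is an assumption made for the sketch rather than a rule stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    age: int                  # smaller means older (assumed allocation order)
    line_addr: int = 0
    valid: bool = True
    is_barrier: bool = False


def is_eligible(entries, i):
    """Apply the two ordering rules from the text to the entry in slot i."""
    e = entries[i]
    if e is None or not e.valid:
        return False
    for o in entries:
        if o is None or not o.valid or o.age >= e.age:
            continue  # only older valid entries can block this one
        # Blocked by an older barrier or an older store to the same line.
        # Assumption: a barrier itself waits for every older entry.
        if o.is_barrier or e.is_barrier or o.line_addr == e.line_addr:
            return False
    return True


def select_round_robin(entries, last):
    """Return the next eligible slot after `last`, ignoring entry age."""
    n = len(entries)
    for step in range(1, n + 1):
        i = (last + step) % n
        if is_eligible(entries, i):
            return i
    return None  # nothing is eligible this cycle
```

Note that selection depends only on slot position relative to the last slot chosen, so an eligible entry may be passed over in favor of a younger one, which is the behavior the text attributes to the conventional scheme.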
Valid bit 217 indicates to STQ controller 205 when data within a particular row of the store queue 207 is valid, and valid bit 217 is checked before arbitration logic 206 selects a row of data (or an entry) to forward to RC dispatch logic 219. Once a valid bit is set, arbitration logic 206 is able to select the entry regardless of whether additional stores to that cache line are being sent by the processor core, as long as the architectural rules for processing stores are observed. Control bits 215 represent an assortment of additional bits that are utilized by STQ controller 205. The functionality of several of the above-described columns is referenced within the description of the data store operations below.
In the store queue described above, a STCX is given no consideration over any other store and must wait its turn to be selected for dispatch by STQ controller 205. Consequently, a processor core can be made to wait longer for a pass or fail indication for the STCX if there are other operations in the store queue.
FIG. 3A provides a flow chart illustrating the overall processing of a LARX/STCX instruction sequence. The process begins at step 341 and proceeds to step 343, at which the LARX operation is issued to read the desired location and inform the coherence mechanism to monitor for any writes to the reservation granule containing the desired location. Next, the STCX operation is issued to the store queue at step 345, the STCX is allocated an entry within the store queue, and the entry is marked valid for selection, as shown at step 347. After the entry containing the STCX becomes eligible for dispatch based on architectural rules, the entry is eventually selected for dispatch by the arbitration logic, as shown at step 349. The STCX is then dispatched to attempt to update the desired location at step 351. At step 353, a determination is made by the dispatch mechanism whether the STCX was successful. If the STCX was successful, then the cache line is updated with the data from the STCX operation and the success is signaled to the processor at step 355. However, if the STCX was unsuccessful (i.e., failed), the failure of the operation is signaled to the processor at step 357, and the processor responds accordingly. The process then ends at step 359.
FIG. 3B illustrates a process by which a STCX operation issued by a processor is assigned to an entry within the store queue. The process begins at step 301 and proceeds to step 303, at which a determination is made whether there is an available entry within the store queue to assign to the next store operation. When all entries of the store queue have been assigned (i.e., there is no available entry to assign to a new store operation and no gatherable entry exists for that store operation), the processor core suspends issuance of new store operations to the queue until an entry becomes available, as indicated at step 305.
In conventional systems, a tracking mechanism is provided within the core and/or the store queue to track when there are available entries to assign to store operations being issued by the core. The core is thus able to suspend issuance of store operations when those operations cannot be buffered within the store queue.
Typically, an entry becomes available when the contents of that entry are dispatched to an RC machine. That is, an entry becomes available when an older cache line entry is removed from the store queue and sent to the L2 cache for storage therein. A variety of different policies (some described below) may be utilized to determine when cache lines are moved from the store queue to be stored in the L2 cache. One important consideration in this process is the status of the valid bit associated with the entry. An entry can only be selected if the valid bit associated with the entry is set.
Returning to decision step 303, when there is an available entry, the processor core issues the STCX operation to the store queue as shown at step 304. The STCX operation is received at the store queue, and an available (un-allocated) entry is allocated to the STCX operation, as shown at step 307. Then the entry's valid bit 217 is set at step 311 to signal to arbitration logic 206 that the entry is ready for dispatch. The process then ends at step 313.
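The allocation behavior of FIG. 3B, together with the rule that a STCX never gathers, can be sketched as below. This is a simplified illustration under stated assumptions: the class StoreQueue, its method names, the dictionary layout of an entry, and the 128-byte line size are hypothetical, and a return value of None stands in for the core suspending issuance when no entry is available.

```python
LINE_SIZE = 128  # assumed bytes per cache line


class StoreQueue:
    """Toy store queue: ordinary stores gather, a STCX gets its own entry."""

    def __init__(self, num_entries=8):
        self.entries = [None] * num_entries

    def _allocate(self, line, is_stcx):
        for i, e in enumerate(self.entries):
            if e is None:
                # The valid flag marks the entry as ready for arbitration.
                self.entries[i] = {
                    "line": line, "is_stcx": is_stcx,
                    "valid": True, "stores": 1,
                }
                return i
        return None  # queue full: the core must hold the operation

    def issue_store(self, addr):
        line = addr // LINE_SIZE
        # Gather into an existing ordinary entry for the same line if any.
        for i, e in enumerate(self.entries):
            if e is not None and e["line"] == line and not e["is_stcx"]:
                e["stores"] += 1
                return i
        return self._allocate(line, is_stcx=False)

    def issue_stcx(self, addr):
        # A STCX never gathers; it always needs an un-allocated entry.
        return self._allocate(addr // LINE_SIZE, is_stcx=True)

    def dispatch(self, i):
        """Remove entry i from the queue, freeing its slot."""
        entry, self.entries[i] = self.entries[i], None
        return entry
```

In a two-entry queue, for instance, two ordinary stores to the same line share one entry, a STCX to that same line takes the second entry, and a further STCX cannot be accepted until one of the entries is dispatched.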
When a cache line entry is removed from the store queue to be sent to the L2 cache, the cache line entry is assigned by RC dispatch logic 219 to RC state machine 221, which updates the cache line of the L2 cache with the data from within the entry of store queue 207. Thus, for every RC machine 221 assigned to a store operation, the entire cache line must be read and manipulated regardless of how many bytes of the cache line are actually being updated.
FIG. 3C provides a flow chart of the processes involved in selecting an entry of the store queue to forward to the lower level cache. The process begins at step 321 and proceeds to step 323, at which the STQ controller scans the valid bits of each entry to see which entries are eligible for dispatch. A determination is made at step 325 whether there are valid entries eligible for selection by the arbitration logic. When there are valid entries (i.e., entries that have their valid bit 217 set to logic high and are architecturally ready), the arbitration logic selects one entry for dispatch from among all eligible entries and forwards the selected entry to the RC mechanism, as shown at step 327. The process then continues for other entries.
In determining which entry to select for dispatch, the arbitration logic looks at all the valid entries in the queue and determines, based on a set of architectural rules, which entries are eligible to be processed by the RC machine. The arbitration logic selects one of the eligible entries and signals the RC dispatch logic of the availability of that entry for dispatch to an RC machine. Conventional selection processes are typically via a round robin scheme amongst eligible entries. With this conventional approach, an entry that holds a newly issued STCX operation is given no higher priority than any other store within the store queue that is concurrently available for selection based on the architectural rules.
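The consequence of age-blind round robin selection for a STCX can be made concrete with a small sketch. It assumes, purely for illustration, that every occupied slot is eligible and that every dispatch request is accepted; the function name round_robin_wait and its parameters are hypothetical.

```python
def round_robin_wait(occupied, stcx_slot, last):
    """Count entries dispatched before the STCX under round robin.

    occupied  -- list of booleans, True where a slot holds an eligible entry
    stcx_slot -- index of the slot holding the STCX
    last      -- index of the slot most recently selected
    """
    n = len(occupied)
    waited = 0
    i = (last + 1) % n
    # Walk the slots in round robin order until the STCX slot is reached.
    while i != stcx_slot:
        if occupied[i]:
            waited += 1
        i = (i + 1) % n
    return waited
```

Under these assumptions, a STCX allocated to the slot just behind the round robin pointer in a full eight-entry queue waits for all seven other entries to be dispatched first, whereas one allocated just ahead of the pointer is selected immediately. The delay therefore depends on slot position rather than on the fact that the core is waiting for a pass or fail indication.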
The above-described method of updating a cache line within the L2 cache with STCX operations yields a number of inefficiencies, particularly when other processes are arbitrating for write access to the same cache line. Frequently, as is known in the art, the reservation granule is updated by some other participant (processor, etc.) before the STCX operation completes its update to the line. Because of the latency involved in passing the STCX through the store queue mechanism, and the tendency for other processors to seek to update the same cache line, it is not uncommon for a STCX to fail. When this occurs, the processor is forced to reissue the operation pair (beginning with the LARX), which requires extra use of processor resources and causes a measurable increase in latency when completing the update to the target cache line. In general, when processing LARX/STCX pairs, it is desirable to minimize the window between the LARX and STCX operations as much as possible in order to help ensure that no other writes to the reservation granule occur that would prevent the STCX operation from completing successfully.
The present invention recognizes the need for a more efficient implementation of LARX/STCX operations to reduce the occurrence of failed STCX operations and the associated drain on processor resources. A method and system that reduces the latency between the completion of the LARX and the arrival at the cache of the STCX operation to update the cache line would be a welcome improvement. These and other benefits are provided by the invention described herein.