1. Technical Field
The present invention relates generally to data processing systems and specifically to processing store operations within a processor chip. Still more particularly, the present invention relates to an improved system and method of reordering store operations within a processor chip for more efficient processing.
2. Description of the Related Art
Increasing efficiency of data operation at the processor-cache level is an important aspect of processor chip development. Modern microprocessors typically include entire storage hierarchies (caches) integrated into a single integrated circuit. For example, one or more processor cores containing L1 instruction and/or data caches are often combined with a shared on-chip L2 cache.
In systems with on-chip caches, processor-issued data store operations typically target only a small portion (e.g., 1 byte to 16 bytes) of a cache line compared to the entire cache line (e.g., typically 128 bytes). For example, it is possible for a processor-issued store operation to target only a single byte granule of a 128-byte cache line to update, and cache line updates are completed via a combination of these individual store operations, which may occur sequentially. In order to increase efficiency, processor chips are often designed with a “store queue” that is typically placed between a processor core and the L2 cache. A store queue typically contains byte-addressable storage for a number of cache lines (usually 8 to 16 cache lines).
FIG. 2 illustrates a prior art representation of specific hardware and logic components of a processor chip that are utilized to complete data store operations. As illustrated, processor chip 201 includes a processor core 203, store queue 207 with store queue (STQ) controller 205, and read-claim (RC) dispatch logic 219. STQ controller 205 includes write pointer 204 and read pointer 206 for selecting entries within store queue 207. The operation of both write pointer 204 and read pointer 206 is discussed herein in more detail in conjunction with FIGS. 3A and 3B. RC dispatch logic 219 supports a series of RC machines 221, which complete the actual data store operations at the lower-level cache (not shown).
Store queue 207 provides several rows (entries) for temporarily storing and tracking processor-issued stores. Each row is divided into several columns that provide byte enable register 209, address register 211, data register 213, control bits 215, and valid bit 217. Data register 213 and address register 211 store the data issued from processor core 203 and the corresponding memory address, respectively. Byte enable register 209 includes a number of bookkeeping bits. Conventionally, the number of bits corresponds to the number of individually addressable storage granules within a cache line. Thus, for example, for a 128-byte cache line entry and byte store operations, byte enable register 209 maintains 128 bits for tracking single-byte processor-issued stores. This enables tracking of specific bytes (or groups of bytes) within a 128-byte cache line entry that is being updated by processor core 203.
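The byte enable tracking described above can be sketched as follows. This is an illustrative model only, not from the patent itself; the class name, field names, and the 128-byte line size are assumptions drawn from the surrounding description.

```python
# Illustrative sketch of a store queue entry with a byte enable register.
# One tracking bit exists per individually addressable byte granule of a
# 128-byte cache line; a bit is set when a store updates that byte.
CACHE_LINE_SIZE = 128  # bytes per cache line (value taken from the text)

class StoreQueueEntry:
    def __init__(self):
        self.data = bytearray(CACHE_LINE_SIZE)        # data register
        self.byte_enable = [False] * CACHE_LINE_SIZE  # one bit per byte granule
        self.valid = False                            # valid bit

    def apply_store(self, offset, payload):
        """Record a processor-issued store of 1 to 16 bytes at a byte offset."""
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.byte_enable[offset + i] = True
        self.valid = True

entry = StoreQueueEntry()
entry.apply_store(12, b"\xab")               # single-byte store to byte 12
entry.apply_store(16, b"\x01\x02\x03\x04")   # 4-byte store to bytes 16-19
updated = [i for i, e in enumerate(entry.byte_enable) if e]
```

After these two stores, only bytes 12 and 16 through 19 are marked as updated, which is how the controller later knows which granules of the line carry new data.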
Valid bit 217 indicates to STQ controller 205 when data within a particular row of store queue 207 is valid, and valid bit 217 is checked before a row of data (or an entry) is forwarded to RC dispatch logic 219. Once a valid bit is set, the row of data (or an entry) is selectable regardless of whether additional stores to that cache line are being sent by processor core 203. Control bits 215 represent an assortment of additional bits that are utilized by STQ controller 205. The functionality of several of the above-described columns is referenced within the description of the data store operations below.
Store operations typically originate at processor core 203 and are temporarily stored in an entry of store queue 207. The store operations target a particular cache line (or portion of the cache line) identified by the address within the store operation, and the operation also provides data to be stored within the addressed portion of that cache line (e.g., byte 12).
The store operations update particular bytes within the cache line entry. Concurrent with these data updates, corresponding bits within byte enable register 209 are set to track which bytes within the cache line entry have been updated by store operations. Typically, a series of store operations writing to a same entry in the store queue are absorbed by the store queue entry before the line is dispatched to the L2 cache. This absorption of multiple store operations into a single entry is referred to as “gathering” stores, since multiple different stores addressing the same cache line are “gathered” into an entry of the store queue buffer before the line is presented to the L2 cache for storage.
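The gathering behavior described above can be sketched as follows. This is a hypothetical model, not the patented implementation; the class and method names are illustrative, and real hardware would use fixed entries rather than a dictionary.

```python
# Illustrative sketch of "gathering" stores: multiple store operations that
# address the same cache line are absorbed into a single store queue entry
# before the line is presented to the L2 cache.
LINE_BYTES = 128  # cache line size, per the surrounding description

class GatheringStoreQueue:
    def __init__(self):
        # Maps cache line base address -> {byte offset: byte value}
        self.entries = {}

    def store(self, addr, payload):
        line = (addr // LINE_BYTES) * LINE_BYTES  # cache line base address
        offset = addr - line
        entry = self.entries.setdefault(line, {})  # gather into existing entry
        for i, b in enumerate(payload):
            entry[offset + i] = b  # a later store overwrites an earlier byte

q = GatheringStoreQueue()
q.store(0x1000 + 12, b"\xaa")        # first store allocates the entry
q.store(0x1000 + 12, b"\xbb")        # gathered: same byte, newer value wins
q.store(0x1000 + 64, b"\x01\x02")    # gathered: same line, different bytes
```

All three stores land in one entry, so the L2 cache sees a single dispatch rather than three separate ones.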
FIG. 3A is a high-level logical flow chart illustrating the operation of write pointer 204 as implemented within STQ controller 205 during a store operation in accordance with the prior art. The process begins at step 300 and continues to step 304, which depicts processor core 203 issuing a store operation to store queue 207. If, however, STQ controller 205 determines that store queue 207 is full, STQ controller 205 sends a message instructing processor core 203 to halt the sending of store operations until some entries have been dispatched by RC dispatch logic 219.
The process then continues to step 306, which illustrates STQ controller 205 determining whether or not a gatherable entry is available. As referenced above, a “gatherable” entry is a single entry in which multiple store operations may be absorbed before the entry is presented to the L2 cache for storage. If STQ controller 205 determines that a gatherable entry is available, the process continues to step 307, which depicts the store operation being gathered into an existing entry.
Returning to step 306, if STQ controller 205 determines that a gatherable entry is not available, the store operation is allocated at the location defined by the current position of write pointer 204, as illustrated in step 308. The process then continues to step 310, which illustrates STQ controller 205 incrementing write pointer 204 to the next entry in store queue 207. The process then returns to step 304 and proceeds in an iterative fashion.
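The write-pointer flow of FIG. 3A (steps 304 through 310) can be sketched as below. This is an assumed model for illustration; the names are hypothetical, and the gatherability test is simplified to an address match.

```python
# Illustrative sketch of the write-pointer allocation loop of FIG. 3A.
# A new store gathers into an existing entry for the same cache line when
# possible; otherwise it allocates at the write pointer, which then advances.
QUEUE_DEPTH = 8  # assumed store queue depth for this sketch

class WriteSide:
    def __init__(self):
        self.lines = [None] * QUEUE_DEPTH  # cache-line address per entry
        self.write_ptr = 0

    def accept_store(self, line_addr):
        """Return the entry index that absorbs the store (steps 306-310)."""
        if line_addr in self.lines:              # step 306: gatherable entry?
            return self.lines.index(line_addr)   # step 307: gather into it
        slot = self.write_ptr                    # step 308: allocate at pointer
        self.lines[slot] = line_addr
        self.write_ptr = (self.write_ptr + 1) % QUEUE_DEPTH  # step 310
        return slot

w = WriteSide()
a = w.accept_store(0x100)  # allocates entry 0
b = w.accept_store(0x100)  # gathers into entry 0; pointer does not move
c = w.accept_store(0x200)  # allocates entry 1
```

Note that the write pointer advances only on allocation, not on a gather, which matches the branch structure of the flow chart.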
FIG. 3B is a high-level logical flowchart depicting the operation of read pointer 206 as implemented within STQ controller 205 when dispatching store operations to RC dispatch logic 219 in accordance with the prior art. The process begins at step 350 and proceeds to step 352, which illustrates STQ controller 205 determining if write pointer 204 is pointing to the same entry in store queue 207 as read pointer 206. If STQ controller 205 determines that write pointer 204 is pointing to the same entry in store queue 207 as read pointer 206, the process continues to step 354, which depicts STQ controller 205 determining if the entry is valid. If the entry is not valid, this indicates to STQ controller 205 that store queue 207 is empty and there are no valid entries available for dispatch to RC dispatch logic 219. Therefore, if STQ controller 205 determines that the entry is not valid, the process returns to step 352 and continues in an iterative fashion.
However, if STQ controller 205 determines that the entry is valid, the process then continues to step 356, which illustrates store queue 207 attempting a dispatch of the entry to RC machines 221. Returning to step 352, if STQ controller 205 determines that write pointer 204 is not pointing at the same entry in store queue 207 as read pointer 206, the process also continues to step 356. The process then continues to step 358, which depicts STQ controller 205 determining whether or not the dispatch of the entry was successful. At the completion of a successful dispatch, RC dispatch logic 219 sends a “dispatch complete” signal back to STQ controller 205. If STQ controller 205 receives a “dispatch complete” signal from RC dispatch logic 219, the process continues to step 360, which illustrates STQ controller 205 moving read pointer 206 to the next entry in store queue 207. Then, the process returns to step 352 and continues in an iterative fashion. Returning to step 358, if STQ controller 205 does not receive a “dispatch complete” signal from RC dispatch logic, the process returns to step 356 and continues in an iterative fashion.
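The read-pointer dispatch loop of FIG. 3B (steps 352 through 360) can be sketched as follows. This is an assumed simplification for illustration: `try_dispatch` stands in for the RC dispatch logic and its “dispatch complete” signal, and the valid-bit check of step 354 is folded into the pointer comparison.

```python
# Illustrative sketch of the read-pointer dispatch loop of FIG. 3B.
# The read pointer advances only after a successful dispatch; a failed
# dispatch causes the loop to retry the same entry (head-of-line blocking).
QUEUE_DEPTH = 8  # assumed store queue depth for this sketch

def drain(num_valid, try_dispatch, max_attempts=100):
    """Dispatch entries 0..num_valid-1 in FIFO order; return dispatch order."""
    read_ptr, write_ptr = 0, num_valid
    order = []
    attempts = 0
    while read_ptr != write_ptr and attempts < max_attempts:  # step 352
        attempts += 1
        if try_dispatch(read_ptr):                 # steps 356/358
            order.append(read_ptr)
            read_ptr = (read_ptr + 1) % QUEUE_DEPTH  # step 360
        # on failure, iterate on the same entry (return to step 356)
    return order

# Entry 0 fails to dispatch once before succeeding; the queue stalls on it,
# then drains the remaining entries in order.
remaining_failures = {0: 1}
def flaky(idx):
    if remaining_failures.get(idx, 0) > 0:
        remaining_failures[idx] -= 1
        return False
    return True

result = drain(3, flaky)
```

The retry on entry 0 models the iterative loop between steps 356 and 358; entries 1 and 2 cannot dispatch until entry 0 completes, which is the FIFO limitation the next paragraph describes.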
Those skilled in this art will appreciate that the dispatch of store queue entries to RC machines can often be stalled by RC machines already working on the same address or snoop machines servicing requests from other processors. New RC machines typically cannot dispatch on the same address as another RC or snoop machine. The store queue will continue to iterate on the same entry until the entry is successfully dispatched, as indicated in steps 356 and 358 of FIG. 3B. This iterative loop can expend hundreds of cycles as the RC machine could be working to retrieve the requested cache line from main memory. The delayed dispatch of store queue entries to the RC machines can eventually result in a backup of the store pipe in the processor core, which could result in a stall of the entire processor. Therefore, the first-in-first-out (FIFO) nature of store queue 207 will not allow the dispatch of other store queue entries until the current store queue entry referenced by read pointer 206 is dispatched.
The present invention recognizes the benefits of providing a more efficient dispatch of entries from a store queue to associated RC dispatch logic. This invention further recognizes that it would be desirable to provide a system and method for re-ordering dispatch requests for entries within a store queue to reduce the likelihood for the store queue to iterate on an entry while waiting for the entry to be successfully dispatched by the RC dispatch logic. These and other benefits are provided by the invention described herein.