1. Technical Field
The present invention relates generally to processor chips and specifically to processing store operations within a processor chip. Still more particularly, the present invention relates to speculative issuance of store operations to a store queue within a processor chip.
2. Description of the Related Art
Increasing efficiency of data operations at the processor-cache level is an important aspect of processor chip development. Modern microprocessors typically contain entire storage hierarchies (caches) integrated onto a single integrated circuit. For example, one or more processor cores containing L1 instruction and/or data caches are often combined with a shared on-chip L2 cache.
In systems with on-chip caches, processor-issued data store operations typically target only a small portion (e.g., 1 byte to 16 bytes) of a cache line rather than the entire cache line (typically 128 bytes). For example, a processor-issued store operation may update only a single byte granule of a 128-byte cache line, and cache line updates are completed via a combination of these individual store operations, which may occur sequentially. In order to increase efficiency, processor chips are often designed with a “store queue” that is typically placed between a processor core and the L2 cache. A store queue typically contains byte-addressable storage for a number of cache lines (usually 8 to 16 cache lines).
FIG. 2A illustrates a prior art representation of specific hardware and logic components of a processor chip that are utilized to complete data store operations. As illustrated, processor chip 201 includes a processor core 203 and store queue mechanism 240, which includes store queue 207 and store queue (STQ) controller 205. RC mechanism 225 includes RC dispatch logic 219 and associated RC machines 221. STQ controller 205 includes arbitration logic 206 utilized for selecting entries from the store queue 207, as described below. RC dispatch logic 219 supports a series of RC machines 221, which complete the actual data store operations at the lower-level cache (not shown).
The store queue 207 provides several rows (entries) for temporarily storing and tracking processor-issued stores. Each row is divided into several columns that provide byte enable register 209, address register 211, data register 213, control bits 215, and valid bit 217. Data register 213 and address register 211 store data issued from the processor core 203 and the corresponding data (memory) address, respectively. Byte enable register 209 includes a number of bookkeeping bits. Conventionally, the number of bits corresponds to the number of individually addressable storage granules within a cache line. Thus, for example, for a 128-byte cache line and byte-size store operations, byte enable register 209 maintains 128 bits for tracking single-byte processor-issued stores within the entry (i.e., a buffer that temporarily holds data of one or more store operations that update the same target cache line). This enables tracking of specific bytes (or groups of bytes) within a 128-byte cache line entry that are being updated by the processor.
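The row layout described above can be pictured as a small record. The following is a minimal sketch only; the class name, field names, and the choice of an integer bit mask for the byte enable register are illustrative assumptions, not part of the described hardware.

```python
from dataclasses import dataclass, field

CACHE_LINE_BYTES = 128  # example cache line size used in the text


@dataclass
class StoreQueueEntry:
    """One row of the store queue (names are hypothetical)."""
    # Byte enable register 209: one bit per byte of the cache line.
    byte_enable: int = 0
    # Address register 211: address of the target cache line.
    address: int = 0
    # Data register 213: byte-addressable storage for one cache line.
    data: bytearray = field(
        default_factory=lambda: bytearray(CACHE_LINE_BYTES))
    # Control bits 215: assorted bits used by the STQ controller.
    control: int = 0
    # Valid bit 217: checked before the entry may be selected.
    valid: bool = False

    def enabled_bytes(self):
        """Return the offsets of bytes marked as updated."""
        return [i for i in range(CACHE_LINE_BYTES)
                if (self.byte_enable >> i) & 1]
```

For instance, marking byte 12 as updated amounts to setting bit 12 of `byte_enable`, after which `enabled_bytes()` reports that single offset.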
Valid bit 217 indicates to STQ controller 205 when data within a particular row of the store queue 207 is valid, and valid bit 217 is checked before arbitration logic 206 selects a row of data (or an entry) to forward to RC Dispatch logic 219. Once a valid bit is set, arbitration logic 206 is able to select the entry regardless of whether additional stores to that cache line are being sent by the processor core. Control bits 215 represent an assortment of additional bits that are utilized by STQ controller 205.
Store operations typically originate at the processor core 203 and are temporarily stored in an entry of the store queue 207 until dispatched to the lower level (L2) cache for storage. The store operations target a particular cache line (or portion of the cache line) identified by the address within the store operation, and the operation also provides data to be stored within the addressed portion of that cache line (e.g., byte 12).
The store operations update particular bytes within the cache line entry. Concurrent with these data updates, corresponding bits within the byte enable register are set to track which bytes within the cache line entry have been updated by store operations. Typically, a series of store operations writing to the same entry in the store queue are absorbed by the store queue entry before the line is dispatched to the L2 cache. This absorption of multiple store operations into a single entry is referred to as “gathering” stores, since multiple different stores addressing the same cache line are “gathered” into an entry of the store queue buffer before the line is presented to the L2 cache for storage. The gathering of stores allows many different store operations targeting a given cache line to be absorbed by the store queue before the entry is sent to update the L2 cache.
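Gathering, as described above, can be sketched as a data write paired with a byte enable update. This is an illustrative model under stated assumptions: the `Entry` class, the `gather` method, and the 128-byte line size are hypothetical names and values chosen for the example.

```python
CACHE_LINE_BYTES = 128  # example cache line size used in the text


def line_address(addr):
    """Return the base address of the cache line containing addr."""
    return addr - (addr % CACHE_LINE_BYTES)


class Entry:
    """Hypothetical store queue entry that gathers stores to one line."""

    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.data = bytearray(CACHE_LINE_BYTES)
        self.byte_enable = 0  # one bit per byte of the line

    def gather(self, addr, payload):
        """Absorb one store: write its bytes and set matching enable bits."""
        offset = addr - self.line_addr
        if offset < 0 or offset + len(payload) > CACHE_LINE_BYTES:
            raise ValueError("store does not fall within this cache line")
        self.data[offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            self.byte_enable |= 1 << i
```

Two stores to different offsets of the same line end up in one entry, with the byte enable bits recording exactly which bytes the L2 update must apply.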
When a cache line entry is removed from the store queue to be sent to the L2 cache, the cache line entry is sent to the RC dispatch logic and assigned to an RC state machine, which updates the cache line in the L2 cache with the data from within the entry. Thus, for every RC machine assigned to a store operation, the entire cache line must be read and manipulated regardless of how many bytes of the cache line are actually being updated. It is thus more efficient to absorb multiple stores in the store queue entry before passing the line on to the L2 cache. Gathering stores also reduces the number of RC machine tenures required for store operations to a single cache line and also reduces the time required to update a cache line when multiple processor-issued stores update individual portions of the same cache line.
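The saving in RC machine tenures can be made concrete with a rough count. The sketch below is a simplification under stated assumptions: the function name `rc_tenures` is hypothetical, and gathering is modeled as merging only consecutive stores to the same line into one dispatch, ignoring barriers and dispatch timing.

```python
def rc_tenures(store_addresses, gather, line_bytes=128):
    """Estimate RC machine tenures needed for a sequence of stores.

    Without gathering, every store is dispatched on its own.
    With gathering (simplified), a run of consecutive stores to the
    same cache line is dispatched once.
    """
    if not gather:
        return len(store_addresses)
    tenures = 0
    open_line = None  # line currently being gathered
    for addr in store_addresses:
        line = addr // line_bytes
        if line != open_line:
            tenures += 1
            open_line = line
    return tenures
```

Under these assumptions, sixteen single-byte stores to one 128-byte line need sixteen tenures without gathering and one tenure with it.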
An existing entry is usually available for gathering when the entry holding previously issued store operation(s) for the same cache line address has not yet been selected for dispatch to an RC machine. In conventional implementations, once an entry in the store queue has been assigned to a target cache line, subsequent stores targeting that cache line are gathered within that entry until a condition occurs that prevents further gathering of store operations to that entry. The STQ controller 205 controls when stores to a cache line are allowed to gather. For example, the STQ controller may prevent further gathering of stores to an entry when the entry is selected for dispatch. Also, a gather is typically stopped when a barrier operation is encountered, as is known to those skilled in the art.
Returning now to FIG. 2A, in addition to the above described hardware components, processor chip 201 of FIG. 2A also includes a core interface unit (CIU) 230, which is located between the store queue controller 205 and the core 203. CIU 230 keeps track of the number of entries the store queue is currently using. A “store_gathered” signal 236 is sent to the CIU 230 from the store queue controller 205 when an issued store operation gathers into an existing entry (i.e., did not use up a new entry). Also, a “pop” signal 238 is sent to the CIU 230 from the store queue controller 205 when a store operation is forwarded to the L2 cache, allowing the entry to be re-assigned to a new set of store operations. Store_Req 231 tells the CIU 230 when the store queue 207 may assign an entry to a new store operation. With these signals, the CIU 230 is able to keep track of the number of entries the store queue is currently utilizing. The CIU 230 communicates with the processor core 203 via a “store busy” (or “store full”) signal 234 that informs the core's issuing logic 245 when the store queue 207 is full. The store queue full signal is asserted (e.g., a logic high “1”) when all the entries are being used and de-asserted (e.g., a logic low “0”) when there is an available entry in the store queue.
With this configuration, the store queue 207 sends these handshake signals to the CIU 230, and the CIU 230 sends store busy signals to the core 203. The processor core 203 thus detects a “queue full” condition when the core receives a store full signal 234 from the CIU 230 indicating that all store queue entries are being utilized.
Associated with CIU 230 is a mechanism for counting the number of entries being utilized within the store queue. This mechanism is referred to as the entry tracking logic (ETL). ETL 232 keeps track of how many store queue entries are currently being used, and the ETL 232 also signals the core to stop issuing more store operations to the store queue by informing the core when the store queue is full. In conventional systems, the ETL 232 tracks when there are available entries to assign to store operations being issued by the core. The core is thus able to suspend issuance of store operations when those operations cannot be buffered within the store queue. In the processor design of FIG. 2A, the ETL 232 is located within the CIU 230.
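The bookkeeping performed by the ETL can be sketched as a counter driven by the three handshake signals. The exact counting rule is not spelled out in the text, so this sketch assumes that each store request is provisionally counted as consuming a new entry, that a store_gathered signal undoes that count, and that a pop frees one entry. The class and method names are hypothetical.

```python
class EntryTrackingLogic:
    """Hypothetical model of the ETL 232 inside the CIU 230."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.in_use = 0

    def store_req(self):
        # Assumption: an issued store is counted as taking a new entry.
        self.in_use += 1

    def store_gathered(self):
        # The store merged into an existing entry, so no entry was used.
        self.in_use -= 1

    def pop(self):
        # An entry was forwarded to the L2 cache and can be reassigned.
        self.in_use -= 1

    @property
    def store_full(self):
        # Asserted when every entry is counted as in use.
        return self.in_use >= self.num_entries
```

In this model the core would sample `store_full` before issuing each store, which corresponds to the store full signal 234.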
In other processor chip designs, a simple counter is provided to assist in tracking the number of entries being utilized within the store queue. The counter is located in either the processor core itself or in the store queue mechanism. Some comparative logic is provided to compare the count value against the threshold value that indicates the store queue is full. When the counter is within the core, the core simply stops issuing new store operations when the counter indicates all the entries are being utilized. When the counter is located within the store queue, the store queue sends a full signal to the core, which causes the core to stop issuing store operations.
These other processor designs are illustrated by FIGS. 2B and 2C, which respectively provide the entry tracking logic 232 as a part of either the store queue mechanism 240 or the core 203. With the processor design of FIG. 2B, where the entry tracking logic 232 is located within (or associated with) the store queue mechanism 240, all tracking occurs at the store queue and a busy signal 234 is sent up to inform the core 203 when to stop/suspend issuing store operations. However, with the processor design of FIG. 2C, where the entry tracking logic 232 is located within the core 203, the handshake signals, including pop signal 238 and store_gathered signal 236, are sent on the interface to the core.
Notably, a series of secondary processes are required for determining whether there is an available entry within the store queue. The particular process utilized is dependent on the configuration of the processor chip, with respect to the location of the ETL. FIGS. 3A and 3B, described below, both assume a processor chip design similar to that of FIG. 2A.
FIG. 3A illustrates a general process by which processor core 203 of FIG. 2A determines when to issue store operations to the store queue. The process begins at step 301 and proceeds to step 303 at which a determination is made whether there is a new store operation to be issued. When there is a new store operation to be issued, a next determination is made at step 305 whether the store queue is full. Determining when the store queue is full involves a check of whether the store queue full signal is asserted. In all prior art implementations, the store queue asserts the full signal as soon as a last available entry receives a single store operation. Thus, the “full” store queue may actually include one or more partially full entries that may still be able to gather store operations and for which store operations are still being generated within the execution pipeline.
If the store queue is full (i.e., the store queue full signal is asserted), then the processor withholds (i.e., stalls/suspends) issuance of additional store operations at step 306 until the core is signaled that the store queue has an available entry (i.e., the store queue full signal is de-asserted). If the store queue is not initially full, the processor issues the store operation to the store queue, as shown at step 307.
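The issue decision of FIG. 3A reduces to two checks per attempt. The following sketch is illustrative only; the function name `core_issue_step` and the use of a deque for stores awaiting issue are assumptions made for the example.

```python
from collections import deque


def core_issue_step(pending_stores, store_full):
    """Return the store issued on this attempt, or None if none issues.

    pending_stores: deque of stores awaiting issue, oldest first.
    store_full: current state of the store queue full signal.
    """
    if not pending_stores:           # step 303: no new store to issue
        return None
    if store_full:                   # step 305: store queue is full
        return None                  # step 306: withhold issuance
    return pending_stores.popleft()  # step 307: issue to the store queue
```

Note that the decision depends only on the full signal, not on the target address of the waiting store, so a store that could have gathered into an existing entry is withheld like any other.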
FIG. 3B illustrates a general process by which a store queue mechanism 240 handles a store operation that is received from the processor core 203 within a processor chip configured similarly to FIG. 2A. The process begins at step 310 and proceeds to step 311 at which a store operation is received at the store queue from the processor core. A determination is made at step 313 whether the store queue is currently holding a gatherable entry for the cache line that is the target of the store operation. The store queue checks the addressing component of the store operation and examines the address associated with each active entry. When an entry assigned to the target cache line is present in the store queue and the entry is allowed to gather, the entry gathers the store operation, as shown at step 315.
Once an entry gathers the store operation, the store queue controller 205 asserts the store_gathered signal, which is transmitted to the CIU, as shown at step 317. If the target cache line is not represented by a gatherable entry within the store queue, the store operation is allocated to an unused entry (i.e., an entry that is not currently being utilized for a different target cache line), as shown at step 319. The process then ends at step 321.
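The receive path of FIG. 3B can be sketched as an address match followed by either a gather or an allocation. This is a minimal model under stated assumptions: the `StoreQueue` class, the dictionary-based entries, the `gatherable` flag, and the returned strings are all hypothetical, and stores are assumed not to cross a cache line boundary.

```python
CACHE_LINE_BYTES = 128  # example cache line size used in the text


class StoreQueue:
    """Hypothetical model of how the store queue handles a received store."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries  # None marks an unused entry

    @staticmethod
    def _write(entry, offset, payload):
        # Update the data bytes and the matching byte enable bits.
        entry["data"][offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            entry["byte_enable"] |= 1 << i

    def receive(self, addr, payload):
        line, offset = divmod(addr, CACHE_LINE_BYTES)
        if offset + len(payload) > CACHE_LINE_BYTES:
            raise ValueError("store crosses a cache line boundary")
        # Step 313: look for a gatherable entry for the target line.
        for entry in self.entries:
            if (entry is not None and entry["line"] == line
                    and entry["gatherable"]):
                self._write(entry, offset, payload)  # step 315
                return "store_gathered"              # step 317
        # Step 319: otherwise allocate an unused entry.
        for i, entry in enumerate(self.entries):
            if entry is None:
                new = {"line": line, "gatherable": True, "byte_enable": 0,
                       "data": bytearray(CACHE_LINE_BYTES)}
                self._write(new, offset, payload)
                self.entries[i] = new
                return "allocated"
        # In the described design the core is stalled before this happens.
        raise RuntimeError("no unused entry available")
```

The final error case is not part of the described flow: the full signal is meant to keep the core from issuing a store that would need a new entry when none exists.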
Assertion of the store full signal causes the processor core to suspend issuance of store operations to the store queue until an entry becomes available and the store full signal is de-asserted (steps 305-306 of FIG. 3A). All subsequent store operations are held at the processor core (i.e., prevented from being issued to the store queue) until an entry becomes available in the store queue. An entry becomes available when the contents of that entry are dispatched to an RC machine and sent to the L2 cache for storage therein. A variety of different policies may be utilized to determine when cache lines are moved from the store queue to be stored in the L2 cache.
Current store queue designs provide only a very limited capacity (e.g., a maximum of 6-8 entries). With this limited capacity, current methods of suspending processor issuance of store operations to store queues when the full signal is asserted have several obvious limitations that lead to processor bottlenecks during store operations. For example, in conventional systems, the full signal is asserted as soon as the last available entry is assigned to an issued store operation. Thus, a single-byte store assigned to the last entry prevents the core from issuing additional store operations to the store queue.
There is no accounting in conventional systems for execution of code that provides multiple stores that may be gathered into that last entry or into a previously assigned entry. For example, with certain types of code, such as scientific codes, it is quite common for a stream of store operations to target (and be gathered into) a single entry. With conventional systems, however, the store queue immediately asserts the “full” signal when the first of a stream of store operations addressing the same cache line hits the store queue and is assigned to the last available entry of the store queue.
Since the store operations are generated and issued faster than the store queue can dispatch entries to the RC machines, a bottleneck occurs and the processor has to suspend issuance of other store operations even when those operations could possibly gather into an existing entry of the store queue. The core is made to wait until the store queue pops one or more of the entries before the core can resume issuing store operations.
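The cost of this behavior can be illustrated by counting the stores that are held at the core even though their target line already occupies an entry. The sketch below is a deliberately simplified model: the function name is hypothetical, no entries are popped during the stream, and every store arriving while the queue is full is treated as held.

```python
def count_needless_stalls(store_lines, num_entries):
    """Count held stores that could have gathered into an existing entry.

    store_lines: cache line identifiers targeted by successive stores.
    num_entries: number of entries in the store queue.
    Assumption: no entry is popped while the stream is being issued.
    """
    allocated = []  # lines that currently hold a store queue entry
    needless = 0
    for line in store_lines:
        full = len(allocated) >= num_entries
        if full:
            # Conventional behavior: the store is held at the core.
            if line in allocated:
                needless += 1  # it would have gathered, not allocated
            continue
        if line not in allocated:
            allocated.append(line)  # new entry assigned to this line
    return needless
```

With a four-entry queue, a stream that fills entries for lines 0 through 3 and then sends seven more stores to line 3 has all seven held in this model, although none of them needs a new entry.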
Because a regular store operation does not update an entire cache line, but only a fractional portion thereof, at least the last entry allocated in the store queue will hold only a fraction of the cache line data that could be gathered into it, even though the counter or ETL indicates that the store queue is full. The next store operation is not issued, and therefore cannot gather into one of the currently used entries, until one of the entries is popped from the store queue. In conventional systems, the core will simply stall and wait for the busy signal to be de-asserted or for some kind of handshake signal that tells the core the store queue is no longer full (e.g., when the store queue pops an entry to the L2 cache).
The present invention recognizes that it would be desirable to provide a method and system that would enable a processor to continue issuing store operations that may be gatherable into one of the entries of a full store queue. The invention further recognizes that it would be desirable to provide a method and system for speculatively issuing store operations to the store queue to remove the bottlenecks inherent with conventional systems which suspend issuance of store operations whenever the store queue is full. These and other benefits are provided by the invention described herein.