A processor (also commonly referred to as a central processing unit (CPU)) is a component in a computer that executes instructions of a program. In general, processor instruction execution may be broken down into three main tasks: 1) loading (or reading) data into registers from memory (e.g., a cache); 2) performing arithmetic operations on the data; and 3) storing (or writing) the results of the arithmetic operations to memory or input/output (I/O).
Of the tasks above, the first task, loading data into registers from memory (where data that is loaded from memory is referred to as a “load”), has the most impact on processor performance because the second task cannot begin until the first task is complete. The third task, storing results to memory or I/O (where data that is stored to memory is referred to as a “store”), is the most flexible as to the latency of its completion. Thus, when both a load and a store attempt to access a cache during the same processor execution cycle, the load is typically allowed access to the cache, while the store must wait for a next processor execution cycle. Accordingly, in a circumstance in which multiple loads need to access a cache, a store may have to wait a few processor execution cycles before being granted access to the cache. Stores are therefore typically held in a queue (commonly referred to as a “store queue”) while the stores wait for access to the cache.
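The load-over-store priority described above can be illustrated with a minimal sketch. It assumes a single cache port and a FIFO store queue, and the function name `arbitrate` and its parameters are hypothetical, introduced only for illustration.

```python
from collections import deque


def arbitrate(load_requests, store_arrivals):
    """Toy single-port cache arbiter (an illustrative assumption).

    load_requests[c]  -- True if a load wants the cache in cycle c
    store_arrivals[c] -- list of store names entering the store queue in cycle c
    Returns a dict mapping each store name to the cycle in which it
    was granted access to the cache.
    """
    store_queue = deque()
    written = {}
    for cycle, (load, arrivals) in enumerate(zip(load_requests, store_arrivals)):
        store_queue.extend(arrivals)
        # A load always wins the port; a store proceeds only in a load-free cycle.
        if not load and store_queue:
            written[store_queue.popleft()] = cycle
    return written


# Two stores arrive while loads occupy the cache for two cycles,
# so both stores are held in the queue until the loads are done.
grants = arbitrate([True, True, False, False], [["s0"], ["s1"], [], []])
```

In this sketch the stores are delayed by exactly the number of cycles in which a load claims the port, which is why a queue is needed to hold them.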
In processor designs, a store queue can be a FIFO (First In, First Out) queue or a non-FIFO queue. Non-FIFO store queues (also referred to as “out of order (OoO) store queues”) permit younger (newer) stores to be retired (i.e., data associated with the store is written into cache) before older stores are retired. Out of order store queues introduce additional complexity relative to FIFO store queues, but typically yield higher performance. For example, if the retirement of a particular store needs to be delayed for some reason, an out of order store queue may retire a younger store as long as there is no data ordering dependency between the delayed store and the younger store.
In some situations, two stores may be going to the same (cache) address and therefore the two stores must be retired in a particular order with respect to each other. Even so, the two stores may still be retired out of order relative to other stores in the store queue. In such an out-of-order case, the younger store sets a dependency vector bit to indicate a dependency with a corresponding older store. During each processor execution cycle, the younger store performs a reduction OR operation across its dependency vector bits; if any of the dependency vector bits is set (e.g., equal to 1), then the store must wait for a next processor execution cycle for retirement. In some situations, a particular store must wait for a plurality of older stores to retire before the store can be retired (e.g., a sync or a store that may be going to the same address as several other stores in the store queue). In such cases, the younger store sets a dependency vector bit for each older store that must be retired prior to the younger store. As the older stores are retired, the corresponding dependency vector bits are cleared, and when a reduction OR finds that no dependency vector bits are set, the store is eligible to be retired.
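The dependency vector mechanism described above can be sketched as follows. This is a simplified software model, not a hardware description: the class `StoreQueue`, its method names, and the use of an integer bitmask for each dependency vector are assumptions made for illustration, and only same-address dependencies are modeled (not sync operations).

```python
class StoreQueue:
    """Toy out-of-order store queue with one dependency vector per entry."""

    def __init__(self, size):
        self.size = size
        self.addr = [None] * size   # target address per entry; None means invalid
        self.dep = [0] * size       # dependency vector per entry, as a bitmask

    def allocate(self, addr):
        """Place a new (youngest) store in a free entry and return its index."""
        slot = self.addr.index(None)  # raises ValueError if the queue is full
        # Every valid entry is older than the new store, so set one bit
        # for each valid entry that targets the same address.
        self.dep[slot] = sum(1 << i for i, a in enumerate(self.addr) if a == addr)
        self.addr[slot] = addr
        return slot

    def can_retire(self, slot):
        """Reduction OR: the store is eligible only if no dependency bit is set."""
        return self.addr[slot] is not None and self.dep[slot] == 0

    def retire(self, slot):
        """Retire a store and clear its bit in every other dependency vector."""
        if not self.can_retire(slot):
            raise RuntimeError("store still has unresolved dependencies")
        self.addr[slot] = None
        for i in range(self.size):
            self.dep[i] &= ~(1 << slot)


stq = StoreQueue(4)
a = stq.allocate(0x100)   # oldest store
b = stq.allocate(0x200)   # unrelated address
c = stq.allocate(0x100)   # same address as a, so c depends on a
stq.retire(b)             # b may retire ahead of a: no dependency
blocked = stq.can_retire(c)
stq.retire(a)             # retiring a clears c's dependency bit
ready = stq.can_retire(c)
```

Here `b` retires out of order relative to `a`, while `c` is held until `a` has retired, matching the ordering rule for stores to the same address.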
As long as a store remains valid in the store queue (STQ), the store data typically has not yet been written to the cache. If the processor were to send a load request for any byte addresses that are valid in the STQ, then the load must not be allowed to satisfy its request from the cache. Although the cache may report a ‘hit’ for the line address targeted by the load, the data it contains is stale if the store queue has any bytes for that line; any data that may be found in the STQ is always newer than data in the cache. And so, when a load comes along, it typically performs address compares against the valid, older entries in the STQ to determine whether it may use the data that the cache contains or whether it must wait for a store to complete before it may satisfy its request.
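The address compare a load performs against the STQ can be sketched as a byte-range overlap check. The names `StoreEntry`, `overlaps`, and `load_hazards` are hypothetical, and the sketch assumes every entry passed in is older than the load; real designs may compare at a different granularity.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    addr: int           # first byte address written by the store
    size: int           # number of bytes written
    valid: bool = True  # False once the store has been written to the cache


def overlaps(a_addr, a_size, b_addr, b_size):
    """True if byte ranges [a_addr, a_addr+a_size) and [b_addr, b_addr+b_size) intersect."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


def load_hazards(stq, load_addr, load_size):
    """Return indices of valid STQ entries holding bytes the load wants.

    A non-empty result means the cache data is stale for this load,
    even if the cache reports a hit for the line.
    """
    return [
        i for i, s in enumerate(stq)
        if s.valid and overlaps(s.addr, s.size, load_addr, load_size)
    ]


stq = [
    StoreEntry(0x100, 8),               # covers bytes 0x100..0x107
    StoreEntry(0x200, 4),               # unrelated bytes
    StoreEntry(0x104, 4, valid=False),  # already retired, so ignored
]
hits = load_hazards(stq, 0x104, 4)
misses = load_hazards(stq, 0x300, 4)
```

An empty result means the load may use the data the cache contains; otherwise it must wait on the stores identified.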
There are various means used to detect and to track load-store ordering hazards. If the store queue (STQ) always retires (i.e., completes) stores in age order, the load queue (LDQ) may force every new load to wait for the most recent store in the STQ to complete by just remembering the most recently allocated STQ entry; when that entry is retired, any potential hazard the load would have had is guaranteed to have been resolved. However, this method penalizes all loads, not just loads that have an ordering hazard with a store.
Alternatively, a second method is for the LDQ to make a new load wait for the most recent store if the load has an ordering hazard with any store in the STQ. This allows better performance than the previously described method because only loads that have actual hazards need be delayed by the STQ. However, this method causes a load to wait longer than it may otherwise need to wait because it waits for the most recent store, even when its hazard is the oldest store in the STQ.
Alternatively, a third method is for the LDQ to wait for the youngest STQ entry that it has an ordering hazard with. This offers still better performance than the previously described methods. However, in the case of the load having an ordering hazard with multiple STQ entries, the hazard logic must endure the complexity of assigning a compare result priority based on the age of the STQ entry relative to the other entries in order for the load to know which STQ entry must retire before it may continue.
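The difference between the first three methods can be made concrete with a small sketch that, for an in-order STQ, picks the entry a load waits on. The function `wait_target` and its arguments are hypothetical; `hazards` is assumed to list one flag per valid STQ entry, oldest entry first.

```python
def wait_target(hazards, method):
    """Return the index of the STQ entry the load must wait for, or None.

    hazards -- list of bools, oldest STQ entry first; True marks an
               ordering hazard between the load and that entry
    method  -- 1, 2, or 3, following the numbering in the text
    """
    if not hazards:
        return None                  # empty STQ: nothing to wait for
    most_recent = len(hazards) - 1   # youngest entry in the STQ
    if method == 1:
        # Always wait for the most recent store, hazard or not.
        return most_recent
    if method == 2:
        # Wait for the most recent store only if some hazard exists.
        return most_recent if any(hazards) else None
    if method == 3:
        # Wait for the youngest entry that actually has a hazard;
        # this is the age-based priority selection the text mentions.
        hits = [i for i, h in enumerate(hazards) if h]
        return hits[-1] if hits else None
    raise ValueError("method must be 1, 2, or 3")


# The load's only hazard is with the oldest of four stores.
hazards = [True, False, False, False]
targets = [wait_target(hazards, m) for m in (1, 2, 3)]
```

With the hazard on the oldest entry, the first two methods make the load wait for the youngest of the four stores, while the third lets it proceed as soon as the oldest store retires.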
A fourth method is for the LDQ to continue retrying the load until it no longer detects the ordering hazard with the STQ. This offers reduced complexity versus the second and the third methods described above. However, this is not energy efficient because the loads keep cycling around until the ordering hazard resolves, and this may reduce the throughput of stores from the STQ because each time a load retries it potentially prevents a store from accessing the cache due to the higher relative priority usually assigned to loads versus stores.
If the STQ allows stores to different target addresses to retire out of order (OoO) with respect to each other, the LDQ's options for tracking load-versus-store ordering hazards are more limited. Because the LDQ does not know whether the youngest store in the STQ at the time of the load's arrival will be the last STQ entry to retire, the LDQ is not able to use any of the in-order STQ methods that rely on the most recent store to enter the STQ. The LDQ may still retry the load until it no longer detects the ordering hazard with the STQ, as in the fourth method described above, which keeps the hazard logic simple. However, the same drawbacks apply: this is not energy efficient because the loads keep cycling around until the ordering hazard resolves, and it may reduce the throughput of stores from the STQ because each time a load retries it potentially prevents a store from accessing the cache due to the higher relative priority usually assigned to loads versus stores.
Accordingly, what is needed is an improved method and system for processing data. The present invention addresses such a need.