A processor (also commonly referred to as a central processing unit (CPU)) is a component in a computer that executes instructions of a program. In general, processor instruction execution may be broken down into three main tasks: 1) loading (or reading) data into registers from memory (e.g., a cache); 2) performing arithmetic operations on the data; and 3) storing (or writing) the results of the arithmetic operations to memory or input/output (I/O).
Of the tasks above, the first task, loading data into registers from memory (where data that is loaded from memory is referred to as a “load”), has the most impact on processor performance, because the second task cannot begin until the first task is complete. The third task, storing results to memory or I/O (where data that is stored to memory is referred to as a “store”), is the most flexible as to the latency of its completion. Thus, when both a load and a store attempt to access a cache during the same processor execution cycle, the load is typically allowed access to the cache, while the store must wait for the next processor execution cycle. Accordingly, in a circumstance in which multiple loads need to access a cache, a store may have to wait a few processor execution cycles before being granted access to the cache. Stores are therefore typically held in a queue (commonly referred to as a “store queue”) while the stores wait for access to the cache.
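The load-over-store priority described above can be illustrated with a minimal software sketch. It is not a description of any particular processor: it assumes a single cache port that grants one access per cycle and, for simplicity, a FIFO store queue. The function name `simulate_cache_port` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque


def simulate_cache_port(load_requests, store_arrivals):
    """Model a cache port that grants one access per cycle (an assumption).

    load_requests[i]  -- True if a load wants the cache in cycle i.
    store_arrivals[i] -- list of store names that enter the queue in cycle i.
    Returns (granted, pending): who got the cache in each cycle, and the
    stores still waiting in the store queue after the last cycle.
    """
    store_queue = deque()  # stores wait here until the cache is free
    granted = []
    for has_load, new_stores in zip(load_requests, store_arrivals):
        store_queue.extend(new_stores)
        if has_load:
            # A load always wins the cycle; queued stores keep waiting.
            granted.append("load")
        elif store_queue:
            # No load this cycle, so the oldest waiting store is retired.
            granted.append(store_queue.popleft())
        else:
            granted.append(None)  # cache idle this cycle
    return granted, list(store_queue)


# Two stores arrive while loads occupy the cache for two cycles.
granted, pending = simulate_cache_port(
    [True, True, False, False],
    [["s0"], ["s1"], [], []],
)
```

In this run both stores sit in the queue for the first two cycles and are only granted the cache once no load is competing for it.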
In processor designs, a store queue can be a FIFO (First In, First Out) or a non-FIFO. Non-FIFO store queues (also referred to as “out-of-order (OoO) store queues”) permit younger (newer) stores to be retired (i.e., data associated with the store is written into cache) before older stores are retired. Out-of-order store queues introduce additional complexity relative to FIFO store queues, but typically yield higher performance. For example, if the retirement of a particular store needs to be delayed for some reason, an out-of-order store queue may retire a younger store as long as there is no data ordering dependency between the delayed store and the younger store.
In some situations, two stores may be going to the same (cache) address, and therefore the two stores must be retired in a particular order with respect to each other. This creates a store ordering hazard, which may introduce data integrity problems if a younger store going to a given address is retired before an older store going to the same address. The two stores may still be retired out of order relative to other stores in the store queue. In an out-of-order store queue, the younger store sets a dependency vector bit to indicate a dependency on the corresponding older store. During each processor execution cycle, the store performs a reduction OR operation across its dependency vector bits; if any of the dependency vector bits is set (e.g., equal to 1), then the store must wait for the next processor execution cycle for retirement. In some situations, a particular store must wait for a plurality of older stores to retire before the store can be retired, e.g., a sync, or a store that may be going to the same address as several other stores in the store queue. In such cases, the younger store sets a dependency vector bit for each older store that must be retired prior to the younger store. As the older stores are retired, the corresponding dependency vector bits are cleared, and when a reduction OR finds that no dependency vector bits are set, the store is eligible to be retired.
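The dependency vector mechanism described above can be sketched in software as follows. This is an illustrative model under stated assumptions, not a hardware implementation: the class `StoreQueue`, its methods, and the `sync` flag are hypothetical names, each dependency vector is held as an integer bit mask, and the reduction OR is modeled as a comparison of that mask against zero.

```python
class StoreQueue:
    """Toy out-of-order store queue with per-entry dependency vectors."""

    def __init__(self, depth):
        self.depth = depth
        self.valid = [False] * depth
        self.addr = [None] * depth
        # dep[i] has bit j set when entry i must wait for entry j to retire.
        self.dep = [0] * depth

    def allocate(self, addr, sync=False):
        """Insert a store and return its entry index.

        Every entry already valid is older than the new store, so a bit is
        set for each valid entry with the same address (or for every valid
        entry when the new store is a sync).
        """
        slot = self.valid.index(False)  # raises ValueError if the queue is full
        vector = 0
        for j in range(self.depth):
            if self.valid[j] and (sync or self.addr[j] == addr):
                vector |= 1 << j
        self.valid[slot] = True
        self.addr[slot] = addr
        self.dep[slot] = vector
        return slot

    def can_retire(self, slot):
        # Reduction OR: any set bit means an older store is still pending.
        return self.valid[slot] and self.dep[slot] == 0

    def retire(self, slot):
        if not self.can_retire(slot):
            raise ValueError("store still has unresolved dependencies")
        self.valid[slot] = False
        # Clear this entry's bit in every other entry's dependency vector.
        mask = ~(1 << slot)
        for i in range(self.depth):
            self.dep[i] &= mask


queue = StoreQueue(8)
older = queue.allocate(0x100)
other = queue.allocate(0x200)
younger = queue.allocate(0x100)  # same address as `older`

queue.retire(other)    # unrelated address: may retire out of order
queue.retire(older)    # clears the bit that `younger` was waiting on
queue.retire(younger)  # now eligible
```

Here the store to `0x200` retires ahead of the older store to `0x100`, while the younger store to `0x100` is held back until its single dependency bit is cleared.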
Each store queue entry typically includes a dependency vector field that includes dependency vector bits, which indicate dependencies of the store on other stores in the store queue. Each dependency vector bit corresponds to a particular entry in the store queue. Thus, for an 8-entry store queue, the dependency vectors together form an array of 8 entries by 8 bits (64 bits in total). While the dependency vectors may be manageable for an 8-entry store queue, as the depth of the store queue increases, the storage associated with the dependency vectors increases with the square of the number of entries in the store queue. Larger dependency vectors require a larger number of latches, and these latches consume area and power.
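The quadratic growth noted above can be made concrete with a short calculation. The sketch assumes one storage bit per dependency vector bit and one dependency vector per entry; the function name `dependency_vector_bits` and the queue depths other than 8 are illustrative choices, not figures taken from any specific design.

```python
def dependency_vector_bits(entries):
    """Total dependency-vector storage for a store queue of the given depth.

    Each of the `entries` stores holds one bit per queue entry, so the
    total is entries * entries (assuming one storage bit per vector bit).
    """
    return entries * entries


for depth in (8, 16, 32, 64):
    print(depth, dependency_vector_bits(depth))
```

Doubling the queue depth quadruples the dependency vector storage, which is the area and power concern raised above.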
Accordingly, what is needed is an improved method and system for processing data. The present invention addresses such a need.