Microprocessor performance may be increased within a computer system by enabling load operations to be satisfied from fast-access memory resources, such as cache, before resorting to computer system memory resources, such as Dynamic Random Access Memory (DRAM), which may require more time to access. Data or instructions stored within DRAM are typically organized along page boundaries, which requires extra “open” and “close” memory cycles when they are accessed. Data and/or instructions may also be stored within cache memory, such as a Level 2 (L2) cache memory, in order to facilitate faster access to frequently used data.
Memory resources, such as DRAM and L2 cache, may be included as part of a computer system's memory hierarchy, in which data or instructions may be stored according to the frequency of their use. Data or instructions may then be accessed from or stored to these memory resources in various proportions in order to satisfy load and store operations efficiently.
In the case of a load operation, the decision of which memory resource to access within the system memory hierarchy depends upon where the most current version of the addressed data or instruction is located at a particular time. For example, a particular memory location addressed by a load operation may not hold the “freshest” data at a particular time, since prior store operations that have not yet written their data to the memory location may still be pending. Therefore, until the pending store operation updates the memory location addressed by the load operation, the load operation may access “stale” data, causing incorrect results or errors in program operation.
Instead of waiting for fresh data to be stored within the computer system's memory hierarchy, load operations may be satisfied by accessing one or more store buffers, in which store operations are temporarily held before they are executed by a processor and subsequently write their data to a location within the computer system's memory hierarchy. By accessing a store operation from a store buffer, the load operation may be satisfied and program operation may continue with correct data.
However, load operations may depend on multiple store operations. Therefore, a load operation must be able to obtain data from the most recent (youngest) store operation that has been issued to a store buffer before the issuance of the load operation (i.e., the youngest store that is older than the load). Determining which store a load ultimately depends upon may require a large amount of hardware and several bus cycles to complete.
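The selection rule described above (the youngest store that is older than the load and addresses the same location) can be sketched in software. The following is only an illustrative model, not a description of any particular hardware: the `StoreEntry` fields, the use of a program-order sequence number `seq`, and the function name `forward_from_store_buffer` are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int   # hypothetical program-order number; smaller means older
    addr: int  # memory address targeted by the store
    data: int  # data the store will eventually write


def forward_from_store_buffer(
    store_buffer: List[StoreEntry], load_seq: int, load_addr: int
) -> Optional[int]:
    """Return the data of the youngest matching store older than the load.

    Returns None when no pending store matches, in which case the load
    would have to be satisfied from the memory hierarchy instead.
    """
    best: Optional[StoreEntry] = None
    for entry in store_buffer:
        # Only stores to the same address that precede the load qualify.
        if entry.addr == load_addr and entry.seq < load_seq:
            # Among qualifying stores, keep the youngest one.
            if best is None or entry.seq > best.seq:
                best = entry
    return None if best is None else best.data


buffer = [
    StoreEntry(seq=1, addr=0x100, data=11),
    StoreEntry(seq=3, addr=0x100, data=33),
    StoreEntry(seq=4, addr=0x200, data=44),
    StoreEntry(seq=7, addr=0x100, data=77),  # younger than the load below
]
result = forward_from_store_buffer(buffer, load_seq=5, load_addr=0x100)
```

In this example the load with sequence number 5 depends on two older stores to address 0x100; the model selects the store with sequence number 3 and ignores the store with sequence number 7 because it is younger than the load.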
A prior art technique for determining which store a load ultimately depends upon uses a Carry Chain Algorithm (CCA) (FIGS. 1a and 1b) to perform a store prioritization. The CCA may be implemented with a carry look-ahead circuit similar to that used in a high-performance adder circuit. Furthermore, a CCA may be able to perform the store prioritization in O(log N) levels of logic, where N is the number of store buffer entries in a particular store buffer.
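To illustrate how a prioritization can complete in a number of logic levels that grows with log N, the following sketch models a generic parallel-prefix (look-ahead style) network in software. It is not the circuit of FIGS. 1a and 1b; the function name `prioritize`, the convention that index 0 is the entry nearest the load, and the use of a prefix-OR are assumptions made for this example.

```python
from typing import List, Tuple


def prioritize(match: List[bool]) -> Tuple[List[bool], int]:
    """Select the first matching entry using a log-depth prefix-OR.

    match[0] is assumed to be the entry nearest the load (the youngest
    store older than the load); higher indices are progressively older.
    Returns a one-hot select vector and the number of prefix levels used.
    """
    n = len(match)
    prefix = list(match)  # will become prefix[i] = OR of match[0..i]
    levels = 0
    dist = 1
    while dist < n:
        # Each level combines every position with the one `dist` below it,
        # doubling the span covered by each prefix bit.
        prefix = [
            prefix[i] or (prefix[i - dist] if i >= dist else False)
            for i in range(n)
        ]
        dist *= 2
        levels += 1
    # An entry is selected if it matches and nothing nearer the load matched.
    select = [
        match[i] and not (prefix[i - 1] if i > 0 else False)
        for i in range(n)
    ]
    return select, levels


matches = [False, False, True, False, True, False, False, True]
select, levels = prioritize(matches)
```

For the eight-entry example, only index 2 is selected and three prefix levels are used. In this model the number of levels depends only on N, not on where the matching entry lies.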
However, one shortcoming of the CCA is that its worst-case time to complete a store prioritization is approximately equal to its best-case time. This is because the carry bits are propagated through the adder in parallel with the sum bits. While this may be acceptable for some sizes of store buffers, it can be detrimental to overall system performance as the store buffer size is increased. It is generally desirable to increase the size of store buffers within a superscalar microprocessor to the extent that it is economically viable to do so. Increasing the size of store buffers within a microprocessor reduces the number of cases in which a load must resort to system memory to retrieve data, and therefore decreases the cycle time overhead associated with accessing system memory.
Advantageously, store operations used by many software applications may be stored in close proximity to one another in a store buffer, since they are often executed as part of a modular software function within the application. Therefore, the youngest store operation upon which a load operation depends may be stored in close proximity to the first store that is older than the load operation (i.e., load color). In this case, searching all store buffer entries for the youngest store operation upon which a load operation depends, as in the prior art, can be an inefficient use of processor resources.
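The locality argument above can be illustrated by counting how many entries are examined when a search begins at the load color and proceeds toward older stores. The sketch below is a simplified software model under stated assumptions: the store buffer is treated as a circular array in which older entries lie at decreasing indices, and the names `search_from_load_color`, `load_color`, and `tail` (the index of the oldest pending store) are hypothetical.

```python
from typing import List, Optional, Tuple


def search_from_load_color(
    addrs: List[int], load_color: int, tail: int, load_addr: int
) -> Tuple[Optional[int], int]:
    """Walk a circular store buffer from the load color toward the tail.

    Returns the index of the first (youngest) matching store, or None,
    together with the number of entries examined along the way.
    """
    n = len(addrs)
    idx = load_color
    examined = 0
    while True:
        examined += 1
        if addrs[idx] == load_addr:
            return idx, examined  # first hit is the youngest older store
        if idx == tail:
            return None, examined  # reached the oldest pending store
        idx = (idx - 1) % n  # step to the next-older entry, wrapping around


# Index 5 holds a store younger than the load, so it is never examined.
addrs = [0x40, 0x80, 0x100, 0x100, 0x200, 0x300, 0x80, 0x10]
near = search_from_load_color(addrs, load_color=4, tail=6, load_addr=0x100)
far = search_from_load_color(addrs, load_color=4, tail=6, load_addr=0x10)
miss = search_from_load_color(addrs, load_color=4, tail=6, load_addr=0x300)
```

When the matching store lies close to the load color, as in the first query, only two entries are examined; the second query examines six entries, and the miss examines all seven entries older than the load.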