Microprocessor performance may be increased within a computer system by enabling load operations to be satisfied from fast-access memory resources, such as cache, before resorting to computer system memory resources, such as Dynamic Random Access Memory (DRAM), which may require more time to access. Data or instructions stored within DRAM are typically organized along page boundaries, requiring extra “open” and “close” memory cycles when accessed. Data or instructions may also be stored within cache memory, such as a Level 2 (L2) cache memory, in order to facilitate faster access to frequently used data.
Memory resources, such as DRAM and L2 cache, may be included as part of a computer system's memory hierarchy, in which data or instructions may be stored according to the frequency of their use. Data or instructions may then be accessed from or stored to these memory resources in various proportions in order to satisfy load and store operations efficiently.
In the case of a load operation, the decision of which memory resource to access within the system memory hierarchy depends upon where the most current version of the addressed data or instruction is located at a particular time. For example, a particular memory location addressed by a load operation may not have the “freshest” data at a particular time, since prior store operations that have not yet written their data to the memory location may still be pending. Therefore, until the store operation updates the memory location addressed by the load operation, the load operation may access “stale” data, causing incorrect results or errors in program operation.
Instead of waiting for fresh data to be stored within the computer system's memory hierarchy, load operations may be satisfied by accessing one or more store buffers in which store operations are temporarily stored before being executed by a processor and subsequently writing their data to a location within the computer system's memory hierarchy. By accessing a store operation from a store buffer, the load operation may be satisfied and program operation may continue with correct data.
However, load operations may depend on multiple store operations. Therefore, a load operation must be able to obtain data from the most recent (youngest) store operation that has been issued to a store buffer before the issuance of the load operation (i.e., the youngest store that is older than the load). Determining which store a load ultimately depends upon may require a large amount of hardware and several bus cycles to complete.
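The selection rule described above (the youngest store that is older than the load) can be sketched in software as a backward scan of a circular store buffer. This is only an illustrative model under stated assumptions, not the hardware described here: the names `StoreEntry`, `forward_from_store_buffer`, `load_color` (taken here to be the slot the next store would have occupied when the load issued), and `older_count` (the number of stores older than the load) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    """One pending store (hypothetical layout for illustration)."""
    address: int
    data: int


def forward_from_store_buffer(
    buffer: List[Optional[StoreEntry]],
    load_color: int,
    older_count: int,
    load_address: int,
) -> Optional[int]:
    """Return data from the youngest store older than the load, or None.

    Assumption: stores are allocated at increasing slot numbers (modulo the
    buffer size), so the stores older than the load occupy the `older_count`
    slots immediately below `load_color`, youngest first.
    """
    n = len(buffer)
    for step in range(1, older_count + 1):
        entry = buffer[(load_color - step) % n]  # wraps from slot 0 to n-1
        if entry is not None and entry.address == load_address:
            return entry.data  # first hit is the youngest matching store
    return None  # no older store matches; the load must go to cache/memory


# Example: an 8-slot buffer whose stores were allocated in slots 6, 7, 0, 1.
buf: List[Optional[StoreEntry]] = [None] * 8
buf[6] = StoreEntry(0x10, 1)
buf[7] = StoreEntry(0x20, 2)
buf[0] = StoreEntry(0x10, 3)
buf[1] = StoreEntry(0x10, 4)  # allocated after the load below, so ignored

# A load issued after the store in slot 0 sees three older stores.
result = forward_from_store_buffer(buf, load_color=1, older_count=3,
                                   load_address=0x10)
```

In this example the load to address 0x10 depends on two older stores (slots 6 and 0); the scan returns the data of slot 0 because it is the younger of the two.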
A prior art technique of determining which store a load ultimately depends upon employs a Carry Chain Algorithm (CCA) to perform a store prioritization, as illustrated in FIGS. 1a and 1b. The CCA in FIGS. 1a and 1b can be used to search an entire 64-entry store buffer and indicate which store buffer entry group should be read out to the read port, based on the location of the youngest store upon which a load depends. The CCA may be implemented with a carry look-ahead circuit similar to that used in a high-performance adder circuit. Furthermore, a CCA may be able to perform the store prioritization in on the order of log N levels of logic, i.e., O(log N), where N is the number of store buffer entries in a particular store buffer.
The CCA-64 of FIGS. 1a and 1b is composed of a level of 4-bit carry look-ahead (CLA-4) blocks 101 that compute propagate (P) and generate (G) signals, which are inputs to the next CLA-4 level 105. The P and G signals travel up the tree until the top “special wrap” level is reached, at which point the P and G signals are used to compute carry (C) bits. The carries then propagate down the tree, with each CLA-4 level computing additional carries. All of the carries are available when the bottom of the 64-bit CCA tree is reached.
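For orientation, the textbook form of a 4-bit carry look-ahead block in an adder is given below. This is the conventional adder formulation, offered as an assumption about the general shape of such a block; it is not a reproduction of equations 115 of FIGS. 1a and 1b, whose bit ordering and wrap handling may differ. Here $p_i$, $g_i$ are the per-bit propagate and generate signals, $c_0$ is the carry into the block, and $P$, $G$ are the group signals passed up the tree.

```latex
\begin{aligned}
c_{i+1} &= g_i + p_i\,c_i, \qquad i = 0,\dots,3 \\
P &= p_3\,p_2\,p_1\,p_0 \\
G &= g_3 + p_3\,g_2 + p_3\,p_2\,g_1 + p_3\,p_2\,p_1\,g_0 \\
c_4 &= G + P\,c_0
\end{aligned}
```

Because each block reduces four positions to one $(P, G)$ pair, a tree built from such blocks needs on the order of $\log_4 N$ levels in each direction.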
The logic used in each CLA-4 block and the wrap block is described by the equations 115 of FIGS. 1a and 1b, where P corresponds to non-matching CAM vector entry positions, G indicates the load color position within a CAM vector, and C indicates the CAM vector position of a matching target address of the youngest store operation older than a load operation being processed.
The special wrap logic 110 is similar to that used in the CLA-4 blocks, with a modification to allow carries to wrap around the end of the CCA and provide the “carry-in” at position 63. This is to allow a search to proceed around the end of a circular store buffer, such as a circular fast store-forwarding buffer.
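The roles of P, G, and C, including the wrap-around, can be illustrated with a sequential behavioral model. This is a sketch under stated assumptions rather than the tree-structured circuit of FIGS. 1a and 1b: it ripples a single carry one position at a time, whereas the hardware computes the same result with look-ahead logic. The function name `carry_chain_select` is hypothetical, the carry is assumed to travel toward lower positions and wrap from position 0 to the highest position, and the match vector is assumed to be already masked so that only stores older than the load can match.

```python
from typing import List, Optional


def carry_chain_select(match: List[bool], load_color: int) -> Optional[int]:
    """Return the position where the carry chain stops, or None.

    match[i]   -- True if the CAM found the load's address in entry i
                  (assumed masked to stores older than the load).
    load_color -- position at which the carry is generated.
    """
    n = len(match)
    propagate = [not m for m in match]              # P: non-matching entries
    generate = [i == load_color for i in range(n)]  # G: load color position

    carry = False
    # Visit load_color first, then decreasing positions, wrapping 0 -> n-1.
    for step in range(n):
        i = (load_color - step) % n
        if carry and match[i]:
            return i  # C: carry arrives at a matching entry and stops here
        carry = generate[i] or (propagate[i] and carry)
    return None  # the carry found no matching entry


# Example: matches in entries 0 and 6, carry generated at position 1.
example_match = [True, False, False, False, False, False, True, False]
selected = carry_chain_select(example_match, load_color=1)
```

With the carry generated at position 1, entry 0 is reached first and selected. If the carry were instead generated at position 0 and only entry 6 matched, the carry would wrap from position 0 to position 7 before stopping at entry 6.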
In the prior art, one carry look-ahead CCA was used to perform an ordered search on all store buffer entries. However, one shortcoming of this approach is that the CCA's worst-case time to complete a store prioritization is approximately equal to its best-case time. This is because the carry bits are propagated through the adder in parallel with the sum bits. While this may be acceptable for some sizes of store buffers, it can be detrimental to overall system performance as the store buffer size is increased.
It is generally desirable to increase the size of store buffers within a superscalar microprocessor to the extent that it is economically viable to do so. Increasing the size of store buffers within a microprocessor reduces the number of cases in which a load must resort to system memory to retrieve data, and therefore decreases the cycle time overhead associated with accessing system memory.
Another concern arises when the desired data is ultimately identified and read out of a store buffer entry to be used by a load operation. Identifying and subsequently reading data from the store buffer entry can, in some cases, gate other pending operations along a microprocessor's critical path. The prior art is, therefore, further limited by the size of store buffers that may be searched due to the amount of time necessary to service a load on the microprocessor's critical path.