Intermediate buffers, such as store data buffers (SDBs), have been included between an execution unit, such as a processor, and a cache. These intermediate buffers have been used to temporarily store data until the cache is ready to accept the data. Overall performance has been increased by making the store execution of data independent of the data cache access, because this eliminates the latency associated with continuously toggling between read and write operations to the cache. Data from executed store operations stays in the intermediate buffer until the data cache is ready to accept the data for writing. The corresponding store addresses have been temporarily stored in an intermediate Store Address Buffer (SAB) in parallel with the intermediate SDB.
Later loads may depend (in full or in part) on data previously written to the intermediate SDB which has not yet been written to the cache. As a result, every load operation checked the intermediate buffers (SAB and/or SDB) to determine whether the intermediate buffer contained data needed by the load. If the specific data had been updated multiple times in the intermediate buffers, multiple entries in the intermediate buffers (SAB and/or SDB) would have been returned.
Existing systems used a Loosenet Blocking Check followed by Carry Chain and then Finenet Check algorithms to identify these dependencies in an intermediate buffer. If these checks indicated that the current load depended on data updated in the intermediate buffer, the data would be forwarded from the intermediate buffer to the current load. The Loosenet Blocking Check examined a load's untranslated address and data size to see if any older data store in the intermediate buffer had modified the data requested by the load. A loosenet block occurs when one or more older data store entries in the intermediate buffer have an address conflict or other blocking condition that blocks the load. All SAB entries independently determine if they have a blocking condition (or loosenet hit) for each address that is to be checked against the load. Index bits of the linear address identified the cache line for the loosenet check. The offset bits indicated the starting point of the access within the line. The offset together with the data size was used to calculate the end point of the load/store access. Byte resolution was used to determine whether a matched load overlaps, underlaps, or exactly matches the intermediate buffer SDB entry.
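The loosenet comparison described above can be illustrated with a minimal sketch. The bit widths (6 offset bits, 6 index bits), the function names, and the result labels are assumptions chosen for illustration and do not come from any particular design; accesses that cross a cache line are ignored here.

```python
# Assumed address layout (hypothetical): 64-byte lines, 64 line indices.
OFFSET_BITS = 6
INDEX_BITS = 6


def split_address(linear_addr):
    """Split a linear address into (tag, index, offset) fields."""
    offset = linear_addr & ((1 << OFFSET_BITS) - 1)
    index = (linear_addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = linear_addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def loosenet_check(load_addr, load_size, store_addr, store_size):
    """Compare one load against one buffered store at byte resolution.

    Only the index and offset bits are used; the tag is deliberately
    ignored, so a hit here still has to be confirmed by a later tag compare.
    """
    _, load_index, load_start = split_address(load_addr)
    _, store_index, store_start = split_address(store_addr)
    if load_index != store_index:
        return "miss"
    # End points are derived from the offset plus the data size.
    load_end = load_start + load_size
    store_end = store_start + store_size
    if load_end <= store_start or store_end <= load_start:
        return "miss"
    if load_start == store_start and load_end == store_end:
        return "exact"
    if store_start <= load_start and load_end <= store_end:
        return "load_within_store"
    return "partial"
```

Because the tag bits are not compared, two addresses that share index and offset bits but differ in their tags (for example 0x1000 and 0x2000 under the assumed layout) still report a hit in this sketch.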
As stated previously, the Loosenet Blocking Check was done for each load against every intermediate buffer SAB entry. The resulting loosenet hit vector was then processed through a Carry Chain algorithm to identify the most recently stored data in the intermediate buffer that overlapped with the load. Time stamps were used in the Carry Chain algorithm to locate the most recently stored data. The intermediate buffer SAB was searched for the first loosenet hit entry in reverse chronological order based on the time stamps. For simplicity, a loosenet hit bit was computed for each entry of the intermediate buffer SAB, but the load only needed to be ordered against data stored on the same thread. The loosenet hit bits on opposite threads were masked out before searching for the loosenet hit entry.
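The selection step can be modeled in software as follows. This is a behavioral sketch only: it uses a plain linear scan rather than an actual carry-chain circuit, and the SabEntry fields, the function name, and the rule that only stores with a time stamp older than the load are considered are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SabEntry:
    """Hypothetical view of one SAB entry after the loosenet check."""
    timestamp: int       # larger value means stored more recently
    thread: int          # thread that issued the store
    loosenet_hit: bool   # hit bit computed for every entry


def most_recent_hit(entries: Sequence[SabEntry],
                    load_thread: int,
                    load_timestamp: int) -> Optional[int]:
    """Return the index of the youngest older same-thread hit, or None."""
    best = None
    for i, entry in enumerate(entries):
        # Mask out hit bits that belong to the opposite thread.
        if not entry.loosenet_hit or entry.thread != load_thread:
            continue
        # Assumed: only stores older than the load can block it.
        if entry.timestamp >= load_timestamp:
            continue
        if best is None or entry.timestamp > entries[best].timestamp:
            best = i
    return best
```

Selecting by time stamp rather than by buffer position is a simplification of this sketch; it says nothing about how a hardware carry chain would order the entries.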
If the Loosenet and Carry Chain algorithms did not find a blocking store condition for the load, the load received the data directly from the data cache. Otherwise, the linear address of the most recently stored overlapping data (as determined by the Loosenet and Carry Chain algorithms) was read from the intermediate buffer SAB and the tag bits were compared against the tag bits of the load's linear address. If the tags matched, a finenet hit was found. If the load was a subset of this finenet hit, the corresponding data in the intermediate buffer SDB that produced the finenet hit was forwarded to the load. If the load was not a subset, the load had to be stalled until the blocking condition was resolved.
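The decision flow in this paragraph can be summarized in a short sketch. The function name, the tuple layout, and the outcome labels are hypothetical. The text does not state what happened when the tags did not match, so the sketch reports that case as its own outcome instead of guessing.

```python
from typing import Optional, Tuple

# Hypothetical record for the selected store: (tag, start offset, size in bytes).
Store = Tuple[int, int, int]


def resolve_load(store: Optional[Store],
                 load_tag: int,
                 load_start: int,
                 load_size: int) -> str:
    """Decide how a load is serviced once the most recent hit is known."""
    if store is None:
        # No blocking store was found: read directly from the data cache.
        return "cache"
    tag, start, size = store
    if tag != load_tag:
        # Behavior on a tag mismatch is not described in the text.
        return "tag_mismatch"
    # Finenet hit: forward only if the load is a subset of the stored bytes.
    load_end = load_start + load_size
    if start <= load_start and load_end <= start + size:
        return "forward"
    return "stall"
```

For example, a 4-byte load at offset 4 is a subset of an 8-byte store at offset 0 and is forwarded, while a 4-byte load at offset 6 extends past the stored bytes and stalls.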
The Carry Chain algorithm was optimized for machines that process data out of order, and the algorithm is not easily adaptable to in-order, stall-on-use machines having transactional memory. For example, in in-order, stall-on-use machines, the intermediate buffers SAB and SDB could be filled in order and then de-allocated out of order, provided that the machine supports transactional memory. Although out-of-order de-allocation frees up more memory so that additional data can be stored in the intermediate buffer, it also creates bubbles in the intermediate buffers, which would cause wrong prioritization results when applying the Carry Chain algorithm. Thus, the Carry Chain algorithm is not compatible with out-of-order de-allocation.
Not only is the Carry Chain algorithm incompatible with out-of-order de-allocation, but it is also resource intensive: in current state-of-the-art processors, processing the Carry Chain algorithm has taken up to one-fifth of the complete load loop. Designs with shorter load loops are limited by the timing constraints associated with the Carry Chain algorithm. Finally, these existing algorithms do not support partial data store forwarding, where a first part of the requested data is taken from an intermediate buffer SDB entry and a second part is taken from the data cache when the requested data only partially overlaps with the corresponding data stored in the intermediate buffer. In these instances, the machine must be stalled until this condition is resolved, resulting in additional processing delays.
There is a need for more efficient algorithms that enable additional data to be stored in intermediate buffers, require fewer resources, and support partial forwarding.