1. Field of the Invention
This invention is related to the field of processors and, more particularly, to forwarding of data from a store buffer for a dependent load.
2. Description of the Related Art
Processors typically employ a buffer for storing store memory operations which have been executed (e.g. have generated a store address and may have store data) but which are still speculative and thus not ready to be committed to memory (or a data cache employed by the processor). As used herein, the term “memory operation” refers to an operation which specifies a transfer of data between a processor and memory (although the transfer may be accomplished in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Load memory operations may be referred to herein more succinctly as “loads”, and similarly store memory operations may be referred to as “stores”. Memory operations may be implicit within an instruction which directly accesses a memory operand to perform its defined function (e.g. arithmetic, logic, etc.), or may be an explicit instruction which performs the data transfer only, depending upon the instruction set employed by the processor. Generally, memory operations specify the affected memory location via an address generated from one or more operands of the memory operation. This address will be referred to herein as a “data address” generally, or a load address (when the corresponding memory operation is a load) or a store address (when the corresponding memory operation is a store). On the other hand, addresses which locate the instructions themselves within memory are referred to as “instruction addresses”.
Since stores may be queued in the buffer when subsequent loads are executed, the processor typically checks the buffer to determine if a store is queued therein which updates one or more bytes read by the load (i.e. to determine if the load is dependent on the store or “hits” the store). Generally, the load address is compared to the store address to determine if the load hits the store. If a hit is detected, the store data may be forwarded in place of cache data for the load. Thus, it is desirable to detect the hit in no more time than is needed to access data from the cache.
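The conventional check described above can be illustrated with a minimal Python sketch. It models the buffer as a list and compares full addresses, testing whether any byte written by a queued store is read by the load. The names `QueuedStore`, `bytes_overlap`, and `find_hit`, and the choice to hold the list in program order and search it youngest-first, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueuedStore:
    """Hypothetical model of one queued store."""
    address: int  # full store address of the first byte written
    data: bytes   # store data; len(data) is the store size in bytes


def bytes_overlap(store: QueuedStore, load_address: int, load_size: int) -> bool:
    """True if the store updates at least one byte read by the load."""
    store_end = store.address + len(store.data)
    load_end = load_address + load_size
    return store.address < load_end and load_address < store_end


def find_hit(buffer: List[QueuedStore], load_address: int,
             load_size: int) -> Optional[QueuedStore]:
    """Return the youngest queued store that the load hits, if any.

    Assumption: the list is in program order, oldest store first.
    """
    for store in reversed(buffer):
        if bytes_overlap(store, load_address, load_size):
            return store
    return None
```

In this model every comparison involves the full address width, which is the cost that the remainder of this description seeks to reduce.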
Minimizing the load latency (e.g. the time from executing a load to being able to use the data read by the load) is key to performance in many processors. Unfortunately, comparing addresses may be a time-consuming activity since the addresses may include a relatively large number of bits (e.g. 32 bits; addresses wider than 32 bits, up to 64 bits, are becoming common). Thus, reducing the amount of time required to determine if loads hit stores in the buffer may result in increased performance of the processor, since this reduction may reduce the load latency. Alternatively, meeting the timing constraints for a given cycle time and given load latency may be eased if the amount of time used to compare the addresses is reduced.
The use of virtual addressing and address translation may create an additional problem for reducing the amount of time elapsing during a check of the load address against store addresses in the buffer. When virtual addressing is used, the data address generated by executing loads and stores is a virtual address which is translated (e.g. through a paging translation scheme) to a physical address. Multiple virtual addresses may correspond to a given physical address (referred to as “aliasing”) and thus physical data addresses of loads and stores are compared to ensure accurate forwarding (or the lack thereof) from the buffer. Unfortunately, the physical address of the load is typically generated from a translation lookaside buffer (TLB) and thus is often not available until the cache access is nearly complete, further worsening the problem of detecting hits on the stores in the buffer in rapid but accurate fashion.
The problems outlined above are in large part solved by an apparatus for forwarding store data for loads as described herein. The apparatus includes a buffer configured to store information corresponding to store memory operations and circuitry to detect a load which hits one of the stores represented in the buffer. More particularly, the circuitry may compare the index portion of the load address to the index portions of the store addresses stored in the buffer. If the indexes match and both the load and the store are a hit in the data cache, then the load and store are accessing the same cache line. If one or more bytes within the cache line are updated by the store and read by the load, then the store data is forwarded for the load. Advantageously, the relatively small compare of the load and store indexes may be completed rapidly. Additionally, since most (if not all) of the index is typically physical (untranslated) bits, the comparison may be performed prior to the load address being translated without significantly impacting the accuracy of the compare.
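To make the point about untranslated index bits concrete, the following Python sketch extracts the index portion of an address and counts how many index bits fall within the page offset. The cache geometry (64-byte lines, 128 sets) and the 4096-byte page size are hypothetical values chosen for illustration, as are the names `index_of` and `untranslated_index_bits`.

```python
# Hypothetical geometry, chosen only for illustration.
LINE_BYTES = 64     # bytes per cache line
NUM_SETS = 128      # number of sets (rows) selected by the index
PAGE_BYTES = 4096   # page size used by address translation

OFFSET_BITS = LINE_BYTES.bit_length() - 1       # low bits selecting a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1          # bits selecting the set
PAGE_OFFSET_BITS = PAGE_BYTES.bit_length() - 1  # low bits unchanged by translation


def index_of(address: int) -> int:
    """Return the index portion of a data address."""
    return (address >> OFFSET_BITS) & (NUM_SETS - 1)


def untranslated_index_bits() -> int:
    """Count the index bits that lie inside the page offset."""
    index_top = OFFSET_BITS + INDEX_BITS  # first bit position above the index
    return max(0, min(index_top, PAGE_OFFSET_BITS) - OFFSET_BITS)
```

With these assumed values the index is 7 bits wide and 6 of those bits are untranslated, so only one index bit depends on the translation. A smaller cache or larger page would make the entire index untranslated.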
In one embodiment, the circuitry speculatively forwards data if the load and store indexes match and the store is a hit in the data cache. Subsequently, when the load is determined to hit/miss in the cache, the forwarding is verified using the load's hit/miss indication. In set associative embodiments, the way in which the load hits is compared to the way in which the store hits to further verify the correctness of the forwarding.
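The two-step behavior of this embodiment can be sketched in Python as a speculative decision followed by a later verification. The names `StoreEntry`, `speculative_forward`, and `verify_forward` are hypothetical, and the check that the store actually updates bytes read by the load is omitted for brevity. The sketch does not model what the processor does when verification fails.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    """Hypothetical model of one buffer entry in a set associative design."""
    index: int   # index portion of the store address
    hit: bool    # whether the store hit in the data cache
    way: int     # way in which the store hit (meaningful only if hit is True)
    data: bytes  # store data


def speculative_forward(entry: StoreEntry, load_index: int) -> bool:
    """Early decision: the indexes match and the store is a cache hit."""
    return entry.index == load_index and entry.hit


def verify_forward(entry: StoreEntry, load_hit: bool, load_way: int) -> bool:
    """Later check, once the load's hit/miss indication and way are known."""
    return load_hit and load_way == entry.way
```

The early decision needs only the load index, which is available before the load's hit/miss result, while the verification uses information that arrives later in the cache access.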
Broadly speaking, an apparatus is contemplated. The apparatus comprises a buffer and circuitry coupled to the buffer. The buffer includes a plurality of entries, wherein each of the plurality of entries is configured to store: (i) at least an index portion of a store address of a store memory operation, (ii) a hit indication indicative of whether or not the store memory operation hits in a data cache, and (iii) store data corresponding to the store memory operation. The circuitry is coupled to receive: (i) the index portion of a load address of a load memory operation probing the data cache, and (ii) a load hit signal indicative of whether or not the load memory operation hits in the data cache. The circuitry is configured to cause the store data to be forwarded from a first entry of the plurality of entries responsive to the index portion stored in the first entry matching the index portion of the load address and further responsive to the hit indication in the first entry indicating hit and the load hit signal indicating hit.
Additionally, a processor is contemplated comprising a data cache and a load/store unit coupled to the data cache. The load/store unit includes a buffer including a plurality of entries, wherein each of the plurality of entries is configured to store: (i) at least an index portion of a store address of a store memory operation, (ii) a hit indication indicative of whether or not the store memory operation hits in the data cache, and (iii) store data corresponding to the store memory operation. The load/store unit is configured to probe the data cache with a load address and to receive a hit signal in response thereto from the data cache. Additionally, the load/store unit is configured to determine that store data is to be forwarded from a first entry of the plurality of entries responsive to an index portion of the load address matching the index portion stored in the first entry and further responsive to the hit indication in the first entry indicating hit and the hit signal indicating hit.
Moreover, a method is contemplated. A data cache is probed with a load address. An index portion of the load address is compared to an index portion of a store address stored in a buffer. Store data corresponding to the store address is forwarded for a load memory operation corresponding to the load address. The forwarding is responsive to the comparing determining that the index portion of the load address matches the index portion of the store address and further responsive to both the load address and the store address hitting in the data cache.
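The contemplated method can be walked through with a small Python model. This is a sketch under stated assumptions, not the claimed apparatus: the data cache is modeled as a toy direct-mapped cache holding only tags, the geometry constants are hypothetical, and the names `ToyDirectMappedCache`, `BufferEntry`, `queue_store`, and `forward_for_load` are invented for illustration. The check that the store updates bytes actually read by the load is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

OFFSET_BITS = 6  # hypothetical: 64-byte cache lines
INDEX_BITS = 7   # hypothetical: 128 sets


def index_of(address: int) -> int:
    return (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)


def tag_of(address: int) -> int:
    return address >> (OFFSET_BITS + INDEX_BITS)


class ToyDirectMappedCache:
    """Tag-only model of a direct-mapped data cache (assumption)."""

    def __init__(self) -> None:
        self.tags: Dict[int, int] = {}

    def fill(self, address: int) -> None:
        self.tags[index_of(address)] = tag_of(address)

    def probe(self, address: int) -> bool:
        """Return True if the address hits in the cache."""
        return self.tags.get(index_of(address)) == tag_of(address)


@dataclass
class BufferEntry:
    index: int   # index portion of the store address
    hit: bool    # hit indication recorded when the store probed the cache
    data: bytes  # store data


def queue_store(cache: ToyDirectMappedCache, buffer: List[BufferEntry],
                address: int, data: bytes) -> None:
    buffer.append(BufferEntry(index_of(address), cache.probe(address), data))


def forward_for_load(cache: ToyDirectMappedCache, buffer: List[BufferEntry],
                     load_address: int) -> Optional[bytes]:
    """Probe the cache, compare indexes, and forward only if both hit."""
    load_hit = cache.probe(load_address)
    load_index = index_of(load_address)
    for entry in reversed(buffer):  # youngest first (assumption)
        if entry.index == load_index and entry.hit and load_hit:
            return entry.data
    return None
```

In this direct-mapped model, a load and a store that share an index and both hit must refer to the same cache line, which is why the index comparison alone suffices once both hit indications are known.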