In a pipelined microprocessor design, a newer load operation may overlap an older store operation in memory. That is, one or more bytes of the store data are destined for a memory location that also lies within the source memory location specified by the load. FIG. 1 shows an example in which a 32-bit (dword) store operation to memory address 0x1234 is followed in program sequence by a 16-bit (halfword) load operation from memory address 0x1236. The store writes bytes 0x1234 through 0x1237 and the load reads bytes 0x1236 and 0x1237. Because at least one byte of the load's source memory location is also within the store's destination memory location, the two operations overlap.
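The overlap in the FIG. 1 example can be checked with a few lines of interval arithmetic. The following is only an illustrative software sketch, not a description of any hardware; the function names `byte_range` and `memops_overlap` are hypothetical and do not come from the text.

```python
def byte_range(addr: int, size: int) -> range:
    """Byte addresses touched by a memory operation of `size` bytes at `addr`."""
    return range(addr, addr + size)


def memops_overlap(store_addr: int, store_size: int,
                   load_addr: int, load_size: int) -> bool:
    """True if the load reads at least one byte that the store writes."""
    # Two half-open intervals [a, a+n) and [b, b+m) intersect exactly
    # when each one starts before the other one ends.
    return (store_addr < load_addr + load_size
            and load_addr < store_addr + store_size)


# FIG. 1 example: 32-bit store to 0x1234, then 16-bit load from 0x1236.
store_addr, store_size = 0x1234, 4
load_addr, load_size = 0x1236, 2

shared = sorted(set(byte_range(store_addr, store_size))
                & set(byte_range(load_addr, load_size)))

print(memops_overlap(store_addr, store_size, load_addr, load_size))  # True
print([hex(a) for a in shared])  # ['0x1236', '0x1237']
```

A halfword load from 0x1238 would not overlap the same store, since the store's last byte is at 0x1237.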
In some cases, the microprocessor is able to forward the store data directly to the load, which is referred to as a store forwarding operation and is generally advantageous to performance. However, in other cases, the load has to wait for the store to commit to memory before it can get the data, which is referred to as a load-store collision. For example, the load and the store may not both be to memory regions with a write-back cacheable memory trait, or the load may not be able to receive all of its data from the store or from a combination of the store and the cache. A load-store collision is generally disadvantageous to performance, but it must nevertheless be detected.
Typically, to determine load-store collisions, processors perform some type of address compare combined with byte overlap detection. The address compare is typically a cache line compare, or sometimes a cache index compare, but could be of finer granularity. The finer the granularity, the larger the address comparators required, which typically translates into more power consumed and more time required for the comparators to do the compare. Some implementations compare cache line addresses. In that case, the byte overlap detection must calculate whether any bytes of the load collide with any bytes of the store within the 64-byte cache line for which the address compare generates a match. (See FIG. 2 for an illustration of the 16 dwords of a 64-byte cache line, each dword containing 4 bytes aligned on a 4-byte address boundary.) As an example, the largest memory operation (i.e., load/store, also referred to herein as a memop) size of a microarchitecture may be 16 bytes, which is smaller than the size of a cache line. Conventionally, the byte overlap detection would be accomplished by generating or storing a 16-bit byte mask for each memop (each byte implicated by the load or store has its corresponding bit set), shifting the byte masks of the load and the store to their respective positions within the cache line, and then checking whether any bits of the two shifted byte masks overlap, which indicates that the load and the store specify at least one common byte position within the cache line. If so, a load-store collision condition exists.
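The conventional detection described above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not a hardware description: the names `byte_mask`, `line_mask` and `collides` are hypothetical, Python integers stand in for the 16-bit and 64-bit mask vectors, and each memop is assumed not to cross a cache line boundary.

```python
CACHE_LINE_BYTES = 64  # cache line size used in the text
MAX_MEMOP_BYTES = 16   # largest memop size used in the text


def byte_mask(size: int) -> int:
    """Unshifted byte mask: one set bit per byte the memop touches."""
    assert 1 <= size <= MAX_MEMOP_BYTES
    return (1 << size) - 1


def line_mask(addr: int, size: int) -> int:
    """Byte mask shifted to the memop's position within its cache line."""
    offset = addr % CACHE_LINE_BYTES
    # Assumption of this sketch: the memop does not cross a cache line.
    assert offset + size <= CACHE_LINE_BYTES
    return byte_mask(size) << offset


def collides(load_addr: int, load_size: int,
             store_addr: int, store_size: int) -> bool:
    """Cache line address compare, then a 64-bit byte mask overlap check."""
    same_line = (load_addr // CACHE_LINE_BYTES
                 == store_addr // CACHE_LINE_BYTES)
    overlap = line_mask(load_addr, load_size) & line_mask(store_addr, store_size)
    return same_line and overlap != 0


# FIG. 1 example: 4-byte store at 0x1234, 2-byte load at 0x1236.
print(collides(0x1236, 2, 0x1234, 4))  # True
```

For the FIG. 1 addresses, the store sits at byte offset 52 of its cache line and the load at offset 54, so the shifted masks share the bits for byte positions 54 and 55.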
The conventional byte overlap detection scheme would do one of the following for every memop: (1) store 64-bit byte masks that already have the 16-bit byte masks pre-shifted to their proper position within the cache line; (2) store 16-bit byte masks and shift them on the fly; or (3) generate 16-bit byte masks and shift them on the fly. Each of these conventional byte overlap detection schemes has its pluses and minuses, but they all have problems. Storing the pre-shifted byte masks (scheme 1) requires a relatively large amount of hardware. Generating and shifting the byte masks on the fly (scheme 3) introduces timing problems and can have negative power implications. Scheme 2 is a compromise. It still introduces timing problems because it requires shifting on the fly, although fewer than scheme 3 because it does not require generating the byte masks on the fly. It also still requires additional hardware to store the byte masks, although less than scheme 1. A potential 1-to-48 position shift followed by a 64-bit mask compare operation requires a significant amount of hardware and may be a timing issue. Generally speaking, dealing with a full 64-bit cache line vector is a problem.
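The three schemes differ in what is kept per memop and what is computed at compare time, not in the resulting 64-bit vector. The sketch below illustrates that trade-off only; the class and function names are hypothetical, and the assumption that schemes 2 and 3 have the byte offset (and, for scheme 3, the size) available alongside each entry is made for illustration and is not stated in the text.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64


@dataclass
class Scheme1Entry:
    line_mask: int  # 64-bit mask, already shifted into position


@dataclass
class Scheme2Entry:
    mask16: int     # 16-bit unshifted mask, stored
    offset: int     # byte offset within the cache line (assumed available)


@dataclass
class Scheme3Entry:
    size: int       # memop size in bytes (assumed available)
    offset: int     # byte offset within the cache line (assumed available)


def mask_scheme1(e: Scheme1Entry) -> int:
    return e.line_mask                        # no work at compare time


def mask_scheme2(e: Scheme2Entry) -> int:
    return e.mask16 << e.offset               # shift on the fly


def mask_scheme3(e: Scheme3Entry) -> int:
    return ((1 << e.size) - 1) << e.offset    # generate, then shift


def make_entries(addr: int, size: int):
    """Build the per-memop state each scheme would keep for one memop."""
    offset = addr % CACHE_LINE_BYTES
    mask16 = (1 << size) - 1
    return (Scheme1Entry(mask16 << offset),
            Scheme2Entry(mask16, offset),
            Scheme3Entry(size, offset))


s1, s2, s3 = make_entries(0x1234, 4)
print(mask_scheme1(s1) == mask_scheme2(s2) == mask_scheme3(s3))  # True
```

In this model, a 16-byte memop at byte offset 48 needs the largest shift, 48 positions, which fills the top 16 bits of the 64-bit vector.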