The invention herein is related to design of microprocessors, and in particular, to techniques for addressing cache loads waiting prior to cache write backs.
As microprocessor design continues to advance, new problems arise. Consider, for example, an existing (prior art) architecture, aspects of which are depicted in FIG. 1. First, note that FIG. 1 is a simplified depiction for purposes of discussion and does not take into account that each microprocessor 1 may include pluralities of various components.
The microprocessor 1 typically includes components such as one or more arithmetic logic units 2, one or more caches 5, and a plurality of registers 6. Typically, the cache 5 provides an interface with random access memory 11. Of course, different configurations of other components are typically included in the microarchitecture, but are not shown here for simplicity.
Most processors run programs by loading an instruction from memory 11 and decoding the instruction; loading associated data from registers or memory 11 that is needed to process the instruction; processing the instruction; and storing any associated results in registers or memory 11. Complicating this series of steps is the fact that access to the memory 11, which includes the cache 5, main memory (i.e., random access memory 11) and other memory such as non-volatile storage like hard disks (not shown), involves a lengthy delay (in terms of processing time).
One technique to improve performance is the use of “pipelining.” Pipelines improve performance by allowing a number of instructions to work their way through the microprocessor at the same time. For example, if each of the four previously mentioned steps of running programs is implemented as a pipeline cycle, then microprocessor 1 would start to decode (step 1) a new instruction while the last instruction waits for results to continue. This would allow up to four instructions to be “in flight” at one time, making the microprocessor 1 appear to be up to four times as fast. Although any one instruction takes just as long to complete (there are still four steps), the microprocessor 1 as a whole “retires” instructions much faster and can be run at a much higher clock speed than in prior designs.
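By way of illustration only, the "up to four times as fast" figure can be checked with a short calculation. The sketch below assumes an idealized pipeline in which each of the four steps takes exactly one cycle and no instruction ever stalls; the function names are hypothetical and are not taken from any referenced design.

```python
def cycles_unpipelined(n_instructions, n_steps=4):
    """Each instruction runs all of its steps before the next one starts."""
    return n_instructions * n_steps


def cycles_pipelined(n_instructions, n_steps=4):
    """Idealized pipeline: one step per cycle, no stalls (an assumption).

    The first instruction needs n_steps cycles to complete; every later
    instruction completes one cycle after the one before it.
    """
    if n_instructions == 0:
        return 0
    return n_steps + (n_instructions - 1)


n = 100
speedup = cycles_unpipelined(n) / cycles_pipelined(n)
print(f"{n} instructions: {cycles_unpipelined(n)} cycles unpipelined, "
      f"{cycles_pipelined(n)} cycles pipelined, speedup {speedup:.2f}x")
```

Under these assumptions the speedup approaches, but never reaches, four as the number of instructions grows, which is why the text says "up to" four times. Stalls such as those discussed next reduce it further.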
Unfortunately, in a pipelined microprocessor 1, a special condition exists. This condition is commonly referred to as “load-hit-store” (and also known as “operand store compare”). In load-hit-store, a load (also referred to herein as a “fetch” or as a “read”) from memory 11 (step 2 above) designates an address in memory that is the same as an address designated by a store (also referred to herein as an “update” or a “write”) to memory 11.
In load-hit-store, the most recent value intended for storing in an address location is not available for use in the load. That is, the data required for the load may not yet be stored at the address in the memory 11 or in the cache 5, and may still be in progress elsewhere in the microprocessor 1.
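The load-hit-store condition described above can be sketched, for illustration only, as a test on byte addresses. The sketch assumes that each access is described by a starting address and a length in bytes, and that a flag records whether the store has already been written back; these names and this representation are hypothetical.

```python
def byte_addresses(address, length):
    """Return the set of byte addresses touched by an access."""
    return set(range(address, address + length))


def is_load_hit_store(load_addr, load_len, store_addr, store_len,
                      store_written_back):
    """True when the load needs bytes from an older store whose data has
    not yet reached the cache or memory."""
    if store_written_back:
        # The cache already holds the stored value, so there is no conflict.
        return False
    shared = (byte_addresses(load_addr, load_len)
              & byte_addresses(store_addr, store_len))
    return bool(shared)


# A 4-byte store to address 0x100 that is still in flight, followed by a
# 4-byte load from the same address.
print(is_load_hit_store(0x100, 4, 0x100, 4, store_written_back=False))
```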
Previous designs have attempted to minimize delays due to load-hit-store conflicts by using store forwarding mechanisms to allow loads to reference store data result values before they are written into the cache 5. Thus, such designs attempt to solve this problem without requiring loads to wait for either the cache 5 or memory 11 to be written before they execute. Consider three examples provided below that relate to store forwarding or load-hit-store handling.
A first example is provided in U.S. Pat. No. 6,678,807, entitled “System and method for multiple store buffer forwarding in a system with a restrictive memory model” and issued on Jan. 13, 2004. This patent discloses use of multiple buffers for store forwarding in a microprocessor system with a restrictive memory model. In an embodiment, the system and method allow load operations that are completely covered by two or more store operations to receive data via store buffer forwarding in such a manner as to retain the side effects of the restrictive memory model thereby increasing microprocessor performance without violating the restrictive memory model.
A further example is that of U.S. Pat. No. 6,393,536, entitled “Load/store unit employing last-in-buffer indication for rapid load-hit-store,” and issued on May 21, 2002. This patent discloses a load/store unit that includes a buffer configured to retain store memory operations which have probed the data cache. Each entry in the buffer includes a last-in-buffer (LIB) indication which identifies whether or not the store in that entry is the youngest store in the buffer to update the memory locations specified by the corresponding store address. Load addresses are compared to the store addresses, and the comparison result is qualified with the corresponding LIB indication such that only the youngest store is identified as a hit. At most one load hit store is detected.
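The last-in-buffer idea summarized above can be illustrated with a simplified software model. This is a sketch only, not the circuitry of the cited patent. It assumes that stores and loads are matched on an exact address, and the class and method names are hypothetical.

```python
class StoreBuffer:
    """Simplified model of a store buffer with a last-in-buffer (LIB) flag."""

    def __init__(self):
        self.entries = []  # oldest entry first

    def add_store(self, address, data):
        # An older store to the same address is no longer the youngest,
        # so its LIB flag is cleared before the new entry is added.
        for entry in self.entries:
            if entry["address"] == address:
                entry["lib"] = False
        self.entries.append({"address": address, "data": data, "lib": True})

    def lookup_load(self, address):
        # The address compare is qualified with the LIB flag, so at most
        # one entry can report a hit.
        hits = [e for e in self.entries
                if e["address"] == address and e["lib"]]
        return hits[0]["data"] if hits else None


buffer = StoreBuffer()
buffer.add_store(0x100, "old value")
buffer.add_store(0x100, "new value")
buffer.add_store(0x200, "other value")
print(buffer.lookup_load(0x100))
```

Clearing the flag at insertion time means that the load lookup does not need to rank multiple matching entries by age.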
The third example is provided in U.S. Pat. No. 6,581,151, entitled “Apparatus and method for speculatively forwarding storehit data based on physical page index compare,” and issued on Jun. 17, 2003. This patent describes a speculative store forwarding apparatus in a pipelined microprocessor that supports paged virtual memory. The apparatus includes comparators that compare only the physical page index of load data with the physical page indexes of store data pending in store buffers to detect a potential store-hit. If the indexes match, forwarding logic speculatively forwards the newest store-hit data based on the index compare. The index compare is performed in parallel with a TLB lookup of the virtual page number of the load data, which produces a load physical page address. The load physical page address is compared with the store data physical page addresses to verify that the speculatively forwarded store-hit data is in the same page as the load data. If the physical page addresses mismatch, the apparatus stalls the pipeline in order to correct the erroneous speculative forward. The microprocessor stalls until the correct data is fetched.
Prior solutions to load-hit-store conflicts using store forwarding have had difficulties with certain types of overlap between the load memory areas and store memory areas. The exemplary patents above either describe restrictions on the memory area overlap between loads and stores for allowing store forwarding, do not mention these restrictions, or do not attempt to address solutions for avoiding these restrictions at all. The following example demonstrates a load-hit-store memory overlap condition that prior art store forwarding designs cannot or did not attempt to resolve with store forwarding.
Suppose there is a store A instruction that stores to 4 bytes in address locations 0, 1, 2, and 3. This store A instruction is followed closely by a load B instruction that loads 4 bytes from address locations 2, 3, 4, and 5. (Note that address location 5 is not to be confused with the reference numeral used to designate the cache 5). If the store A has not yet updated the cache 5 or memory 11 at the time that load B requires the data, then there is a load-hit-store condition. This particular load-hit-store condition only exists for address locations 2 and 3. Locations 0 and 1 stored to by store A are not needed by load B. Also, the 2 bytes loaded by load B in address locations 4 and 5 are not stored to by store A. Not only does store A not store to addresses 4 and 5; in previous designs the structures holding the data for store A would have no record of the values of locations 4 and 5. So, for load B to get all of the bytes it needs, it must get locations 2 and 3 from store A using store forwarding, and locations 4 and 5 from somewhere else (usually this would be the cache 5). In the prior art, this type of “partial overlap” between store A and load B is a violation of the restrictive memory model used, and store forwarding is not allowed because there is no mechanism to determine which pieces of data should be forwarded from the store and which pieces of data need to be obtained from the cache 5. A mechanism to effectively forward parts of load data from different sources does not currently exist. The existing or prior art restrictive memory model assumes that either all data is forwarded from a single store structure or no data is forwarded and all data is accessed normally from the cache. Any case of store and load overlap that cannot be resolved by either of these two methods will result in a load-hit-store penalty (the load must wait for the previous stores on which it depends to write their data into the cache).
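The store A and load B example above can be worked through byte by byte. The following sketch is illustrative only: it shows which bytes of the load would have to come from which source, and how the overlap would be classified. It is not a forwarding mechanism, and the function names and the three category labels are hypothetical.

```python
def split_load_sources(load_addr, load_len, store_addr, store_len):
    """Split the load's byte addresses into those covered by the pending
    store and those that must come from elsewhere (usually the cache)."""
    store_bytes = set(range(store_addr, store_addr + store_len))
    load_bytes = range(load_addr, load_addr + load_len)
    from_store = [a for a in load_bytes if a in store_bytes]
    from_cache = [a for a in load_bytes if a not in store_bytes]
    return from_store, from_cache


def classify_overlap(load_addr, load_len, store_addr, store_len):
    """Label the overlap between a load and a single pending store."""
    from_store, from_cache = split_load_sources(
        load_addr, load_len, store_addr, store_len)
    if not from_store:
        return "no overlap"       # all data read normally from the cache
    if not from_cache:
        return "full cover"       # all data could come from the one store
    return "partial overlap"      # neither all-store nor all-cache


# Store A writes locations 0-3; load B reads locations 2-5.
print(split_load_sources(2, 4, 0, 4))
print(classify_overlap(2, 4, 0, 4))
```

Only the first two categories fit the all-or-nothing assumption of the restrictive memory model described above; the third is the case that incurs the load-hit-store penalty.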
An additional example of a violation of prior art restrictive memory models would be store A to locations 0, 1, 2, and 3 followed by a store B to locations 2 and 3, followed by a load C to locations 0, 1, 2, and 3. If stores A and B have not yet updated the cache 5 at the time load C needs to load its data from the cache, there is a load-hit-store condition. Though store A does cover the exact same locations as load C it would be incorrect to forward all the bytes from store A since store B is more recent than store A, so locations 2 and 3 should be forwarded from store B while locations 0 and 1 are forwarded from store A. Prior art solutions would be able to handle the condition where there is a store A and load C without store B, but having store B in the middle violates the standard restrictive memory model used for store forwarding. As a result, the load must take a load-hit-store penalty. In order to avoid strict memory area overlap based restrictions on store forwarding, a new solution is required.
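The store A, store B, and load C example above can likewise be traced per byte. The sketch below is illustrative only and assumes that pending stores are listed oldest first as hypothetical (name, address, length) tuples. It reports which source holds the most recent value of each byte the load needs, without modeling how hardware would forward it.

```python
def youngest_source_per_byte(load_addr, load_len, pending_stores):
    """Map each byte address of the load to the youngest pending store
    that writes it, or to "cache" if no pending store writes it.

    pending_stores is ordered oldest first, so a later store simply
    overrides an earlier one for the bytes they share.
    """
    sources = {}
    for address in range(load_addr, load_addr + load_len):
        sources[address] = "cache"
        for name, store_addr, store_len in pending_stores:
            if store_addr <= address < store_addr + store_len:
                sources[address] = name
    return sources


# Store A writes locations 0-3, then store B writes locations 2-3,
# then load C reads locations 0-3.
pending = [("A", 0, 4), ("B", 2, 2)]
print(youngest_source_per_byte(0, 4, pending))
```

In this trace the load needs data from two different stores, which is the situation the single-source restrictive memory model cannot express.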
What are needed are solutions to overcome situations where the most recently updated value for an address location from a respective store is not available for the load to use, including cases where store data only partially overlaps with load data, and cases where multiple stores may partially overlap with the load data and partially overlap with each other.