The disclosed subject matter relates generally to processing systems and, more particularly, to promoting transactions that hit the critical beat of a pending cache line load request.
Processing systems utilize two basic memory access instructions or operations: a store instruction that writes information that is stored in a register into a memory location and a load instruction that loads information stored at a memory location into a register. High-performance out-of-order execution microprocessors can execute memory access instructions (loads and stores) out of program order. For example, a program code may include a series of memory access instructions including loads (L1, L2, . . . ) and stores (S1, S2, . . . ) that are to be executed in the order: S1, L1, S2, L2, . . . . However, the out-of-order processor may select the instructions in a different order such as L1, L2, S1, S2, . . . . Some instruction set architectures require strong ordering of memory operations (e.g., the x86 instruction set architecture). Generally, memory operations are strongly ordered if they appear to have occurred in the program order specified.
A typical computer system includes a memory hierarchy to obtain a relatively high level of performance at a relatively low cost. Instructions of different software programs are typically stored on a relatively large but slow non-volatile storage unit (e.g., a disk drive unit). When a user selects one of the programs for execution, the instructions of the selected program are copied into a main memory, and a processor (e.g., a central processing unit or CPU) obtains the instructions of the selected program from the main memory. Some portions of the data are also loaded into cache memories of the processor or processors in the system.
A cache is a smaller and faster memory that stores copies of instructions and/or data that are expected to be used relatively frequently. For example, central processing units (CPUs) are generally associated with a cache or a hierarchy of cache memory elements. Processors other than CPUs, such as, for example, graphics processing units (GPUs) and others, are also known to use caches.
Store and load instructions typically operate on memory locations in one or more caches associated with the processor. Load instructions may reference a memory location that is not in the cache. In the event of a cache miss, an entry is placed into a missed address buffer (MAB) and a cache line fill is requested. A typical cache line fill occurs over multiple clock cycles or beats. For example, a 64-byte cache line may be divided into 4 beats of 16 bytes. The beat containing the target of the load may be sent first (i.e., the beats may be sent out of order) so that the retrieved data may be forwarded to the load prior to the remaining beats being loaded or the cache line entry being written. After a load has allocated a MAB entry, it typically waits until the fill returns.
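By way of illustration only, the critical-beat-first ordering described above may be sketched as follows. The function name `beat_order`, its parameters, and the wrap-around ordering of the remaining beats are assumptions made for this example; the description above states only that the beat containing the target of the load may be sent first.

```python
def beat_order(target_address, line_bytes=64, beat_bytes=16):
    """Return the beat indices of a cache line fill, critical beat first.

    Hypothetical helper: the wrap-around order used for the remaining
    beats is an assumed policy, not one stated in the text.
    """
    num_beats = line_bytes // beat_bytes
    # Offset of the load target within its cache line.
    offset = target_address % line_bytes
    # The critical beat is the beat that contains the load target.
    critical = offset // beat_bytes
    # Send the critical beat first, then wrap around the rest of the line.
    return [(critical + i) % num_beats for i in range(num_beats)]


if __name__ == "__main__":
    # A load targeting byte offset 0x28 (40) of a 64-byte line falls in beat 2.
    print(beat_order(0x1028))
```

For a 64-byte line divided into 4 beats of 16 bytes, a load whose target lies at byte offset 40 within the line falls in beat 2, so in this sketch the beats would be sent in the order 2, 3, 0, 1.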
A system may employ prefetching to attempt to load cache lines into the cache before they are needed by a demand load. A prefetch load initiated by hardware or software may be used to facilitate the cache line fill. In response to the prefetch load missing the cache, an entry may be logged in the missed address buffer. In the meantime, other loads, such as demand loads, can execute and may reference the same cache line. For a load received after the load associated with the MAB entry, the MAB returns a “hit”, indicating that the cache line in question is already in the process of being filled. Such subsequent loads must wait for the cache line to be written to the cache before they can be executed, because only the load associated with the MAB entry is eligible for data forwarding.
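The MAB behavior described above may be modeled, in a simplified and purely illustrative manner, as follows. The class `MissAddressBuffer`, its method names, and the string return values are hypothetical and are not taken from any particular implementation.

```python
class MissAddressBuffer:
    """Toy model of a missed address buffer (MAB); all names are hypothetical."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        # Maps a cache line number to the load that allocated the entry.
        self.entries = {}

    def lookup(self, load_id, address):
        """Record a cache miss for load_id and report the MAB result."""
        line = address // self.line_bytes
        if line in self.entries:
            # A fill for this line is already pending; the load must wait.
            return "hit"
        # First miss on this line: allocate an entry and request the fill.
        self.entries[line] = load_id
        return "allocated"

    def eligible_for_forwarding(self, load_id, address):
        """Only the load that allocated the entry may receive forwarded data."""
        line = address // self.line_bytes
        return self.entries.get(line) == load_id


if __name__ == "__main__":
    mab = MissAddressBuffer()
    print(mab.lookup("prefetch", 0x1000))   # allocates the entry
    print(mab.lookup("demand", 0x1020))     # same line, so the MAB reports a hit
    print(mab.eligible_for_forwarding("demand", 0x1020))
```

In this sketch, the prefetch load allocates the entry, and the later demand load to the same 64-byte line receives a MAB hit but is not eligible for data forwarding.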
The goal of prefetching is to fill the cache line prior to a demand load targeting the cache line being serviced. If the cache line can be successfully prefetched, the latency for the later demand load can be reduced because the demand load will not see a cache miss. However, in some cases, the demand load is processed before the cache line fill for the prefetch load can be completed, so the demand load is queued behind the prefetch load. The demand load must wait for the cache line fill to complete prior to being serviced. If a prefetch had not been implemented, the demand load would have missed the cache, would have been associated with the resulting MAB entry, and would have been eligible for data forwarding for the critical beat. If the demand load is received shortly after the prefetch load, the latency seen by the demand load could be greater with prefetching than it would have been without prefetching.
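The latency comparison described above may be illustrated with a simple timing model. The model is a sketch under stated assumptions: the first (critical) beat arrives `mem_latency` cycles after the fill is requested, one further beat arrives per cycle, writing the line to the cache takes `write_cycles` cycles, and the cache hit latency is ignored. The function names and all cycle counts are hypothetical.

```python
def demand_latency_without_prefetch(mem_latency):
    """Demand load allocates the MAB entry and is forwarded the critical beat."""
    return mem_latency


def demand_latency_with_prefetch(gap, mem_latency, beats=4, write_cycles=1):
    """Demand load arrives `gap` cycles after the prefetch load to the same line.

    Assumed model: the demand load cannot be forwarded data, so it waits
    until every beat has arrived and the line has been written to the cache.
    """
    # Cycle (measured from the prefetch) at which the line is in the cache.
    fill_done = mem_latency + (beats - 1) + write_cycles
    # If the fill has already completed, the demand load simply hits the cache.
    return max(0, fill_done - gap)


if __name__ == "__main__":
    mem_latency = 100  # hypothetical cycle count
    for gap in (2, 50, 200):
        with_pf = demand_latency_with_prefetch(gap, mem_latency)
        without_pf = demand_latency_without_prefetch(mem_latency)
        print(gap, with_pf, without_pf)
```

Under these assumptions, with a hypothetical memory latency of 100 cycles, a demand load arriving 2 cycles after the prefetch load waits 102 cycles rather than 100, whereas a demand load arriving 50 cycles later waits only 54 cycles. In this model the prefetch is detrimental only when the gap is smaller than the time needed to receive the remaining beats and write the line.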
This section of this document is intended to introduce various aspects of art that may be related to various aspects of the disclosed subject matter described and/or claimed below. This section provides background information to facilitate a better understanding of the various aspects of the disclosed subject matter. It should be understood that the statements in this section of this document are to be read in this light, and not as admissions of prior art. The disclosed subject matter is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above.