Some applications executable by a processor, such as graph analytics, search operations, etc., may involve utilization of large data sets. Related instruction code for these applications may include instructions in the form of data-dependent load instructions. A data-dependent load instruction, as known in the art, is used to load data from an address which is dependent upon data loaded by a prior load instruction (keeping in mind that the prior load instruction need not necessarily be a different load instruction but may be a prior execution of the same data-dependent load instruction).
A data-dependent load instruction presents challenges which other forms of load instructions may not. For instance, for load instructions which load from addresses which are not dependent upon prior loads, the addresses may follow patterns among the load instructions in a code sequence, which enables predictive prefetching from the addresses based on determining strides among the patterns. However, for data-dependent load instructions, such pattern-based or stride-based prediction is not possible because the address from which to load data is itself dependent upon data loaded by a prior load instruction.
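The contrast between strided and data-dependent addresses can be illustrated with a minimal sketch in C. The helper name has_constant_stride and its interface are hypothetical, introduced here only for illustration. It models, in software, the kind of check a stride-based predictor relies upon, and is not a description of any particular prefetcher.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the source): returns 1 and writes the
 * stride to *stride if every pair of consecutive addresses in
 * addrs[0..n-1] is separated by the same distance; returns 0 otherwise.
 */
int has_constant_stride(const uintptr_t *addrs, size_t n, ptrdiff_t *stride)
{
    if (n < 2) {
        return 0; /* at least two addresses are needed to form a stride */
    }

    ptrdiff_t first = (ptrdiff_t)(addrs[1] - addrs[0]);

    for (size_t i = 2; i < n; i++) {
        ptrdiff_t delta = (ptrdiff_t)(addrs[i] - addrs[i - 1]);
        if (delta != first) {
            return 0; /* no single stride describes this sequence */
        }
    }

    *stride = first;
    return 1;
}
```

Under these assumptions, a sequence of addresses produced by walking an array of 8-byte elements (e.g., 0x1000, 0x1008, 0x1010) yields a stride of 8, from which the next address can be predicted. A sequence produced by following pointers through separately allocated nodes generally yields no such stride, since each address is only known once the prior load returns its data.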
FIG. 1 illustrates examples of data-dependent load instructions in instruction sequence 100 (which will be recognized by one skilled in the art to represent pointer-chasing code). In instruction sequence 100, two types of data-dependent load instructions are illustrated. Firstly, Load 2 is a load instruction for loading data from an address determined by register x5, wherein the content of register x5 is determined by a different load instruction, Load 1. In this instance, Load 1 is alternatively referred to as a parent or producer load instruction of the data-dependent load instruction Load 2. The sequence of the parent and data-dependent load instructions, Load 1 and Load 2, in instruction sequence 100 is referred to as an instruction slice, wherein executing the instruction slice is dependent upon the content of the register x5 being made available. Secondly, Load 1 is also a data-dependent load instruction. In this case, considering two successive iterations of the loop defined by the “while (ptr)” in instruction sequence 100, the data contained at an address pointed to by register x5 is loaded into register x5 in the execution of Load 1 in a first iteration of the loop; and in a successive, second iteration of the loop, the value of register x5 produced by the first iteration supplies the address from which Load 1 loads, which makes Load 1 of the first iteration a parent load instruction and Load 1 of the second iteration a corresponding data-dependent load instruction.
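By way of illustration only, a pointer-chasing loop of the kind described above may be sketched in C as follows. This sketch is not a reproduction of FIG. 1: the structure layout, the names node and sum_list, and the ordering of the two loads within the loop body are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical node layout; the link is placed first so that it sits at
 * the address held in the pointer itself. */
struct node {
    struct node *next; /* link read by the load playing the role of Load 1 */
    long value;        /* field read by the load playing the role of Load 2 */
};

/* Walks a linked list and sums its values. */
long sum_list(const struct node *ptr)
{
    long sum = 0;

    while (ptr) {
        /* Analogous to Load 2: the address depends on ptr, which (after
         * the first iteration) was produced by a prior load of ->next. */
        sum += ptr->value;

        /* Analogous to Load 1: loads the next pointer from the address
         * held in ptr and overwrites ptr with it, so this load in one
         * iteration is the parent of the same load in the next. */
        ptr = ptr->next;
    }

    return sum;
}
```

In this sketch, neither load in a given iteration can issue until the load of ptr->next from the preceding iteration has returned, which is the serialization that the instruction slice of Load 1 and Load 2 exhibits.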
In the above-noted example applications such as graph analytics and search operation workloads, wherein the above instruction slice may be executed by a processor having one or more caches in a memory hierarchy, it is seen that there is a high incidence of both loads (parent and dependent) in an instruction slice encountering a miss in one or more caches. A cache miss in a last-level cache (or “LLC”) of the memory hierarchy may incur high penalties. To explain, the last-level cache such as a level-3 (L3) cache may be integrated on the same chip as the processor and used to service misses, when possible, from higher level caches such as level-2 (L2) cache, level-1 (L1) cache, etc., which are in closer proximity to the processor. But a miss in the last-level cache may incur large penalties in latency, e.g., on the order of hundreds of cycles, because the miss must be forwarded to an external memory system or an off-chip memory such as a dynamic random access memory (DRAM), for example, to be serviced. Therefore, in the event of a last-level cache miss for the parent load instruction (e.g., for fetching the data at an address pointed to by register x5), any data-dependent load instructions (e.g., Load 2), as well as any dependents thereof, may be stalled until the parent load instruction is serviced by accessing the DRAM. While the parent load instruction is waiting to be serviced, the processor's execution pipeline may also get backed up with further instructions which are dependent on the parent load instruction or on dependent instructions thereof, which can lead to degradation in performance of the processor.
Accordingly, there is a need in the art for improving performance while avoiding the aforementioned drawbacks of conventional techniques in the processing of data-dependent load instructions.