One of the most critical paths in high-performance processors is the load-to-use path, defined as the time from computing the address of a load to writing the loaded data back to registers. Any increase in the number of cycles in the load-to-use path directly reduces the performance of the processor, measured in “instructions per cycle.”
The critical components of the load-to-use path in a conventional Out-Of-Order (“OOO”) processor for any given load instruction typically comprise the following:
1. Computing the address of the load.
2. Looking up a fully-associative translation lookaside buffer (“TLB”) and a set-associative Level-1 Data Cache (“L1 D-cache”).
3. Disambiguating the load address against the addresses of older stores in the Load Store Queue (“LSQ”), which may result in forwarding data from the youngest store older than the load.
4. Merging data from the L1 D-cache and LSQ if the load hits partially in the LSQ.
5. Writing back the loaded data to a register file or bypassing it in order to execute an instruction on the fly.
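Steps 3 and 4 above can be illustrated with a minimal software sketch. It is not a description of any particular processor: the names `StoreEntry`, `seq`, and `load_bytes` are hypothetical, the L1 D-cache is modeled as a plain dictionary from byte address to value, and the byte-by-byte loop stands in for address comparisons that hardware would perform in parallel.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    """Hypothetical LSQ entry for a store that has not yet written the cache."""
    seq: int      # program-order sequence number (assumption: larger = younger)
    addr: int     # starting byte address of the store
    data: bytes   # bytes the store will write


def load_bytes(load_seq, addr, size, lsq, dcache):
    """Return (data, sources) for a load of `size` bytes at `addr`.

    Each byte comes from the youngest store older than the load that
    covers it (step 3), or from the cache otherwise (step 4, merging).
    """
    # Only stores older than the load may forward; check youngest first.
    older = sorted((s for s in lsq if s.seq < load_seq),
                   key=lambda s: s.seq, reverse=True)
    result = bytearray(size)
    sources = []
    for i in range(size):
        byte_addr = addr + i
        for store in older:
            if store.addr <= byte_addr < store.addr + len(store.data):
                result[i] = store.data[byte_addr - store.addr]
                sources.append("lsq")
                break
        else:
            # No older store covers this byte: take it from the cache model.
            result[i] = dcache.get(byte_addr, 0)
            sources.append("cache")
    return bytes(result), sources


# Illustrative values only: a 4-byte load that hits partially in the LSQ.
dcache = {0x100: 0xAA, 0x101: 0xBB, 0x102: 0xCC, 0x103: 0xDD}
lsq = [
    StoreEntry(seq=1, addr=0x100, data=b"\x11\x22"),
    StoreEntry(seq=2, addr=0x100, data=b"\x33"),
    StoreEntry(seq=5, addr=0x102, data=b"\x99"),  # younger than the load, ignored
]
data, sources = load_bytes(load_seq=3, addr=0x100, size=4, lsq=lsq, dcache=dcache)
```

In this example the first two bytes are forwarded from two different older stores and the last two come from the cache model, which is the partial-hit case that makes the merge in step 4 necessary.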
Because every load must pass through the stages listed above, the load-to-use path in conventional processors is highly time-intensive and can result in undesirable latencies.