Successful high-performance processor implementations require a high instruction completion rate. To achieve this goal, pipeline hazards must be minimized, allowing instructions to flow uninterrupted through the pipeline. Data hazards, an impediment to performance caused by instructions stalling while they wait for results from instructions that are still executing, can be minimized by reducing or tolerating functional unit latencies. A need exists for a pipeline optimization scheme that reduces the latency of load instructions, resulting in fewer data hazards and better program performance.
Many factors contribute to the latency of a load instruction. If a load hits in the data cache, the latency of the operation on many modern microprocessor architectures is 2 cycles: one cycle to compute the effective address of the load, and one cycle to access the data cache. If the load does not hit in the data cache, the latency is further increased by the delays incurred in accessing lower levels of the data memory hierarchy, e.g., when servicing cache misses or page faults.
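By way of illustration only, the latency components described above can be expressed as a simple model. The sketch below is not taken from any particular processor: the function names and the miss penalty passed in by the caller are assumptions chosen for the example, and only the 2-cycle hit latency (one cycle for address computation, one for the cache access) comes from the description above.

```python
# Illustrative model of load latency; all names here are hypothetical.
ADDRESS_CYCLES = 1        # compute the effective address
CACHE_ACCESS_CYCLES = 1   # access the data cache


def load_latency(hit: bool, miss_penalty_cycles: int = 0) -> int:
    """Latency in cycles of one load.

    A hit costs address computation plus one cache access. A miss
    additionally pays the (assumed) penalty for going to a lower
    level of the data memory hierarchy.
    """
    latency = ADDRESS_CYCLES + CACHE_ACCESS_CYCLES
    if not hit:
        latency += miss_penalty_cycles
    return latency


def average_load_latency(hit_rate: float, miss_penalty_cycles: int) -> float:
    """Expected load latency for a given data cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_cost = load_latency(True)
    miss_cost = load_latency(False, miss_penalty_cycles)
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost
```

Under this model even a load that always hits costs 2 cycles, so a dependent instruction issued immediately after the load must wait; misses only add to that baseline.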
Much has been done to reduce the performance impact of load latencies. The approaches can be broadly classified into two areas: techniques which assist programs in tolerating load latencies, and techniques which reduce load latencies. Tolerating load latencies involves moving independent instructions into unused pipeline delay slots. This reallocation of processor resources can be done either at compile-time, via instruction scheduling, or at run-time with some form of dynamic processor scheduling, such as decoupled, dataflow, or multi-threaded execution. For a given data memory hierarchy, a good approach to reducing load latencies is through better register allocation. Once a value has been placed into a register, load instructions are no longer required to access it.
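The effect of moving an independent instruction into a load delay slot can be illustrated with a small sketch. It assumes a hypothetical single-issue, in-order pipeline in which a load has a 2-cycle latency and other operations have a 1-cycle latency; the `Instr` type, the `count_stalls` function, and the register names are invented for the example and do not describe any particular processor or compiler.

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]
    latency: int  # cycles from issue until the result is available


def count_stalls(program: Sequence[Instr]) -> int:
    """Count stall cycles on an assumed single-issue, in-order pipeline."""
    ready = {}   # register -> cycle at which its value becomes available
    cycle = 0    # earliest cycle at which the next instruction may issue
    stalls = 0
    for ins in program:
        operands_ready = max((ready.get(r, 0) for r in ins.srcs), default=0)
        issue = max(cycle, operands_ready)
        stalls += issue - cycle          # cycles spent waiting on operands
        ready[ins.dest] = issue + ins.latency
        cycle = issue + 1
    return stalls


load = Instr("load", "r1", ("r2",), 2)
add = Instr("add", "r3", ("r1", "r4"), 1)   # depends on the load result
sub = Instr("sub", "r5", ("r6", "r7"), 1)   # independent of the load

original = [load, add, sub]    # add waits one cycle for r1
scheduled = [load, sub, add]   # sub fills the load delay slot

print(count_stalls(original), count_stalls(scheduled))
```

In this model the original order incurs one stall cycle and the scheduled order incurs none. The benefit depends entirely on an independent instruction such as `sub` being available to fill the slot.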
There are, however, limits to the extent to which existing approaches can reduce the impact of load latencies. Tolerating techniques require independent work, which is finite and usually quite small in the basic blocks of control-intensive codes, e.g., many integer codes. The current trend toward wider processor issue further amplifies the impact of load latencies, because exploiting instruction-level parallelism decreases the amount of work between load instructions. In addition, tolerating these latencies becomes more difficult, since more independent instructions are required to fill pipeline delay slots. Global scheduling techniques have been developed as a way to mitigate this effect. Latency reduction techniques are also limited in their potential use. Latency reduction in the form of register allocation is limited by the size and addressability of register files, forcing many program variables into memory.
A solution to the performance impact of load latencies is needed which additionally avoids the problems occurring in the prior art approaches. The present invention provides a solution to these and other problems, offering advantages over the prior art.