In a conventional non-pipelined data processor, each instruction executes to completion before the next instruction commences. To improve the efficiency of machine operations and increase overall performance, conventional data processor designs employ pipelining. Pipelined data processors are capable of executing several instructions concurrently by overlapping sub-operations of successive instructions. Optimally, a pipelined processor fetches one new instruction and completes execution of another instruction every clock cycle. Thus, although the actual execution time required for complex instructions varies, the overall instruction execution rate may approach one instruction per clock cycle. As a result, the use of pipelined processors dramatically improves the overall performance of the data processor.
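The approach toward one instruction per clock cycle can be illustrated with a short cycle-count calculation. The following is a minimal sketch under idealized assumptions that are not taken from the text above: every stage takes exactly one clock cycle, no stalls occur, and the stage count (five in the test values) is arbitrary. The function names are hypothetical.

```python
def nonpipelined_cycles(num_instructions: int, stages: int) -> int:
    """Cycles needed when each instruction runs to completion before the next starts."""
    return num_instructions * stages


def pipelined_cycles(num_instructions: int, stages: int) -> int:
    """Cycles needed in an ideal, stall-free pipeline (assumption: one cycle per stage).

    The first instruction takes `stages` cycles to fill the pipeline; each
    subsequent instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


def instructions_per_cycle(num_instructions: int, stages: int) -> float:
    """Average execution rate of the ideal pipeline."""
    return num_instructions / pipelined_cycles(num_instructions, stages)
```

Under these assumptions the fill time of the pipeline is amortized over the instruction stream, so the average rate approaches, but never exceeds, one instruction per clock cycle as the stream grows longer.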
In order to achieve single-cycle instruction execution, an instruction prefetch unit (IPU) must maintain an instruction stream capable of loading the instruction pipeline with the requisite number of instruction words every clock cycle. If the IPU fails to maintain the required instruction stream, and the instruction pipeline is not loaded with the requisite number of instruction words, a pipeline stall may occur. Generally, today's high-performance pipelined data processors employ an instruction cache to provide the IPU with rapid access to instruction data (operands). Typically, the instruction cache is maintained by a cache controller, which operates in concert with the IPU to retrieve (prefetch) instructions and keep the instruction buffer (queue) loaded. Accordingly, when the processor requests an instruction prefetch, the cache controller receives the prefetch request and determines whether the requested instruction is resident in the instruction cache. If the requested instruction is resident in the cache, a prefetch "hit" occurs, and the cache controller loads the instruction buffer directly from the instruction cache. If the requested instruction is not resident in the cache, a prefetch "miss" occurs, and the cache controller requests a bus transfer to retrieve the required cache line from external memory.
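The hit and miss handling described above can be sketched as a small software model. This is an illustrative sketch only, not a description of any particular controller: the class names, the dictionary used to hold resident lines, the 16-byte line size, and the two-byte prefetch width are all assumptions made for the example.

```python
LINE_SIZE = 16  # assumed cache line size in bytes


class ExternalMemory:
    """Hypothetical external memory that counts bus transfers."""

    def __init__(self, data: bytes):
        self.data = data
        self.transfers = 0

    def read_line(self, line_addr: int) -> bytes:
        # One bus transfer returns one whole cache line.
        self.transfers += 1
        return self.data[line_addr:line_addr + LINE_SIZE]


class CacheController:
    """Hypothetical controller: checks the instruction cache, fills it on a miss."""

    def __init__(self, memory: ExternalMemory):
        self.memory = memory
        self.lines = {}  # line address -> cached line contents
        self.hits = 0
        self.misses = 0

    def prefetch(self, addr: int, length: int = 2) -> bytes:
        line_addr = addr & ~(LINE_SIZE - 1)  # align down to the line boundary
        offset = addr - line_addr
        line = self.lines.get(line_addr)
        if line is None:
            # Prefetch "miss": request a bus transfer for the cache line.
            self.misses += 1
            line = self.memory.read_line(line_addr)
            self.lines[line_addr] = line
        else:
            # Prefetch "hit": serve the request directly from the cache.
            self.hits += 1
        return line[offset:offset + length]
```

In this model the first prefetch that touches a line costs one bus transfer, and later prefetches within the same line are served from the cache without further external accesses.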
Known cache controllers use a burst mode transfer to transfer a cache line (e.g., 16 bytes) in a single memory access. Typically, in the burst mode, only the starting address of the 16 bytes is transferred to memory; therefore, only one memory access is required. Generally, the cache controller loads the instruction cache with the required cache line immediately after the data becomes valid. The next prefetch from the processor is, therefore, stalled for a cache load cycle, and a performance penalty results from this stall. Efforts to ameliorate the performance penalty attributable to cache writes from a data bus have centered on the use of buffers to temporarily store the data for a pending cache load. Typically, these buffers (commonly referred to as "push" buffers) provide the requested data to the integer unit via an internal bus. Generally, previous systems have not provided a mechanism to directly access the cache line stored in the push buffer during a subsequent prefetch request for the same cache line. Thus, although the use of push buffers may alleviate the problem of stalling the processor for a cache load cycle, these push buffers are not accessible in parallel with the instruction cache. Consequently, a subsequent prefetch request from the processor for data contained in the cache line stored in the push buffer results in another bus transfer to retrieve the required data from external memory. This duplicative bus transfer creates yet another performance penalty.
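The duplicative bus transfer can be made concrete with a simplified model of a controller whose push buffer is not consulted on a prefetch. The sketch below is an assumption-laden illustration rather than a description of any actual prior system: the class name, the single-entry push buffer, the explicit `drain_push_buffer` step that stands in for the deferred cache load, and the 16-byte line size are all hypothetical.

```python
LINE_SIZE = 16  # assumed cache line size in bytes


class PushBufferController:
    """Hypothetical controller with a push buffer that prefetches cannot see."""

    def __init__(self):
        self.cache = set()        # line addresses resident in the instruction cache
        self.push_buffer = None   # line address awaiting the deferred cache load
        self.bus_transfers = 0

    def prefetch(self, addr: int) -> str:
        line_addr = addr & ~(LINE_SIZE - 1)  # align down to the line boundary
        if line_addr in self.cache:
            return "hit"
        # Miss in the cache. The push buffer is NOT checked, so a line that is
        # already held there is fetched from external memory again.
        self.bus_transfers += 1
        self.push_buffer = line_addr
        return "miss"

    def drain_push_buffer(self) -> None:
        """Perform the deferred cache load (timing is an assumption of this sketch)."""
        if self.push_buffer is not None:
            self.cache.add(self.push_buffer)
            self.push_buffer = None
```

In this model, two prefetches to the same line that arrive before the push buffer is drained cost two bus transfers, whereas a prefetch issued after the deferred cache load completes is served as a hit.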