1. Field of the Invention
This invention generally relates to computer processors and, more particularly, to reducing data load latency in processors by using an early data return indication.
2. Description of Related Art
As computer systems continue to advance and become more complex, effective and efficient data transfer between the various components of a computer system has become increasingly critical in computer system design and implementation. In particular, considerable effort and research have been focused on reducing and/or hiding memory latency in computer and network systems in order to improve processor and overall system performance. For example, many processors speculatively execute streams of instructions, including speculatively retrieving load data from memory. In addition, some processors implement data speculation, in which the processor assumes that the correct data operands are available in the innermost cache. The processor verifies the assumption in parallel with instruction execution before committing the result to architectural state. Despite the efforts to reduce memory latency, high latency remains a major concern and one of the barriers to realizing faster and more efficient processors.
FIG. 1A is a block diagram illustrating a conventional prior art data speculation routine. After fetching at the fetch stage 102 and decoding at the decode stage 104, a data speculation transaction begins at the execution stage 106, in which a request is made for data for an instruction or load micro-operation (load or load uop). The execution stage 106 may include accessing the first level cache (FLC) for the referenced data. A replay controller/checker (checker) 108 may be used to verify the contents of the FLC by performing FLC lookup 110. If the checker 108 determines that the execution stage 106 received invalid data from the FLC, the execution results are discarded, and the execution is retried with data from the second level cache (SLC). The checker 108 may then verify the contents of the SLC by performing SLC lookup 112. If the checker 108 determines that the execution stage 106 received invalid data from the SLC, the execution results are discarded, and the execution is retried with data from the main system memory with memory lookup 114. When the checker 108 determines that the request has received valid data, the request moves to the writeback phase 116 and then to the retire stage 118.
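The retry sequence of FIG. 1A can be pictured with the following minimal sketch. It is an illustrative software model only, not a description of any actual hardware; the function name `speculative_load`, the dictionary-based caches, and the assumption that main memory always holds the requested address are hypothetical and introduced here for illustration.

```python
# Hypothetical software model of the FIG. 1A flow; all names are illustrative.
LEVELS = ("FLC", "SLC", "MEMORY")


def speculative_load(address, flc, slc, memory):
    """Return (data, source_level, replays) for a load of `address`.

    `flc` and `slc` are dicts modeling cache contents. `memory` is a dict
    that is assumed (for this sketch) to contain every requested address.
    """
    replays = 0
    for level_name, store in zip(LEVELS, (flc, slc, memory)):
        # Execution stage: speculatively use whatever this level supplies.
        data = store.get(address)
        # Checker lookup: the data is valid only if this level holds the address.
        if address in store:
            return data, level_name, replays
        # Invalid data: discard the result and retry with the next level.
        replays += 1
    raise KeyError(address)


if __name__ == "__main__":
    memory = {0x10: "a", 0x20: "b", 0x30: "c"}
    print(speculative_load(0x30, {}, {}, memory))
```

In this model a load that misses both caches is executed three times in total: once against each cache and once against memory, which mirrors the discard-and-retry behavior described above.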
FIG. 1B is a block diagram illustrating a conventional prior art external transaction behavior of a request. The external behavior of the request includes, for example, a data load that misses the SLC. The transaction begins by making an external bus request at the request phase 122. The transaction then enters the snoop phase 124 during which cache coherency is maintained throughout the system. Some time later, the memory controller of the system has the requested data available, which it indicates to the processor at the response phase 126. The actual data is transmitted to the processor during the data phase 128. The processor then loads the data into internal caches (e.g., FLC and SLC).
For load operations that miss both the FLC and the SLC, the data is eventually filled into the FLC, and the checker 108 then determines that the request has received valid data. However, until the data is filled into the FLC, the request (e.g., load micro-operation) is replayed continuously, resulting in significant power dissipation and wasted execution bandwidth. Furthermore, subsequent requests may depend upon the result of this load for correct operation. The checker 108 determines that these requests are dependent upon the uncompleted load request and directs them to be replayed as well. For example, typical latency may be about 500 core clocks; if the request is replayed every 20 core clocks, about 25 replays occur before the data is ready to be filled into the FLC. Stated differently, a typical load request may consume about 20 bus clocks between initiation of the execution stage 106 and the writeback phase 116.
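The replay count in the example above follows from dividing the miss latency by the replay interval. A minimal sketch of that arithmetic, using the figures from the text (the function name `replay_count` is hypothetical):

```python
def replay_count(latency_core_clocks, replay_interval_core_clocks):
    """Number of whole replay intervals that fit within the miss latency."""
    return latency_core_clocks // replay_interval_core_clocks


if __name__ == "__main__":
    # Figures from the example: about 500 core clocks of latency,
    # one replay about every 20 core clocks.
    print(replay_count(500, 20))
```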
FIG. 1C is a block diagram illustrating a conventional prior art external transaction behavior of a request. To reduce the power dissipation and the waste of execution bandwidth, a rescheduled request queue (RRQ) was introduced, illustrated as RRQ status 120. The RRQ status 120, which may be performed in parallel with the request phase 122, includes removing from the replay loop (e.g., after 3 loops) those loads that missed the SLC and inserting them into the RRQ. Since the true dependent uops of the missing load may be difficult to identify, all requests (e.g., loads) that are newer than the missing load may be inserted into the RRQ.
At the response phase 126, an RRQ wakeup protocol receives a data ready (DRDY) indication from the bus interface unit when the data is ready, triggering the RRQ draining. Although the use of the RRQ may reduce replay re-executions, the latency delay, i.e., the time between when the request for data is made and when the request is satisfied, is still high, as the data placed in the FLC (i.e., the data received from the memory) stays there for a long time until the request from the RRQ is finally returned to the execution unit. Furthermore, additional core clocks may be required to vaporize the core unit, drain the RRQ, and re-execute the loads, further contributing to lower processor performance.
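The RRQ behavior described above (parking a missing load together with all newer requests, then draining on the DRDY indication) can be sketched as follows. This is a simplified software model under stated assumptions, not the hardware design: the class `RescheduledRequestQueue`, the helper `should_park`, the list-based representation of in-flight requests, and the fixed limit of 3 replay loops (taken from the "e.g., after 3 loops" example) are all hypothetical.

```python
from collections import deque

REPLAY_LIMIT = 3  # assumed from the "e.g., after 3 loops" example in the text


def should_park(replay_loops, missed_slc):
    """A load leaves the replay loop once it has missed the SLC and
    has already gone around the loop REPLAY_LIMIT times."""
    return missed_slc and replay_loops >= REPLAY_LIMIT


class RescheduledRequestQueue:
    """Illustrative model of the RRQ; requests are represented by strings."""

    def __init__(self):
        self.parked = deque()

    def park(self, in_flight, missing_load):
        """Move `missing_load` and every newer request into the queue.

        `in_flight` is ordered oldest to newest. Returns the older requests
        that remain in the replay loop.
        """
        index = in_flight.index(missing_load)
        self.parked.extend(in_flight[index:])
        return in_flight[:index]

    def on_drdy(self):
        """DRDY wakeup: drain all parked requests, oldest first, for re-execution."""
        drained = list(self.parked)
        self.parked.clear()
        return drained


if __name__ == "__main__":
    rrq = RescheduledRequestQueue()
    remaining = rrq.park(["add1", "loadA", "mul2", "loadB"], "loadA")
    print(remaining, rrq.on_drdy())
```

Parking every newer request, rather than only the true dependents, reflects the point made above that true dependents of the missing load may be difficult to identify; the cost is that independent requests also wait for the DRDY indication.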