1. Field of the Invention
The present invention relates to the field of computer systems and their cache memory. More particularly, the present invention relates to cache misses.
2. Art Background
Typically the Central Processing Unit (CPU) in a computer system operates at a substantially faster speed than the main memory. In order to avoid having the CPU idle too often while waiting for data or instructions from the main memory, a cache memory which can operate at a higher speed than the main memory is often used to buffer the data and the instructions between the main memory and the CPU. The data and instructions in memory locations of the main memory are mapped into the cache memory in block frames. Each block frame comprises a plurality of block offsets corresponding to a plurality of memory locations storing a plurality of the data and instructions. To further improve the overall CPU performance, some computer systems employ separate cache memories, one for data and one for instructions.
However, the use of separate cache memories does not solve the problem entirely. When a cache read miss occurs, that is, when the datum or instruction requested by the CPU is not in the cache memory, the cache memory has to retrieve the datum or instruction from the main memory. To do so, typically the entire block frame of data or instructions comprising the requested datum or instruction is retrieved, and the CPU goes idle until the retrieval is completed. For other cache performance problems and improvement techniques, see J. L. Hennessy and D. A. Patterson, Computer Architecture--A Quantitative Approach, pp. 454-461 (Morgan Kaufmann, 1990).
The amount of time it takes to fill the cache memory with the replacement block frame depends on the block size and the transfer rate of the cache memory-main memory hierarchy. For example, if the block size is eight (8) words and the speed of the main memory is two (2) block offsets per three (3) clock cycles, then it takes eleven (11) clock cycles to fill the cache memory with the replacement block frame, assuming memory accesses are pipelined. Reducing the block frame size or filling only a partial block on a cache read miss does not necessarily reduce the CPU idle time, since it increases the likelihood of future cache read misses.
Various techniques have been used to minimize the amount of CPU idle time waiting for the cache memory when cache read misses occur. One common practice is early restart, that is, as soon as the requested datum or instruction arrives, it is sent to the CPU without waiting for the writing of the entire block to be completed. Therefore, the CPU may resume its execution while the rest of the replacement block frame is being written.
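The effect of early restart can be illustrated with a minimal sketch. It assumes, purely for illustration, that the replacement block frame is fetched in order starting from block offset 0 and that one block offset arrives per transfer; the function name and parameters are hypothetical and do not come from any particular cache design.

```python
def transfers_before_resume(block_size, requested_offset, early_restart):
    """Return how many block offsets must arrive before the CPU can resume.

    Illustrative assumptions: the block frame is fetched in order from
    offset 0, and one block offset arrives per transfer.
    """
    if not 0 <= requested_offset < block_size:
        raise ValueError("requested_offset must lie inside the block frame")
    if early_restart:
        # The CPU resumes as soon as the requested offset has arrived.
        return requested_offset + 1
    # Without early restart the CPU waits for the entire block frame.
    return block_size
```

Under these assumptions, for an eight-offset block frame with the requested datum at offset 2, the CPU waits for eight transfers without early restart and for three with it.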
A further refinement of the early restart technique is out-of-order fetch, which is a request to the main memory to retrieve the requested datum or instruction first, skipping all the data or instructions before the requested datum or instruction in the replacement block frame. As with early restart, the retrieved datum or instruction is sent to the CPU as soon as it is retrieved. Again, the CPU may resume its execution while the rest of the replacement block frame is being retrieved. After retrieving the requested datum or instruction, the main memory continues to retrieve the remaining data and instructions in the replacement block frame, starting with the datum or instruction after the requested one and proceeding until the end of the block frame is reached, and then loops around to retrieve the previously skipped data or instructions at the beginning of the block frame. Thus, the CPU can resume execution as soon as the first datum or instruction is retrieved from the main memory.
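The wrap-around retrieval order described for out-of-order fetch can be sketched as follows. This is only an illustration of the ordering; the function name and parameters are hypothetical, and block offsets are assumed to be numbered from 0.

```python
def out_of_order_fetch_sequence(block_size, requested_offset):
    """Return the order in which block offsets are retrieved.

    The requested offset comes first, followed by the offsets after it up
    to the end of the block frame, then the skipped offsets at the start.
    """
    if not 0 <= requested_offset < block_size:
        raise ValueError("requested_offset must lie inside the block frame")
    return [(requested_offset + i) % block_size for i in range(block_size)]
```

For an eight-offset block frame with the requested datum at offset 5, the sequence is 5, 6, 7, 0, 1, 2, 3, 4, so the requested datum is always the first one retrieved.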
However, traditional cache memories typically do not allow read and write operations to be performed against them in the same clock cycle. This complicates the handling of another request from the CPU while the rest of the replacement block frame is being filled. As a result, the CPU typically goes idle again after it has consumed the datum or executed the instruction, and waits for the remaining retrievals to be completed. The CPU goes idle and waits even if the subsequent datum or instruction requested by the CPU is already in the cache memory or is part of the remaining data or instructions being retrieved. Thus, the benefits of early restart and out-of-order fetch are limited if the CPU is likely to complete its execution before the rest of the replacement block frame is written. This is especially likely to occur on computer systems where the number of clock cycles required to execute a typical instruction is small, for example, RISC computers, and in particular superscalar RISC computers, where more than one instruction is executed in each clock cycle.
Today, some modern cache memories do allow read and write operations to be performed against them in the same clock cycle, thus providing new opportunities for further reducing cache miss penalties, particularly CPU idle time, and improving cache and overall system performance. Subsequent requests for data or instructions that are in the cache memory can be satisfied during the second half of the clock cycle. The problem is knowing that the data or instructions are in the cache memory and synchronizing their readout from the cache memory to the second half of the clock cycle, without substantial investment in additional hardware. Likewise, to satisfy subsequent requests for data or instructions that are in the process of being retrieved from the main memory, the problem is knowing when the data or instructions are retrieved and synchronizing their direct transfer to the CPU with their retrieval, without substantial investment in additional hardware.
Thus, it is desirable to provide a new approach to fetching data from cache memories that allow read and write operations to be performed in the same clock cycle, an approach that further reduces CPU idle time. It is particularly desirable if cache miss penalties are reduced. It is also desirable if subsequent data being fetched by the CPU can be returned to the CPU during a cache memory fill, without having the CPU remain idle waiting for the cache memory fill to complete, if the data being fetched is part of the memory block frame currently being cached.
As will be described, these objects and desired results are among the objects and desired results of the present invention, which overcomes the disadvantages of the prior art, and provides a method and cache memory controller for fetching data for a CPU that further reduces CPU idle time.