1. Field of the Invention
The present invention relates to the field of computer systems and associated cache memory structures. More particularly, the present invention relates to a cache controller and associated registers to permit multiple overlapping cache access operations.
2. Art Background
Typically a central processing unit (CPU) in a computer system operates at a substantially faster speed than main memory. When the CPU executes instructions faster than memory can supply them, the CPU must idle until the next instruction, or the datum upon which the instruction will operate, is available. To avoid excessive CPU idle time while awaiting data or instructions from the large main memory, a smaller cache memory capable of operating at a higher speed than the main memory is often used to buffer the data and the instructions between the main memory and the CPU.
The data and instructions in memory locations of the main memory are mapped into the cache memory in block frames. Each block frame consists of a block offset corresponding to a number of memory locations storing data and instructions associated with that block. To further improve the overall CPU performance, some computer systems employ separate cache memories, one for data and one for instructions.
However, the use of separate cache memories does not entirely solve the performance problem. When a cache read "miss" occurs, that is, when the datum or instruction requested by the CPU is not in the cache memory, the cache memory must retrieve the datum or instruction from the main memory. To do so, typically the entire block frame of data or instructions including the requested datum or instruction is retrieved, and the CPU idles until the entire block frame retrieval is completed. Many other cache performance problems and improvement techniques exist; the reader is referred, for example, to J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, pp. 454-61, (Morgan Kaufmann, 1990).
The time necessary to fill the cache memory with the replacement block frame depends on the block size and the transfer rate of the cache memory-main memory hierarchy. For example, if the block size is eight (8) words and the speed of the main memory is two (2) words per three (3) clock cycles, then it takes eleven (11) clock cycles to fill the cache memory with the replacement block frame. However, reducing the block frame size or filling a partial block when a cache read miss occurs does not necessarily reduce CPU idle time, since a smaller block size will increase the likelihood of future cache read misses.
Various techniques have been used to minimize the amount of CPU idle time waiting for the cache memory, and latency time waiting for completion of main memory accesses, when cache read misses occur. One common practice is "early restart", wherein as soon as the requested datum or instruction arrives in cache from main memory, it is sent to the CPU without waiting for the retrieval of the entire block to be completed. Using early restart, the CPU may resume execution of instructions upon receipt of the awaited instruction while the remainder of the replacement block frame is written to cache from main memory.
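As a rough illustration of the benefit of early restart, the following sketch counts how many word transfers elapse before the CPU can resume execution. It assumes, purely for illustration, that the block frame is fetched sequentially starting at offset 0 and that one word arrives per transfer; the function names and this timing model are hypothetical and are not taken from any particular cache design.

```python
def transfers_before_resume_full_block(block_size: int, requested_offset: int) -> int:
    """Without early restart: the CPU waits until the whole block frame is filled."""
    if not 0 <= requested_offset < block_size:
        raise ValueError("requested_offset must lie within the block frame")
    return block_size


def transfers_before_resume_early_restart(block_size: int, requested_offset: int) -> int:
    """With early restart: the CPU resumes as soon as the requested word arrives.

    Assumes a sequential fill from offset 0, one word per transfer.
    """
    if not 0 <= requested_offset < block_size:
        raise ValueError("requested_offset must lie within the block frame")
    return requested_offset + 1


if __name__ == "__main__":
    # Eight-word block frame, CPU requested the word at offset 5.
    print(transfers_before_resume_full_block(8, 5))
    print(transfers_before_resume_early_restart(8, 5))
```

Under these assumptions the saving depends on where the requested word sits in the block frame: a word near the start of the block resumes the CPU almost immediately, while a word at the end of the block gains nothing.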
A further refinement of the early restart technique is "out of order fetch", wherein a request is made to main memory to retrieve the requested datum or instruction first, skipping all the data or instructions that precede it in the replacement block frame. As in the case of early restart, the datum or instruction retrieved by out of order fetch is sent to the CPU as soon as it is retrieved, and the CPU may resume execution while the rest of the replacement block frame is being retrieved. After retrieving the requested datum or instruction, the main memory continues to retrieve the remaining data and instructions in the replacement block frame, starting with the datum or instruction immediately after the requested one. The main memory then loops around to the beginning of the block frame to retrieve the previously skipped data or instructions, until the entire block frame is written to cache. Thus, the CPU can resume execution as soon as the first datum or instruction is retrieved from the main memory.
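The wrap-around retrieval order of out of order fetch can be sketched as follows. This is a minimal illustration only: the function name and the representation of a block frame as a range of word offsets are assumptions made for the example.

```python
def out_of_order_fetch_order(block_size: int, requested_offset: int) -> list[int]:
    """Return the order in which word offsets of a block frame are retrieved.

    The requested offset comes first, followed by the offsets after it,
    then the fetch wraps around to the skipped offsets at the start.
    """
    if not 0 <= requested_offset < block_size:
        raise ValueError("requested_offset must lie within the block frame")
    return [(requested_offset + i) % block_size for i in range(block_size)]


if __name__ == "__main__":
    # Eight-word block frame, CPU requested the word at offset 5.
    print(out_of_order_fetch_order(8, 5))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

In this ordering the requested word is always the first one transferred, so the CPU can resume after a single transfer regardless of where the word lies in the block frame.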
Traditional cache memories typically do not allow read and write operations to be performed against them in the same clock cycle. Thus, cache response to another request from the CPU while trying to fill the rest of the replacement block frame is quite complicated. As a result, the CPU typically idles again after the datum or instruction is executed, and waits for the remaining retrievals to be completed. The CPU will idle and wait for the remaining data or instructions being retrieved, even if the subsequent datum or instruction requested by the CPU is already in the cache memory. Thus, the benefits derived from early restart and out of order fetch are limited where the CPU is likely to complete its execution before the rest of the replacement block frame is written. This is especially likely to occur in computer systems where the number of clock cycles required to execute a typical instruction is small, for example, RISC (reduced instruction set computing) computers.
However, some modern cache memory structures allow read and write operations to be performed against them in the same clock cycle, thereby further reducing penalties associated with cache misses (particularly CPU idle time) and improving cache and overall system performance. For example, subsequent requests for data or instructions residing in the cache memory can be satisfied during the second half of the clock cycle. The problem is determining when the data or instructions are in the cache memory and synchronizing their transfer from the cache memory to the CPU during the second half of the clock cycle, without substantial investment in additional hardware. Likewise, a similar problem exists in satisfying the subsequent requests for data or instructions from the main memory.
Still more recently, computer systems having multiple processors have become common. In a multiple processor system, some or all of the several processors may simultaneously attempt to access the block frames stored in the cache, either for read or write purposes, and may direct that data be routed to or from any of various sources and destinations within the computer system. In a multiple processor system, proper system operation depends on maintaining proper correspondence of data stored in the cache with the corresponding processor, where any of several processors may access and alter cache-stored data. Correspondence of data to the proper processor is termed "cache consistency".
Thus, it is desirable to provide a new approach to controlling a cache memory to permit multiple outstanding read and write operations in an overlapping, substantially contemporaneous fashion in a high performance CPU that further reduces CPU idle time and latency between accesses to main memory and delivery of the requested instructions or data. It is particularly desirable if cache miss penalties are thereby reduced. It is also desirable if the hardware requirements necessary to implement the cache controller and associated control registers can be minimized.
As will be described in the following detailed description, these objects and desired results are among the objects and desired results of the present invention, which overcomes the disadvantages of the prior art. The detailed description discloses a cache memory controller and methods for implementing a cache memory system that fetches data for a multiple processor computer system and reduces CPU idle time by supporting multiple outstanding operations.