A cache memory, which exploits locality of memory references, is often used to improve performance in a computer system. In a cache memory scheme, copies of the memory words likely to be accessed in the immediate future are kept in the cache memory. In some computer systems, instructions and data are cached in separate cache memories. For example, an instruction cache stores a small number of instructions residing in the next consecutive memory locations. Recently, "on-chip" cache memories integrated with the central processing unit (CPU) have also become common in microprocessor designs.
Under one scheme, the central processing unit (CPU) first looks in the cache memory for the data to be read, and if the data is not found in the cache memory, the main memory is then accessed. If the data sought is found in the cache memory, a "cache hit" is said to have occurred. Conversely, if the data being accessed is not found in the cache memory, a "cache miss" is said to have occurred. The desired data is then refilled from the main memory. Each refill from the main memory typically brings in a block of memory words, one of which is the memory word that generated the cache miss. Many techniques in memory organization, such as using dynamic random access memories (DRAMs) under page mode, or using an interleaved memory architecture, allow the memory system to deliver, after an initial access time ("latency"), successive memory words at time intervals much shorter than the initial access time. Each bus transaction may require one or more processor clock periods. These memory access methods, i.e., delivering memory words in rapid succession after the initial latency, are called "burst" mode accesses if the memory system delivers one memory word per processor clock cycle. If the memory words are delivered at a rate slower than one memory word per clock cycle, the access method is called "throttled" access. Burst mode read access is especially suited for cache refilling.
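The distinction between burst and throttled access can be illustrated with a small timing model. The following is only a sketch under simplifying assumptions: the function names, the parameter names, and the convention that the first word arrives exactly at the end of the latency are hypothetical and are not taken from any particular memory system.

```python
# Hypothetical timing model for a cache refill; all names are illustrative.

def word_arrival_cycles(latency, words, interval):
    """Return the clock cycle at which each word of a refill block arrives.

    latency  -- cycles until the first word is available (initial access time)
    words    -- number of memory words in the refill block
    interval -- cycles between successive words after the first
    """
    if latency < 1 or words < 1 or interval < 1:
        raise ValueError("latency, words and interval must be positive")
    # Assumption: word 0 arrives at `latency`, word i arrives i intervals later.
    return [latency + i * interval for i in range(words)]


def access_mode(interval):
    """Classify the access: one word per clock cycle is burst, slower is throttled."""
    return "burst" if interval == 1 else "throttled"


if __name__ == "__main__":
    # Assumed example values: 6-cycle latency, 4-word block.
    for interval in (1, 2):
        print(access_mode(interval), word_arrival_cycles(6, 4, interval))
```

With an assumed latency of six cycles and a four-word block, the model shows the throttled refill finishing later than the burst refill even though both begin delivering at the same cycle.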
In one cache organization scheme, known as the "direct-mapped" organization scheme, each location in the cache memory is mapped by the lower order bits of the memory address to multiple locations in the main memory. The remaining bits of the memory address form a "tag" field in the data word stored in the cache memory. Depending on the specific organization of the cache, each tag may be shared by a number of data words inside the cache. The number of data words sharing a tag is known as the "line size." Usually, the lower order bits of the memory word address index into the memory words of the cache line. Under the direct-mapped scheme, the cache memory is accessed using the lower order bits, and a cache miss occurs when the higher order bits of the memory address do not match the tag field of the cache memory word.
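The direct-mapped lookup described above can be sketched as follows. This is a minimal illustration, not a description of any specific cache: the class name, the word-addressed memory modeled as a Python list, and the choice to refill one whole line on a miss are assumptions made for the example.

```python
# Hypothetical direct-mapped cache model; names and sizes are illustrative.

class DirectMappedCache:
    def __init__(self, num_lines, line_size, memory):
        self.num_lines = num_lines      # number of cache lines
        self.line_size = line_size      # words sharing one tag
        self.memory = memory            # main memory, modeled as a list of words
        self.tags = [None] * num_lines  # None marks a line that was never filled
        self.lines = [None] * num_lines

    def split(self, address):
        """Split a word address into (tag, line index, word offset)."""
        offset = address % self.line_size                    # lowest order bits
        index = (address // self.line_size) % self.num_lines
        tag = address // (self.line_size * self.num_lines)   # remaining high bits
        return tag, index, offset

    def read(self, address):
        """Return (word, hit). On a miss, refill the whole line from memory."""
        tag, index, offset = self.split(address)
        hit = self.tags[index] == tag
        if not hit:
            base = address - offset  # address of the first word in the block
            self.lines[index] = self.memory[base:base + self.line_size]
            self.tags[index] = tag
        return self.lines[index][offset], hit


if __name__ == "__main__":
    memory = list(range(100, 164))           # 64 words of assumed contents
    cache = DirectMappedCache(4, 4, memory)  # 4 lines of 4 words each
    for address in (5, 6, 21, 5):
        print(address, cache.read(address))
```

In this example, addresses 5 and 21 map to the same cache line with different tags, so reading one evicts the other. This is the consequence of several main memory locations sharing one cache location.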
The efficiency of the computer system is enhanced if the CPU does not wait for a memory reference. Therefore, the memory cycle of an instruction cache is typically matched to the instruction cycle of the CPU. To achieve this rate of operation, very high performance memory technology must be used. If the cache memory system has a very high "hit" rate, slower but less expensive components can be used to implement the main memory. However, if the main memory is implemented using a lower performance technology than the cache memory, the refilling operations may take multiple processor clock periods. Under such a condition, the designer of the CPU may be required to stall the CPU when a cache miss occurs, in order to wait for the main memory access to complete. In addition, the CPU designer typically has to provide the memory system designer flexibility to choose from a variety of memory system technologies (e.g., DRAMs of numerous speeds) to achieve a broad range of performance and economic objectives.
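The trade-off between hit rate and main memory speed can be made concrete with the standard average-access-time estimate, in which the average cost equals the hit time plus the miss rate multiplied by the miss penalty. The sketch below is illustrative only: the function name and the example figures are assumptions, and the model ignores effects such as overlapping refills with execution.

```python
# Hypothetical estimate of average memory access cost; values are illustrative.

def average_access_cycles(hit_rate, hit_cycles, miss_penalty_cycles):
    """Average cycles per access: hit time plus miss rate times miss penalty."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles


if __name__ == "__main__":
    # Assumed figures: 1-cycle hit, refill penalties of 10 and 20 cycles.
    for hit_rate in (0.90, 0.98):
        for penalty in (10, 20):
            cost = average_access_cycles(hit_rate, 1, penalty)
            print(f"hit rate {hit_rate:.2f}, penalty {penalty}: {cost:.2f} cycles")
```

Under these assumed figures, a higher hit rate lets a slower main memory (a larger miss penalty) be used at a similar or lower average cost per access.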
Two schemes are commonly used in the prior art to minimize the CPU stall time and maximize the benefits of burst mode access. In one scheme, the memory system is designed such that, after the initial latency, the rate of data arrival matches the processor cycle, so that no stall cycles are required after the initial latency. In the other scheme, a first-in-first-out (FIFO) memory, also called a read buffer, is used. The FIFO buffer can be provided either on-chip or off-chip. The processor stalls until the FIFO buffer is filled. No further stall cycles are required after the FIFO buffer is filled. Under this scheme, since the main memory does not communicate data readiness to the CPU, the CPU timing must assume the worst-case performance of the main memory.
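The stall cost of the two schemes can be compared with a simple model. The sketch below rests on assumptions that the text does not state: the function names are hypothetical, the FIFO buffer is assumed to hold the entire refill block, and the first word is assumed to arrive exactly at the end of the latency.

```python
# Hypothetical stall-cycle model for the two prior-art schemes.

def stall_matched_rate(latency):
    """Scheme 1: after the latency, one word arrives per processor cycle,
    so the CPU stalls only for the initial latency."""
    return latency


def stall_read_buffer(worst_latency, words, worst_interval):
    """Scheme 2: the CPU stalls until the FIFO holds the whole block.

    Because the memory gives no data-ready indication, the stall is fixed
    at the worst-case arrival time of the last word.
    """
    return worst_latency + (words - 1) * worst_interval


if __name__ == "__main__":
    # Assumed figures: 6-cycle latency, 4-word block, worst case 2 cycles per word.
    print("matched rate stall:", stall_matched_rate(6))
    print("read buffer stall: ", stall_read_buffer(6, 4, 2))
```

In this model the read buffer scheme always stalls for the worst-case fill time, even when the memory actually installed is faster than the worst case the CPU timing was designed for.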
Because of these factors, the total cost or performance of the computer system can be significantly impacted by the timing of data transfer between the main memory and the cache memory.