1. Field of the Invention
The present invention relates to data processing systems utilizing cache memories, and more particularly to reducing the effective latency for nonsequential accesses of a cache.
2. Prior Art
Caches are used in various forms to reduce the effective time required by a processor to access instructions or data that are stored in main memory. The theory of a cache is that a computer system attains a higher speed by using a small portion of very fast memory as a cache along with a larger amount of slower main memory. The cache memory is usually placed operationally between the data processing unit and the main memory. When the processor needs to access main memory, it looks first to the cache memory to see if the information required is available in the cache. When data and/or instructions are called from main memory, information is stored in the cache as part of a block of information (known as a cache line) that is taken from consecutive locations of main memory. During subsequent memory accesses to the same addresses, the processor interacts with the fast cache memory rather than main memory. Statistically, when information is accessed from a particular block in main memory, subsequent accesses most likely will call for information from within the same block. This locality of reference property results in a substantial decrease in average memory access time.
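The lookup-and-fill behavior described above can be illustrated with a minimal sketch. The direct-mapped organization, the class name `DirectMappedCache`, and the sizes used (4 lines of 8 addresses each) are assumptions chosen only for illustration; the text does not specify any particular cache organization.

```python
class DirectMappedCache:
    """Toy direct-mapped cache; the mapping and sizes are illustrative assumptions."""

    def __init__(self, num_lines=4, line_size=8):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # one tag per cache line, None means empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, fill the whole line and return False."""
        block = address // self.line_size  # block of consecutive main-memory addresses
        index = block % self.num_lines     # cache line that this block maps to
        tag = block // self.num_lines      # distinguishes blocks sharing that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag             # model fetching the entire block from main memory
        self.misses += 1
        return False


cache = DirectMappedCache()
for addr in range(32):  # 32 sequential accesses span 4 blocks of 8 addresses
    cache.access(addr)
print(cache.hits, cache.misses)  # prints "28 4"
```

In this sketch only the first access to each block misses; the remaining seven accesses to the same block hit, which is the locality of reference effect described above.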
There are two major conflicting goals in designing caches. First, it is desired that cache size be large so that off-chip memory accesses in case of a cache miss are minimized. Second, as processor speeds increase, it becomes especially important that caches are designed to be fast enough to return instructions and data to the processor without slowing down overall system performance. Unfortunately, as the cache gets larger, it also gets slower due to an increase in the parasitic capacitance of the cache memory.
A number of techniques are used to reconcile these two goals. By dedicating the cache to only a certain type of data, one can reduce the relative size required of the cache. For example, many processors incorporate separate instruction and data caches. Further, because the pattern of access for instruction caches is typically sequential, the hit/miss ratio is relatively high. Thus, the need to go off-chip to retrieve instructions is reduced and performance is enhanced.
Two factors contribute to the measure of the speed of a cache. The latency of a cache is the delay (typically measured in processor cycles) between presenting an address to the cache and receiving the requested data from the cache. The throughput of the cache is a measure of the number of memory access operations that can be performed in any one time period. During the latency period, the cache may be considered to have an idle period in which no data is returned from the cache in response to the address. The duration, L, of the idle period is one cycle less than the latency period.
It is known in the art that pipelined memory systems can use prefetching to increase their throughput. The Intel i960CA™ and i960CF™ processors, manufactured by Intel Corporation of Santa Clara, Calif., are examples of processors that support pipelined memory systems. In particular, an instruction cache may be implemented as a two-stage pipelined cache, for example. During the first stage of the pipeline, an instruction address (instruction pointer) is presented to the tag array of the cache. The results are latched for one cycle, and during the second stage the memory access is continued by accessing the cache instruction array lines in the case of a hit, or accessing memory in the case of a miss. In other words, the instruction address may be presented in cycle one, the cache is in a wait state in cycle two, and if the instruction address hits the cache, the instruction is returned in cycle three.
The latency of the above-described pipelined cache is two cycles. However, the effective latency can be decreased to one cycle by prefetching instructions from subsequent sequential addresses during the idle cycle. During cycle two, the instruction sequencer (program counter) increments the instruction pointer to point to the next instruction to be fetched, and presents this pointer address to the cache. As a result, the instruction found at the address presented in cycle one is returned in cycle three, and the subsequent instruction is returned in cycle four. Thus, the throughput of the cache is increased by one hundred percent, from one instruction every other cycle to one instruction per cycle.
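The cycle timing of the two-stage pipelined cache can be sketched as follows. The function name `return_cycles` and its parameters are hypothetical, and the model assumes that every access hits the cache and that, without prefetching, the next address is presented only in the cycle in which the previous instruction is returned.

```python
def return_cycles(num_instructions, latency=2, prefetch=True):
    """Cycle in which each of num_instructions sequential instructions is returned.

    Cycles are numbered from one. An address presented in cycle c is assumed
    to hit the cache and return its instruction in cycle c + latency.
    """
    cycles = []
    present = 1  # cycle in which the current address is presented to the cache
    for _ in range(num_instructions):
        cycles.append(present + latency)
        # With prefetching, the next sequential address is presented in the
        # very next cycle (during the idle period). Without prefetching, the
        # next address is presented only when the previous data is returned.
        present += 1 if prefetch else latency
    return cycles


print(return_cycles(4, prefetch=True))   # prints [3, 4, 5, 6]
print(return_cycles(4, prefetch=False))  # prints [3, 5, 7, 9]
```

With prefetching, one instruction is returned per cycle once the pipeline is full; without it, one instruction is returned every other cycle.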
One skilled in the art will recognize that the number of stages of the pipelined cache may take on a wide range to accommodate system requirements. Further, one skilled in the art will recognize that the number by which the instruction pointer is incremented during each pipeline stage may vary depending upon whether the processor is superscalar (issues multiple instructions per cycle), and the number of pipeline stages, among other factors. The only requirement is that the instruction pointer be incremented to point to the instruction immediately after the last instruction fetched in the previous cycle.
By sequentially prefetching instructions from the pipelined cache, instruction throughput may be maintained at a relatively high rate. However, the pipelined cache suffers a performance penalty when nonsequential memory accesses are encountered. Nonsequential accesses include branches, calls and interrupts, among other changes in instruction flow. As mentioned above, the instruction sequencer causes instructions to be prefetched by sequentially incrementing the instruction pointer. When a branch instruction is encountered, however, instruction flow must be redirected to the target address specified by the branch instruction. The processor requires a number of cycles to decode the instruction, to detect that it is a branch, and to determine the branch target address at which instruction flow is to continue. During this time period, the pipelined cache returns prefetched instructions that lie in the sequential instruction flow immediately after the branch instruction. After the branch has been detected, these prefetched instructions must be flushed or allowed to drain from the pipeline without being executed, and instruction flow must be redirected to the branch target address.
When the branch target address is presented to the pipelined cache, the instruction at that address will be returned after a time period equal to the latency of the pipelined cache. Because branch instructions occur at a rate of approximately one out of every five instructions in a typical computer program, this delay creates a severe degradation in instruction throughput. This degradation is exacerbated in superscalar machines where each cycle of latency represents the delay of not just one instruction but of many.
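The size of this degradation can be estimated with a rough calculation. The sketch below assumes that every branch is taken, that each branch costs exactly the idle period of the cache (latency minus one cycle), and that the cycles spent decoding the branch are ignored; the function names are hypothetical and the model is illustrative only.

```python
from fractions import Fraction


def average_cycles_per_instruction(branch_fraction, latency):
    """Average cycles per instruction when each branch adds the cache idle period.

    Assumes one instruction per cycle in sequential flow, every branch taken,
    and a per-branch penalty equal to the idle period (latency - 1 cycles).
    """
    idle_period = latency - 1
    return 1 + branch_fraction * idle_period


def lost_issue_slots(latency, issue_width):
    """Instruction slots lost per branch on a machine issuing issue_width per cycle."""
    return (latency - 1) * issue_width


# One branch in every five instructions, two-cycle cache latency.
print(average_cycles_per_instruction(Fraction(1, 5), 2))  # prints 6/5
# Each idle cycle on a hypothetical four-issue superscalar machine wastes four slots.
print(lost_issue_slots(2, 4))  # prints 4
```

Under these assumptions the average rises from one cycle per instruction to 1.2, and the loss per branch scales with the issue width of the machine.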
It is desired to enhance the performance of a pipelined cache by reducing the effective latency caused by nonsequential memory accesses.