A cache is a buffer that holds information from a main memory for quick access by a processor. A branch target cache (BTC) is a cache designed to hold groups of branch target instructions, which come into play when the program takes a branch, such as a jump instruction. When such a branch occurs, the target instructions must be fetched from a new location in main memory, and the processor must wait for them to arrive after a main memory read cycle. The BTC stores several instructions for each of various branch targets so that the processor can be kept busy during main memory read cycles, when it would otherwise be idle.
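To make this organization concrete, the following Python sketch models a BTC as a fixed number of blocks, each keyed by a branch target address and holding the target instruction plus the instructions that follow it. The class name `BranchTargetCache`, the dictionary standing in for main memory, and the least-recently-used replacement policy are illustrative assumptions, not details taken from the text.

```python
from __future__ import annotations

from collections import OrderedDict


class BranchTargetCache:
    """Hypothetical BTC model: each block maps a branch target address to
    the target instruction and the instructions immediately following it."""

    def __init__(self, num_blocks: int, block_size: int):
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Insertion order tracks recency; LRU replacement is an assumption.
        self._blocks: OrderedDict[int, list[str]] = OrderedDict()

    def lookup(self, target: int) -> list[str] | None:
        """Return the cached block for a branch target, or None on a miss."""
        block = self._blocks.get(target)
        if block is not None:
            self._blocks.move_to_end(target)  # mark as most recently used
        return block

    def fill(self, target: int, memory: dict[int, str]) -> None:
        """Copy block_size consecutive instructions starting at target."""
        block = [memory[target + i]
                 for i in range(self.block_size)
                 if target + i in memory]
        self._blocks[target] = block
        self._blocks.move_to_end(target)
        if len(self._blocks) > self.num_blocks:
            self._blocks.popitem(last=False)  # evict least recently used


main_memory = {addr: f"I{addr}" for addr in range(16)}
btc = BranchTargetCache(num_blocks=2, block_size=2)
if btc.lookup(4) is None:          # miss: processor must wait on main memory
    btc.fill(4, main_memory)
print(btc.lookup(4))               # hit: target block supplied from the BTC
```

On a hit the processor works through the cached block while main memory fetches what follows; on a miss it waits for the full read cycle.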
Generally speaking, a processor can access information from a cache memory in 20 to 50 nanoseconds, but takes 200 to 500 nanoseconds to access information from a main memory. Thus, by storing blocks of the most frequently used branch target instructions, each block containing the target instruction and the instructions immediately following it, processor idle time can be greatly reduced.
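Taking the quoted access-time ranges at face value, a short Python calculation shows how wide the mismatch is: one main memory access spans roughly 4 to 25 cache access times. The function name `latency_ratio_range` is hypothetical.

```python
def latency_ratio_range(cache_ns=(20, 50), memory_ns=(200, 500)):
    """Return the (smallest, largest) ratio of main-memory to cache access time."""
    # Fastest main memory compared with the slowest cache.
    smallest = memory_ns[0] / cache_ns[1]
    # Slowest main memory compared with the fastest cache.
    largest = memory_ns[1] / cache_ns[0]
    return smallest, largest


low, high = latency_ratio_range()
print(f"main memory is {low:.0f}x to {high:.0f}x slower than the cache")
```

If the processor consumed one instruction per cache access (an assumption for illustration), a cached block would need roughly that many instructions to hide a main memory read completely.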
The efficiency of a cache is determined by the rate at which the processor succeeds in finding, or "hitting," information within the cache, and by the time spent accessing the cache on a hit. The cache, however, has a limited number of locations for storing blocks of instructions. If the cache is organized to hold many blocks with few instructions per block, the hit rate increases, but relatively few instructions are fetched from the cache on each hit. This organization is most efficient with a fast main memory, so that the instructions following those stored in the BTC become available quickly once the cached instructions are exhausted. With a slower main memory, idle cycles increase.
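The interaction between hit rate, block size, and main memory speed can be captured in a simple expected-value model. In this Python sketch, `expected_idle_cycles` is a hypothetical name, and the model assumes one instruction per cycle, a main memory fetch that starts when the branch is taken, and a full-latency stall on a BTC miss. The hit rates in the usage loop are made-up values for illustration only.

```python
def expected_idle_cycles(hit_rate: float, block_size: int, latency: int) -> float:
    """Average processor idle cycles per taken branch under a simple model.

    Assumptions (illustrative only): one instruction executes per cycle, the
    main-memory fetch for the instructions after a cached block starts on the
    branch, and a BTC miss leaves the processor idle for the full latency.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    idle_on_hit = max(0, latency - block_size)  # block runs out before memory delivers
    idle_on_miss = latency                      # nothing cached, wait for memory
    return hit_rate * idle_on_hit + (1.0 - hit_rate) * idle_on_miss


# Hypothetical hit rates for an 8-instruction cache; not measured values.
for latency in (2, 6):
    many_small = expected_idle_cycles(0.9, block_size=2, latency=latency)
    few_large = expected_idle_cycles(0.7, block_size=4, latency=latency)
    print(f"latency {latency}: 4x2 blocks -> {many_small:.2f}, 2x4 blocks -> {few_large:.2f}")
```

With these made-up numbers, the many-small-blocks organization idles less when latency is 2 cycles and more when latency is 6 cycles, which mirrors the trade-off described in the text.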
The particular processor idle time problem addressed by the invention may be seen more clearly with reference to FIGS. 1A-1B and 2A-2C, which set forth a simple example of block length configurations and latency to facilitate understanding of the problem and of the solution offered by the invention.
FIG. 1A illustrates part of a BTC organized as 4-instruction blocks; only two such blocks are shown for simplicity. FIG. 1B illustrates an organization of 4 blocks of 2 instructions each. Note that the total capacity, 8 instructions, is the same in both cases. Comparing the cache organizations of FIGS. 1A and 1B, it may be appreciated that when the cache is organized to hold fewer blocks containing more instructions per block (FIG. 1A), the hit rate decreases, but processor idle cycles also decrease because more instructions are fetched from the BTC on each hit. This organization is most efficient with a slower main memory; with a fast main memory, some cache locations would be under-utilized.
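The fixed-capacity trade-off between FIGS. 1A and 1B can be enumerated directly. This Python sketch (the function name `organizations` is hypothetical) lists every way to divide an 8-instruction cache into equal blocks; FIG. 1A corresponds to the split (2, 4) and FIG. 1B to (4, 2).

```python
from __future__ import annotations


def organizations(capacity: int) -> list[tuple[int, int]]:
    """List every (num_blocks, block_size) split of a cache holding
    `capacity` instructions in equal-sized blocks."""
    return [(capacity // size, size)
            for size in range(1, capacity + 1)
            if capacity % size == 0]


for num_blocks, block_size in organizations(8):
    # More blocks means more distinct branch targets can be cached;
    # larger blocks mean more instructions are supplied per hit.
    print(f"{num_blocks} block(s) x {block_size} instruction(s): "
          f"{num_blocks} distinct targets, {block_size} instructions per hit")
```

Every split trades one quantity against the other, since the product is fixed by the cache capacity.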
FIGS. 2A-2C illustrate memory latency for different fixed organizations of instruction blocks stored in cache locations. In each case, the block size is known to the main memory and the correct next instruction is requested from memory after the instructions of the block have been examined.
FIG. 2A illustrates a long memory latency compared to the number of instructions organized in a block. Here, each block comprises 2 instructions and the latency is 4 cycles. In response to a "jump" command, the processor searches the BTC for a block matching the jump target instruction. The cache supplies the first two instructions, but the processor must then idle while waiting for the next group of instructions from main memory. Because the memory latency period in FIG. 2A is four cycles, the processor is idle for two cycles.
FIG. 2B illustrates a memory latency that matches the number of instructions organized in a block, which is the ideal case. Here, each block comprises 4 instructions and the latency period is again 4 cycles. The BTC supplies instructions for exactly as long as main memory needs to deliver the next ones, so the processor receives instructions at the rate it can use them. The additional instructions from main memory arrive just in time, and the processor does not idle after executing the four instructions retrieved from the BTC.
FIG. 2C illustrates a short memory latency period compared to the number of instructions in a block. Here, each block comprises 4 instructions and the latency period is 2 cycles. In this case, main memory supplies instructions before the processor has finished executing the instructions from the BTC. The third and fourth instructions retrieved from the BTC could therefore have been supplied by main memory directly, and the cache locations they occupy could have held instructions for other branches.
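The three cases of FIGS. 2A-2C follow from a simple per-hit timing model. The Python sketch below assumes one instruction executes per cycle and that the main memory request for the instruction following the block is issued when the BTC hit occurs; `hit_timing` and `HitTiming` are hypothetical names. Under these assumptions it reproduces the two idle cycles of FIG. 2A, the balanced case of FIG. 2B, and the two redundant cache entries of FIG. 2C.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitTiming:
    idle_cycles: int        # cycles the processor waits after draining the block
    redundant_entries: int  # cached instructions main memory could have supplied in time


def hit_timing(block_size: int, latency: int) -> HitTiming:
    """Timing of one BTC hit, assuming one instruction per cycle and a
    main-memory request issued at the moment of the hit."""
    return HitTiming(
        idle_cycles=max(0, latency - block_size),        # block too short (FIG. 2A)
        redundant_entries=max(0, block_size - latency),  # block too long (FIG. 2C)
    )


for fig, block, lat in [("2A", 2, 4), ("2B", 4, 4), ("2C", 4, 2)]:
    t = hit_timing(block, lat)
    print(f"FIG. {fig}: block={block}, latency={lat} -> "
          f"idle={t.idle_cycles}, redundant={t.redundant_entries}")
```

Both quantities are zero only when block size equals latency, which is why a single fixed block size cannot suit both fast and slow main memories.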
Accordingly, a cache is needed that will overcome problems of memory latency and function efficiently with either fast or slow main memories.