Cache memory systems have been designed to mitigate the access speed limitation of main storage by providing rapid access to instructions or data which are likely to be used during a relatively short time interval. Caches are used to reduce access latency and to reduce memory bandwidth requirements. The discussion that follows is primarily concerned with reducing access latency.
The available cache memory systems generally rely on empirically observed phenomena known as the spatial locality of reference and temporal locality of reference to determine which instructions to transfer from memory to cache before the instructions are actually referenced. These two phenomena refer to the tendency of a program, during any relatively small time interval, to access data or instructions which have addresses in the main storage that differ by a relatively small value. Stated another way, these properties hold that when a specific target instruction or datum is used by the processor, it is likely that the immediately adjacent instructions or data in the address space of main memory will be used close in time to the use of the target.
Transfers from memory to cache are made more efficient by fetching segments of instructions or data rather than single target instructions or data values. Up to a limit defined by the size and access speed of the cache, the larger the segment fetched, the greater the likelihood that the next reference to the cache memory will succeed.
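The effect of segment size on the hit rate can be illustrated with a minimal sketch. The function name `hit_rate`, the least-recently-used replacement policy, and the purely sequential reference trace are assumptions made for illustration only; they are not taken from any of the systems discussed here.

```python
from collections import OrderedDict


def hit_rate(addresses, segment_size, num_segments):
    """Return the fraction of references found in a small segment cache.

    Hypothetical model: the cache holds `num_segments` segments of
    `segment_size` words each and evicts the least recently used segment.
    """
    cache = OrderedDict()  # segment number -> placeholder, ordered by recency
    hits = 0
    for address in addresses:
        segment = address // segment_size
        if segment in cache:
            hits += 1
            cache.move_to_end(segment)  # mark as most recently used
        else:
            cache[segment] = True  # a miss fetches the whole segment
            if len(cache) > num_segments:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(addresses)


# Sequential trace of 64 references, total cache capacity fixed at 16 words.
trace = list(range(64))
rates = {size: hit_rate(trace, size, 16 // size) for size in (1, 4, 16)}
```

For this sequential trace the hit rate rises from 0 with one-word segments to 0.75 with four-word segments and 0.9375 with sixteen-word segments. A less regular trace would show the limit mentioned above, since larger segments leave room for fewer distinct segments in a cache of fixed capacity.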
The efficiency of any cache memory system can be improved by reducing the cache miss rate. In some existing systems, relatively elaborate methods have been developed to ensure that most program words and data values will be available in cache memory to satisfy the access requests of the central processing unit (CPU). For example, the IBM System/360 Model 91 computer was designed to fetch instructions before they were requested by the CPU. The system was designed to prefetch (and store in a buffer) up to 16 contiguous instructions ahead of the instruction being executed by the CPU. The system included special features for handling conditional branch instructions. Instructions of this type transfer control to one of two instruction streams based on the value of a condition determined at run time. The referenced computer system monitored the instruction stream for branch instructions. If one was encountered, the system would prefetch up to 16 instructions ahead on the not-taken path, plus the first four instructions on the branch-taken path. All of these instructions were stored in the program buffer. Any instructions which were not used were overwritten as buffer space was required.
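The two-path prefetch policy described above can be sketched as follows. This is an illustrative model only, not the actual hardware logic of the referenced computer: the function name `prefetch_plan`, the `branch_targets` mapping, and the treatment of instruction addresses as consecutive integers are all assumptions.

```python
SEQUENTIAL_DEPTH = 16  # instructions prefetched on the not-taken path
TAKEN_DEPTH = 4        # instructions prefetched on the branch-taken path


def prefetch_plan(pc, branch_targets):
    """Return (not_taken, taken) lists of addresses to prefetch.

    `pc` is the address of the instruction being executed, and
    `branch_targets` maps the address of each conditional branch to its
    target address (a hypothetical representation of the program).
    """
    # Always run ahead sequentially on the not-taken path.
    not_taken = list(range(pc + 1, pc + 1 + SEQUENTIAL_DEPTH))
    taken = []
    # If a branch lies in the sequential window, also fetch the start of
    # the branch-taken path for the first branch found.
    for address in not_taken:
        if address in branch_targets:
            target = branch_targets[address]
            taken = list(range(target, target + TAKEN_DEPTH))
            break
    return not_taken, taken


# Example: a conditional branch at address 105 with target address 200.
not_taken, taken = prefetch_plan(100, {105: 200})
```

In this example the plan covers addresses 101 through 116 on the not-taken path and addresses 200 through 203 on the branch-taken path. When no branch lies in the window, the second list is empty.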
A number of systems have employed look-ahead mechanisms based on information extracted at compile time. Software-based systems are embodied in compilers which generate code sequences that initiate prefetching of instructions, without a priori knowledge of the paths which are actually executed. Sequences of anticipatory program access requests are then communicated from the CPU to the memory at run time, when these prefetch instructions are executed.
With the advent of distributed computing over communications networks, the CPU and the cache are often separated from the main memory by a path which has a high bandwidth and, at the same time, a high latency. Such a configuration is typical in a campus or industry environment in which programs are stored in main memory in a server or other central computer system. The central computer provides program code to client CPUs, which execute the instructions. These client CPUs may be located as much as one kilometer away from the central computer.
Although communications technology improvements have resulted in continuing improvements in the data transfer rate as measured in bits per second, the actual latency time for a specific bit to travel from the server to the client is primarily determined by the physical distance and the propagation speed of the signal (e.g., the speed of light for a fiber optic connection, and lower speeds for electrically conductive media). The total latency is at least the sum of the propagation delay and the time required to transmit a single word (i.e., the word size divided by the network bit rate). As the network bit rate increases, the total latency approaches its lowest possible value, which is determined by the propagation delay.
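The lower bound on latency can be made concrete with a short worked calculation. The function name `total_latency_s`, the 32-bit word size, and the propagation speed of 2 x 10^8 meters per second (a typical figure for optical fiber) are assumptions chosen for illustration; only the one-kilometer distance comes from the discussion above.

```python
def total_latency_s(distance_m, propagation_speed_m_per_s, word_bits, bit_rate_bps):
    """Lower bound on the time for one word to reach the client, in seconds."""
    propagation_delay = distance_m / propagation_speed_m_per_s
    transmission_time = word_bits / bit_rate_bps  # word size / bit rate
    return propagation_delay + transmission_time


DISTANCE_M = 1000.0      # one kilometer between server and client
FIBER_SPEED = 2.0e8      # assumed propagation speed in fiber, m/s
WORD_BITS = 32           # assumed word size

slow_link = total_latency_s(DISTANCE_M, FIBER_SPEED, WORD_BITS, 10e6)  # 10 Mb/s
fast_link = total_latency_s(DISTANCE_M, FIBER_SPEED, WORD_BITS, 1e9)   # 1 Gb/s
```

Under these assumptions the latency is about 8.2 microseconds at 10 megabits per second and about 5.032 microseconds at 1 gigabit per second. A hundredfold increase in bit rate therefore leaves the latency only slightly above the 5 microsecond propagation delay, which no further increase in bit rate can reduce.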
A number of systems have employed logic in the memory to proactively transmit instructions to the CPU, in order to reduce access latency. One of the earlier forms of program memory with such logic was the Fairchild F8 microcomputer system. The F8 provided memory addressing logic in Program Storage Units (PSUs), separated from the CPU. The PSU included a program counter and an address stack. For normal sequentially executed instructions, the PSU provided the next instruction to the CPU proactively. Additionally, a single level of program stack was implemented in the PSU, to allow the CPU to respond immediately to an interrupt and then return to the main program with minimal penalty.
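The division of labor between the PSU and the CPU can be sketched in simplified form. The class below is a hypothetical model, not a description of the actual F8 hardware or its instruction set: the names `ProgramStorageUnit`, `next_instruction`, `interrupt`, and `return_from_interrupt` are assumptions, and the program is represented as a simple list indexed by address.

```python
class ProgramStorageUnit:
    """Simplified memory-side sequencer with a one-level return stack."""

    def __init__(self, program):
        self.program = program  # list of instructions indexed by address
        self.pc = 0             # program counter kept in the memory unit
        self.saved_pc = None    # single-level program stack

    def next_instruction(self):
        """Supply the next sequential instruction without a CPU address."""
        instruction = self.program[self.pc]
        self.pc += 1
        return instruction

    def interrupt(self, handler_address):
        """Save the return point and redirect to the interrupt handler."""
        self.saved_pc = self.pc
        self.pc = handler_address

    def return_from_interrupt(self):
        """Resume the main program at the saved return point."""
        self.pc = self.saved_pc
        self.saved_pc = None


psu = ProgramStorageUnit(["main0", "main1", "main2", "isr0", "isr1"])
sequence = [psu.next_instruction()]      # main program
psu.interrupt(3)                         # handler assumed to start at address 3
sequence.append(psu.next_instruction())  # first handler instruction
psu.return_from_interrupt()
sequence.append(psu.next_instruction())  # main program resumes
```

Because the program counter lives in the memory unit, the CPU in this model never sends an address for sequential fetches, and the single saved address is enough to return from one interrupt.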
Another prefetching system with logic in the memory is described in a paper by W. A. Halang entitled "A Distributed Logic Program Instruction Prefetching Scheme," Microprocessing and Microprogramming, Vol. 19, 1987, pp. 407-415, which is hereby incorporated by reference for its teachings on computer system design. The logic in Halang's Program Storage Module (PSM) enables program memories to provide the CPU with sequential and non-sequential instruction streams, and to perform prefetching along two paths in anticipation of a single conditional branch instruction. Each instruction provided by the PSM to the CPU includes a flag bit which identifies whether the instruction belongs to the branch-taken path or the branch-not-taken path. The instructions accumulate in a dual buffer in the CPU, which executes instructions from one side of the buffer at a time. When a branch instruction is executed, the CPU switches to the indicated buffer (i.e., taken or not-taken) and clears the contents of the other buffer. The PSM is eventually notified of this selection and the unused instructions are discarded.
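The dual-buffer behavior on the CPU side can be sketched as follows. This is a simplified reading of the scheme as summarized above, not a reproduction of Halang's design: the class name `DualBuffer`, its method names, and the use of a Boolean flag as the buffer key are assumptions.

```python
from collections import deque


class DualBuffer:
    """CPU-side buffer with one side per path of a pending branch."""

    def __init__(self):
        # Keyed by the flag bit: False = not-taken path, True = taken path.
        self.sides = {False: deque(), True: deque()}
        self.active = False  # side the CPU is currently executing from

    def receive(self, instruction, taken_flag):
        """Store an instruction sent by the memory on the flagged side."""
        self.sides[taken_flag].append(instruction)

    def next_instruction(self):
        """Fetch the next instruction from the active side."""
        return self.sides[self.active].popleft()

    def resolve_branch(self, taken):
        """Switch to the indicated side and clear the other one."""
        self.active = taken
        self.sides[not taken].clear()


buffer = DualBuffer()
buffer.receive("add", False)
buffer.receive("branch", False)
buffer.receive("fallthrough", False)  # prefetched on the not-taken path
buffer.receive("target", True)        # prefetched on the taken path

executed = [buffer.next_instruction(), buffer.next_instruction()]
buffer.resolve_branch(taken=True)     # the branch turns out to be taken
executed.append(buffer.next_instruction())
```

Because both paths are already buffered when the branch resolves, the CPU in this model continues with the correct instruction immediately, and the instructions prefetched on the other path are simply dropped.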
Other systems have sought to utilize compiler-generated sequences through hardware mechanisms to overcome the penalty incurred by the software-based prefetch systems when instruction access requests are communicated at run time. One such system is described in a paper by A. Dollas and R. F. Krick entitled "The Case for the Sustained Performance Computer Architecture," Computer Architecture News, Vol. 17, No. 6, December 1989, pp. 129-136, which is hereby incorporated by reference for its teachings on computer systems design. This paper discusses a system in which multiple Instruction Decode Units (IDUs) are provided, each capable of managing a stream of instructions. The IDUs are each capable of prefetching sequential instructions and jumps. A stack in each IDU adds the capability to anticipate recursive code or nested code that may lead to multiple calls of a single function. A single program execution controller guides the distribution of instructions from the memory to the IDUs.
In recent years, access/execute computer architectures have been employed to overcome the memory access latency problems in systems such as the IBM RS/6000 and the Intel i860. The main feature in these architectures is the high degree of decoupling between operand access and instruction execution. Separate, specialized processors are provided for each. The access unit processor performs all address computation and performs all memory read requests. Communications between the access unit and the execution unit are accomplished via shared queues, rather than through memory. Many conditional branch instructions are handled by the access unit processor, allowing it to run ahead of the execution unit processor to fetch instructions before they are referenced.
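The decoupling between operand access and instruction execution can be illustrated with a minimal sketch in which the two units communicate only through a shared queue. The function names, the strided address pattern, and the summation performed by the execution unit are assumptions for illustration; they do not describe the internal design of either of the systems named above.

```python
from collections import deque


def access_unit(memory, base, stride, count, operand_queue):
    """Compute addresses, read memory, and queue the operands."""
    for i in range(count):
        address = base + i * stride  # address computation stays here
        operand_queue.append(memory[address])


def execute_unit(operand_queue):
    """Consume queued operands without ever touching memory."""
    total = 0
    while operand_queue:
        total += operand_queue.popleft()
    return total


memory = list(range(100))  # hypothetical memory: each word holds its address
queue = deque()

# The access unit runs ahead and fills the queue before execution starts.
access_unit(memory, base=10, stride=2, count=4, operand_queue=queue)
queued = list(queue)
result = execute_unit(queue)
```

In this example the access unit queues the operands 10, 12, 14, and 16 before the execution unit starts, and the execution unit then computes their sum, 52, from the queue alone. In hardware the two units would run concurrently, with the queue absorbing the memory latency.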