This relates to the operation of cache memory in multi-processor computing units. Extensive description of cache memories may be found in A. J. Smith "Cache Memories" Computing Surveys, Vol. 14, No. 3, pp. 473-530 (September 1982); in K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, pp. 98-118, (McGraw-Hill, 1984); and in A. J. Smith, "Cache Memory Design: An Evolving Art", IEEE Spectrum, Vol. 24, No. 12, pp. 10-44 (December 1987), which are incorporated herein by reference.
A cache memory is a small, high-speed buffer memory inserted between the processor and main memory of a computer and as close to the processor as possible. The cache memory duplicates and temporarily holds portions of the contents of main memory which are currently in use or expected to be in use by the processor. Additionally, a cache memory may be inserted between main memory and mass storage.
The advantage of cache memory lies in its access time, which is generally much less than that of main memory, illustratively five to ten times less. A cache memory thus permits an associated processor to spend substantially less time waiting for instructions and operands to be fetched and/or stored, permitting a much decreased effective memory access time and resulting in an overall increase in efficiency. Illustrative memory access times for typical large, high-speed computers such as the Amdahl 580 and IBM 3090 are 200 to 500 nanoseconds for main memory and 20 to 50 nanoseconds for cache memory. The advantages obtained from use of cache memory similarly exist in medium and small computers.
Data in cache memory is arranged in the form of a plurality of block frames or lines, with a single block frame or line generally being of the same size as a block of main memory. The optimal size of a block frame, i.e., the size yielding the lowest average delay per memory reference, depends largely on cache size and access time parameters. By way of illustration, a computer system may have a cache memory block frame size of four bytes for a small 32 byte cache up to 128 bytes for a large 128 kilobyte cache. Main memory will be much larger. When it becomes necessary to update a cache with data from main memory, data within a block frame or a plurality of block frames of the cache is replaced with data from a block or blocks of the main memory.
Unfortunately, neither the computer nor the programmer can anticipate all of the data to be used presently or in the near future and therefore cannot provide ideal data to the cache. Furthermore, not all data to be used in a current process will necessarily fit within a cache. Such considerations give rise to the concept of a "hit" and, conversely, a "miss". A hit is produced when a processor references data contained within a cache, while a miss results when a processor references data not contained within a cache. In the case of a miss, the data must be accessed from main memory, provided to the cache, and then provided to the processor. Such referenced data, whether ultimately producing a hit (referenced data within cache) or a miss (referenced data in main memory only), is known as a target.
The effectiveness of the cache is measured primarily by the hit ratio "h", i.e., the fraction of targets which produce a hit, or its complement the miss ratio (1-h), as well as the mean time required to access the target if a hit occurs. The design of a computer system having a cache involves minimization of the miss ratio as well as minimization of the mean access time associated with a hit. However, in addition to the primary considerations of low miss ratios and low access times for a hit, secondary considerations should be taken into account in the design of any system incorporating a cache. Such secondary considerations include the following: reduction of main-memory access time upon the occurrence of a miss; reduction of the total information demanded in a multi-processor system so as to reduce queues at main memory; and elimination of any cache cycles lost in maintaining data coherency among multi-processor caches.
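The relationship between the hit ratio and the mean access time can be illustrated with a short calculation. The following is only a minimal sketch under a simplified model in which every hit costs the cache access time and every miss costs the main memory access time; the function name and the hit ratio of 0.95 are assumptions chosen for illustration, while the access times are taken from the illustrative ranges given above.

```python
def effective_access_time(hit_ratio, cache_ns, memory_ns):
    """Mean access time per target under a simplified two-level model.

    Assumption: a hit costs cache_ns and a miss costs memory_ns; any extra
    time needed to load the cache on a miss is ignored.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in [0, 1]")
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * cache_ns + miss_ratio * memory_ns


# Hypothetical hit ratio; access times are the ends of the illustrative ranges.
for cache_ns, memory_ns in [(20, 200), (50, 500)]:
    t = effective_access_time(0.95, cache_ns, memory_ns)
    print(f"cache {cache_ns} ns, memory {memory_ns} ns: {t:.1f} ns on average")
```

Under this model, even a small miss ratio contributes a substantial share of the mean access time, which is why both the miss ratio and the hit access time are treated as primary considerations.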
Numerous tradeoffs are encountered in any attempt to optimize the above-mentioned considerations. For example, line size, cache size, the degree of associativity, real versus virtual addressing of the cache, when to update main memory, the number of caches and the type of priority scheme among caches must all be determined.
More specifically, the line size affects the amount of delay from cache misses as well as the miss ratio. For example, as the line size increases from a minimum, the miss ratio will at first decrease due to an increased amount of data being fetched from main memory with each miss. However, as the line size further increases, the miss ratio will increase as the probability of needing the newly fetched data becomes less than the probability of reusing the information which was replaced.
The line size also affects the percentage of cache memory which can be dedicated to information storage as distinguished from address storage. For example, a cache utilizing a 64 byte line with a two byte address can store significantly more information than can a cache utilizing a 6 byte line with a two byte address. Additional considerations relate to longer queues and delays at the memory interface associated with longer lines, I/O overrun, the frequency of line crossers (memory references spanning the boundary between two cache lines) and the frequency of page crossers (memory references spanning the boundary between two pages).
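The storage comparison above can be made concrete with simple arithmetic. The sketch below assumes that each line carries exactly one address field and nothing else (no valid or modified bits), which is a simplification; the function name is hypothetical.

```python
def information_fraction(line_bytes, address_bytes):
    """Fraction of cache storage holding information rather than addresses.

    Assumption: each line stores line_bytes of information plus one address
    of address_bytes, with no other per-line overhead.
    """
    return line_bytes / (line_bytes + address_bytes)


for line_bytes in (64, 6):
    share = information_fraction(line_bytes, 2)
    print(f"{line_bytes} byte line, 2 byte address: {share:.1%} information")
```

Under these assumptions roughly 97% of the storage holds information with the 64 byte line, against 75% with the 6 byte line.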
Cache size, similar to line size, affects the miss ratio, with a larger cache having a lower miss ratio. However, as cache size is increased, rise times are also increased, thus resulting in large caches which are slightly slower than the smaller caches. Additionally, larger caches are more costly, require larger integrated circuit chips and correspondingly larger circuit board area, and require more power and therefore cooling.
The degree of associativity also affects the miss ratio and cache performance. Associativity relates to the number of information elements per set in a cache. Set associative caches map an address into a set and search associatively within the set for the correct line. A fully associative cache has only one set. A direct mapped cache has only one information element per set. Increasing the number of elements per set generally decreases the miss ratio. For example, a set size of two elements is significantly better than direct mapping; and a set size of four elements is better yet, although only by a small margin. However, increasing associativity not only produces additional delays, but is costly in both a monetary sense and the sense of silicon area requirements. In general, a large cache already having a low miss ratio will benefit more from short access times associated with direct mapping, while a small cache having a higher miss ratio will benefit more from a set associative cache.
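The mapping of an address into a set, followed by an associative search within the set, can be sketched in software. The class below is a hypothetical model, not a description of any particular hardware: the names are assumptions, and the choice to evict the least recently used element of a full set is only one of several possible policies.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model: num_lines lines grouped into sets of `ways` elements."""

    def __init__(self, num_lines, ways, line_bytes):
        if num_lines % ways != 0:
            raise ValueError("num_lines must be a multiple of ways")
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = num_lines // ways
        # One ordered map per set: tag -> None, least recently used first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        block = address // self.line_bytes
        index = block % self.num_sets   # selects the set
        tag = block // self.num_sets    # identifies the line within the set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict least recently used element
        lines[tag] = None
        return False


def miss_ratio(cache, trace):
    misses = sum(0 if cache.access(address) else 1 for address in trace)
    return misses / len(trace)


# Two addresses that collide in a direct mapped cache of eight 16 byte lines.
trace = [0, 128] * 5
print(miss_ratio(SetAssociativeCache(8, 1, 16), trace))  # direct mapped: 1.0
print(miss_ratio(SetAssociativeCache(8, 2, 16), trace))  # two elements per set: 0.2
```

With `ways` equal to `num_lines` the same class models a fully associative cache having a single set. The trace is deliberately adversarial for direct mapping and is not representative of typical programs.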
When to update main memory also affects system operation. Information in a cache that has been modified by a CPU must eventually replace the corresponding stale information in main memory. Known methods to perform such updating include write-through, in which the information in main memory is updated immediately as it is modified, and copy-back, in which the information in main memory is only updated when the line containing the corresponding modified information in the cache is replaced. See, for example, L. M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems", IEEE Transactions on Computers, Vol. C-27, No. 12, p. 1112 (December 1978); M. Dubois and F. A. Briggs, "Effects of Cache Coherency in Multiprocessors", IEEE Transactions on Computers, Vol. C-31, No. 11, p. 1083 (November 1982); A. Wilson, "Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors" (Encore Computer Corp., ETR-86-006 1986). Although write-through is generally simpler and more reliable, it generates substantial memory traffic.
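The difference in memory traffic between the two update methods can be illustrated by counting writes to main memory for a short reference trace. The sketch below rests on several assumptions that are not taken from the cited references: a small fully associative cache, least recently used replacement, allocation of a line on a write miss, and a final flush of modified lines so that both methods leave main memory up to date. All names are hypothetical.

```python
from collections import OrderedDict


def count_memory_writes(trace, policy, capacity):
    """Count main memory writes for a trace of (op, line) pairs.

    op is "r" or "w"; policy is "write-through" or "copy-back".
    """
    if policy not in ("write-through", "copy-back"):
        raise ValueError("unknown policy: " + policy)
    cache = OrderedDict()  # line -> modified flag, least recently used first
    writes = 0
    for op, line in trace:
        if line in cache:
            cache.move_to_end(line)
        else:
            if len(cache) >= capacity:
                _, modified = cache.popitem(last=False)
                if modified:
                    writes += 1      # copy-back: update memory on replacement
            cache[line] = False
        if op == "w":
            if policy == "write-through":
                writes += 1          # memory updated on every modification
            else:
                cache[line] = True   # memory updated later, on replacement
    # Assumption: flush lines still modified at the end of the trace.
    writes += sum(1 for modified in cache.values() if modified)
    return writes


trace = [("w", "A")] * 10 + [("r", "B"), ("r", "C")]
print(count_memory_writes(trace, "write-through", capacity=2))  # 10
print(count_memory_writes(trace, "copy-back", capacity=2))      # 1
```

Repeated modification of the same line is the case in which copy-back saves the most traffic; a trace that writes each line only once would show little difference between the two methods.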
The degree of success of a cache memory is attributed to, inter alia, the property of "locality". Locality has temporal as well as spatial components. Over short periods of time, a program generally distributes its memory references non-uniformly over its memory address space. Furthermore, the specific portions of the address space which are addressed tend to remain largely the same for long periods of time. Temporal locality relates to the phenomenon that data which will be required in a relatively short period of time is probably in use at the present time. Temporal locality is especially prevalent in scenarios in which both instructions and data, i.e., operands, are reused. Spatial locality relates to the phenomenon that portions of the address space which are presently in use generally consist of a relatively small number of individually contiguous segments of that address space. In other words, the loci of reference of the program in the near future are likely to be near the current loci of reference. Spatial locality is especially prevalent in scenarios involving related data items, such as arrays and variables, since they are typically stored together, and also in scenarios in which instructions are executed sequentially, which is generally true. Thus, a cache which contains data (instructions and operands) that has recently been used is likely to also contain data that will be required in a short period of time.
A significant factor affecting efficiency of computer systems having cache memory lies in the type of block frame replacement method utilized to replace block frames in the cache with blocks from main memory. Such a block-by-block replacement is necessitated whenever a miss is encountered. Not only is a fetch from main memory necessary, but a decision must be made as to which of the block frames in a cache is to be deleted and replaced by a block of main memory. Numerous block replacement algorithms have been proposed to intelligently choose which block frame is to be replaced. Illustrative of such block replacement algorithms are the random (RAND), first-in, first-out (FIFO) and least recently used (LRU) methods. Block replacement algorithms generally are implemented entirely in hardware since they must execute with high speed so as not to adversely affect processor speed.
RAND replaces a randomly chosen block frame of the cache upon the occurrence of a miss. FIFO replaces the time-wise longest resident block frame of the cache upon the occurrence of a miss. LRU replaces the least recently referenced resident block frame of the cache upon the occurrence of a miss. Although LRU is generally the most efficient, FIFO is often used in the smaller computers due to cost considerations. For a detailed analysis of an LRU implementation, attention is directed to "Computing Surveys", Vol. 14, No. 3, September 1982, pp. 498-500.
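The three replacement methods can be compared on a small reference trace with a software model. This is a sketch only: as noted above, such algorithms are generally implemented in hardware, and the function name, the trace, and the seeded random number generator are assumptions made for illustration.

```python
import random
from collections import OrderedDict


def count_misses(trace, capacity, policy, seed=0):
    """Count misses for a fully associative cache of `capacity` block frames.

    policy is "RAND", "FIFO" or "LRU".
    """
    if policy not in ("RAND", "FIFO", "LRU"):
        raise ValueError("unknown policy: " + policy)
    rng = random.Random(seed)
    frames = OrderedDict()  # resident blocks; first entry is the next victim
    misses = 0
    for block in trace:
        if block in frames:
            if policy == "LRU":
                frames.move_to_end(block)  # a hit refreshes recency
            continue                       # FIFO and RAND ignore hits
        misses += 1
        if len(frames) >= capacity:
            if policy == "RAND":
                del frames[rng.choice(list(frames))]
            else:
                # FIFO: longest resident frame; LRU: least recently referenced.
                frames.popitem(last=False)
        frames[block] = None
    return misses


trace = ["A", "B", "C", "A", "D", "A", "E", "A"]
print(count_misses(trace, 3, "LRU"))   # 5
print(count_misses(trace, 3, "FIFO"))  # 6
print(count_misses(trace, 3, "RAND"))  # depends on the seed
```

On this trace FIFO discards block A, which is still in active use, simply because it has been resident longest, whereas LRU retains it. A single short trace does not establish a general ranking of the methods.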
The same advantage of reduced memory access time that prompts the use of cache memories in a single processor system is also available in multi-processor systems. However, in such systems the use of different data streams and conventional block frame replacement algorithms almost inevitably creates a situation in which the contents of the cache memories of the different processors are all different. In such circumstances, even if the miss ratio at each cache remains within normal limits, the demands made on main memory and its output communication channel to the cache memories can be severe. As a result, average memory access time can be degraded or extraordinary measures must be taken to enhance the throughput (or bandwidth) of the main memory and its output communication channel.
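The severity of the demand on main memory can be estimated with rough arithmetic. The sketch below assumes that every miss at every cache fetches one full line from main memory and that the caches do not share any fetched data; the function name and all numeric values are hypothetical and chosen only for illustration.

```python
def main_memory_demand(processors, refs_per_second, miss_ratio, line_bytes):
    """Aggregate bytes per second requested from main memory.

    Assumption: each miss at each cache fetches one full line, and no
    fetched line is shared between caches.
    """
    return processors * refs_per_second * miss_ratio * line_bytes


# Hypothetical figures: one million references per second per processor,
# a miss ratio of 0.05 and 16 byte lines.
for processors in (1, 16, 256):
    demand = main_memory_demand(processors, 1_000_000, 0.05, 16)
    print(f"{processors:4d} processors: {demand / 1e6:.1f} megabytes per second")
```

Under these assumptions the demand grows in direct proportion to the number of processors, even though the miss ratio at each individual cache is unchanged.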
These problems are especially acute in computers where large numbers of parallel processors are operated together in processor arrays. Several such computers are commercially available. Of particular interest is the Connection Machine (Reg. TM) computer made by the present assignee, Thinking Machines, Inc. of Cambridge, Mass. This computer is described more fully in U.S. Pat. No. 4,598,400, which is incorporated herein by reference. The Connection Machine computer system comprises a central computer, a microcontroller, and an array of as many as 65,536 parallel processors in presently available embodiments. The central computer may be a suitably programmed commercially available computer such as a Symbolics 3600-series LISP Machine. The microcontroller is an instruction sequencer of conventional design for generating a sequence of instructions that are applied to the array of parallel processors by means of a thirty-two bit parallel bus.
Numerous techniques are available for interconnecting the processors of a multi-processor system to a shared memory. These include a shared bus connecting the shared memory to each processor, a hierarchical bus such as that disclosed in the above-referenced Wilson paper and numerous types of interconnection networks such as those described in C. Wu and T. Feng, Tutorial: Interconnection Networks For Parallel and Distributed Processing (IEEE 1984).