Caches
A typical computer configuration comprises a processor of some kind coupled to a memory of some kind. It is desirable to match the speeds of the processor and the memory, so that the memory can be accessed at roughly the rate at which the processor reads and writes instructions and/or data. The processor speed will often be determined by the particular application for which the system is designed, and the required memory speed will follow from the processor speed.
In addition to memory speed, the memory size will also have to be determined by the particular application for which the system is designed. The size of the memory must be large enough to accommodate all the information required by the processor for considerable periods of time. Transfers to and from other devices, e.g. hard disks, may be necessary in some systems, but it is desirable for the memory size to be large enough for such transfers to occupy a relatively small proportion of the total memory operating time. In other systems, e.g. routers in communication systems, the memory size must be large enough to store the messages passing through the system for sufficient time for them to be received, processed, and retransmitted.
It may be noted that if the main memory is of a suitable type, it may be capable of block transfers which are considerably faster than random accesses. If it is a DRAM, for example, a memory access involves selection of a row followed by the selection of a column. Once that has been done, a block or burst transfer can be achieved by retaining the row selection and simply advancing the column selection word by word. However, this fast block transfer is only possible if the words to be transferred are in a block of sequential addresses. With typical data processing, addresses are normally not accessed in sequential order; reads and writes occur in an apparently random sequence, so fast block transfer cannot normally be used for such processing.
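The difference between random and burst access can be illustrated with a rough cost model. The following is a minimal sketch only: the timing constants and function names are hypothetical, and it assumes the worst case in which every random access needs a fresh row selection.

```python
# Hypothetical timings in nanoseconds; real DRAM parts differ.
ROW_SELECT_NS = 30
COLUMN_SELECT_NS = 10


def random_access_cost(num_words):
    """Every word pays for a row selection and a column selection."""
    return num_words * (ROW_SELECT_NS + COLUMN_SELECT_NS)


def burst_access_cost(num_words):
    """One row selection, then one column selection per sequential word."""
    return ROW_SELECT_NS + num_words * COLUMN_SELECT_NS


words = 8
print(random_access_cost(words))  # 320
print(burst_access_cost(words))   # 110
```

Under these assumed numbers the burst is roughly three times faster for eight words, and the advantage disappears for a single word, where both functions give the same cost.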
A memory satisfying these two requirements of speed and sufficient capacity is generally expensive and difficult to implement. To solve the problems of memory speed and size, the use of a cache memory has become common. A cache memory has a relatively small size and relatively high speed which is matched to the speed of the processor. The cache memory is used in conjunction with the main memory and allows the speed of the main memory to be considerably less than that of the cache with only a minor adverse impact on the speed of the system.
A cache memory is effectively an associative memory which stores the addresses of the data in it along with the data itself. The data and the addresses may both include parity or other error-checking bits if desired. A cache memory system is organized so that when the processor reads or writes data, such as a word, the address of the word is passed to the cache. If the operation is a write, then the word is written into the cache along with its address. If the access is a read and the address is in the cache, then the word is read from the cache. If the access is a read and the address is not in the cache, then the word is read from main memory and written to the cache at the same time.
The efficacy of the cache depends on the fact that in most programs, many words are accessed repeatedly. Once a word has entered the cache, subsequent operations on that word are achieved by accessing the cache. Since the cache speed is matched to the processor speed, the processor runs continuously or nearly so, with few waits for memory accesses to the main memory.
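The effect of repeated accesses on system speed can be estimated with the usual weighted-average access time. This is a sketch under assumed figures: the hit count, the two access times, and the function name are hypothetical, and a miss is assumed to cost one main memory access.

```python
def average_access_ns(hits, misses, cache_ns, memory_ns):
    """Average time per access, weighting hits and misses by their counts."""
    total = hits + misses
    return (hits * cache_ns + misses * memory_ns) / total


# Assumed: 95 of every 100 accesses hit a 10 ns cache; misses cost 100 ns.
print(average_access_ns(hits=95, misses=5, cache_ns=10, memory_ns=100))  # 14.5
```

With these assumed values the average access is far closer to the cache time than to the main memory time, which is the effect the paragraph above describes.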
The simplest cache system has just a single cache, and both data words and instructions are stored in it. In some circumstances it may be more convenient to have two caches, one for data words and the other for instructions.
Cache structure
A true associative memory would be complex and expensive. A cache memory is therefore normally constructed to store, with each word, only a part of the address of that word. This partial address is called a tag. The cache is addressable in the conventional manner by the remaining part of the word address. When a cache location is addressed, the tag stored in that location is compared with the tag part of the full word address. If there is a match, i.e. a hit, the desired word is contained in the cache. The cache may contain tag comparison circuitry for comparing the tag in the desired address with the tag retrieved from the cache location, or the comparison may be performed by the processor.
Conventionally, the full address is split into a high part and a low part, with the low part being used to address the cache memory and the high part being used as the tag.
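The split of an address into a tag and a cache location can be sketched as follows. The cache size of 1024 locations and the names INDEX_BITS and split_address are assumptions chosen for illustration, not values taken from any particular system.

```python
INDEX_BITS = 10  # assumed: a cache with 2**10 = 1024 word locations


def split_address(address):
    """Return (tag, index): the high part is the tag, the low part addresses the cache."""
    index = address & ((1 << INDEX_BITS) - 1)
    tag = address >> INDEX_BITS
    return tag, index


tag, index = split_address(0x12345)
print(hex(tag), hex(index))  # 0x48 0x345

# A different address with the same low part maps to the same cache location
# and is distinguished only by its tag.
print(split_address(0x00345))  # (0, 837)
```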
Cache operation
The cache organization, as described so far, allows the processor to read words from and write words to the cache provided that their addresses are already in the cache. Obviously, however, there will be times when a required address is not in the cache. There must therefore be a mechanism for entering fresh addresses into the cache. This will involve displacing words already in the cache, so the mechanism must also ensure that such displaced words are not lost but transferred into the main memory.
When a word is written, it is convenient to write it into the cache automatically, without first checking the cache to see whether the address is already present in the cache. What is actually written into the cache is an extended word, formed by concatenating the data word with the tag part of its address. This ensures that the word is cached if its address should subsequently be accessed.
When a word is to be read, its address is passed to the cache. If that address is not in the cache, then the address has to be passed to the main memory, so that the word is read from there. As with the write, this type of read ensures that the word is cached if its address should be accessed again. It is convenient for the word being read from the main memory to be copied immediately into cache; this writing occurs in parallel with the processor receiving the word and carrying out whatever operation is required on it.
Both reading and writing can thus result in the writing of a word with a fresh address into the cache, which results in the displacement of a word already in the cache, i.e. the overwriting of the word and its tag in the cache location into which the new word is being written. To avoid losing this displaced word, the system must ensure that it is copied into the main memory before it is displaced. This can conveniently be achieved by making every write, i.e. writing of a word by the processor, a write into main memory as well as into the cache. A write thus consists of writing into cache and main memory simultaneously. A write buffer can be interposed between the cached processor and the main memory, so that the operation of the system is not delayed by the long write time of the main memory if several words have to be written in quick succession.
This solves the displacement problem, because any word displaced from the cache will either be an unchanged copy of a word which has been obtained from and is still in main memory, or will have previously been copied into the main memory.
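The read and write behaviour described in this section can be sketched as a small simulation. It is a simplified model under stated assumptions: the class name, the 16-location cache size, and the use of a dictionary for main memory are hypothetical, the write buffer is omitted, and addresses that were never written read as 0.

```python
INDEX_BITS = 4  # assumed: a tiny cache of 16 locations


class WriteThroughCache:
    def __init__(self, main_memory):
        self.main_memory = main_memory            # dict: address -> word
        self.lines = [None] * (1 << INDEX_BITS)   # each entry: (tag, word) or None
        self.hits = 0
        self.misses = 0

    def _split(self, address):
        return address >> INDEX_BITS, address & ((1 << INDEX_BITS) - 1)

    def write(self, address, word):
        tag, index = self._split(address)
        self.lines[index] = (tag, word)    # written without checking for a hit
        self.main_memory[address] = word   # every write also goes to main memory

    def read(self, address):
        tag, index = self._split(address)
        line = self.lines[index]
        if line is not None and line[0] == tag:
            self.hits += 1
            return line[1]
        self.misses += 1
        word = self.main_memory.get(address, 0)
        self.lines[index] = (tag, word)    # fill; the displaced word is safe in memory
        return word


memory = {}
cache = WriteThroughCache(memory)
cache.write(0x05, 111)
cache.write(0x15, 222)            # same cache location as 0x05, so 0x05 is displaced
print(cache.read(0x05))           # 111, recovered from main memory after a miss
print(cache.hits, cache.misses)   # 0 1
```

The displaced word is not lost because the earlier write had already placed it in main memory, which is the property the paragraph above relies on.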
Variations on this mechanism for avoiding inconsistencies between the main and cache memories may be possible.
Interaction with external systems
The system described so far has been assumed to be largely self-contained: a processor, a main memory, and a cache. In practice, however, this system will usually be only a subordinate part, a subsystem, of a larger system. In particular, the main memory of such a larger system will be accessible by other parts of the system. The system generally includes a system bus to which the main memory is coupled; the cache and the processor are coupled together and are coupled to the system bus via an interface unit, which contains a write buffer. The system bus will have various other devices coupled to it, which are called DMA (direct memory access) units. Depending on the system, the DMA units may be, for example, communications units for peripheral units.
The DMA units are so called because they can access the main memory directly, over the system bus, without involving the processor. This results in an inconsistency problem for the cache; since the contents of the main memory can be changed without the knowledge of the processor, the contents of the cache and the main memory can be inconsistent. Such inconsistent values are also called stale values.
This is not a problem as far as the DMA devices are concerned, because any changes made to the cache are copied directly into the main memory. There may in fact be a slight delay in this, because of the buffering of writes from the processor to the main memory, but this will generally not be significant. However, inconsistency between the cache and the main memory is a potentially serious problem as far as the processor is concerned.
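The asymmetry between the two directions can be shown with a deliberately simplified model. In this sketch the function names are hypothetical, the cache is reduced to a dictionary keyed by full address, and write buffering is ignored.

```python
main_memory = {0x40: 1}
cache = {}  # address -> word; simplified, no tags or displacement


def cpu_read(address):
    if address not in cache:              # miss: fetch from memory and fill
        cache[address] = main_memory[address]
    return cache[address]


def cpu_write(address, word):
    cache[address] = word
    main_memory[address] = word           # write-through


def dma_write(address, word):
    main_memory[address] = word           # bypasses the processor and its cache


print(cpu_read(0x40))      # 1
dma_write(0x40, 2)
print(cpu_read(0x40))      # 1, a stale value: main memory now holds 2

cpu_write(0x41, 7)
print(main_memory[0x41])   # 7, so a DMA unit reading memory sees the processor's write
```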
In a system with a cache and another agent, such as a DMA unit, where both the processor and the DMA unit may modify the main memory, special care has to be taken to ensure that stale data is not accessed. One method for avoiding the reading of stale data from a cache is invalidating relevant cache entries. The disadvantage of invalidation algorithms is that they incur overhead. In MIPS Computer Systems, Inc. R3000-based systems, for example, the customary algorithm incurs setup overhead to isolate the cache and drain the pipeline, extra instructions for each cache tag to specify what is to be invalidated so that the main memory will be accessed on the next read, and cleanup overhead to reconnect the cache to the main memory.
A second method for avoiding the stale data problem is forcing an uncached read. The disadvantage of an uncached read is that the cache is not updated. This is especially important when a compiler is used, since poor translation of algorithms may lead to repeated uncached access to the data, and since uncached accesses are expensive. It is also important in a system where the main memory is capable of block memory transfers and reads refer to the same memory block, even if the data is read only once, because individual uncached reads do not take advantage of this block transfer feature.
A third method of dealing with stale data is the use of bus snooping mechanisms, but these require additional, and often expensive, hardware to monitor the memory side of the cache for write operations. They also assume that there is sufficient cache bandwidth available for the snooper to invalidate or update cache lines. This extra bandwidth represents either high cost for esoteric components, if available, or a less powerful CPU implementation.
It remains desirable to have a technique for dealing with the stale data problem in a cache memory by allowing the processor to perform a cached read of fresh data where there are inconsistencies between the cache and the main memory.