A more detailed description of some of the basic concepts discussed in this application is found in a number of references, including Hennessy, John L., et al., "Computer Architecture--A Quantitative Approach" (Morgan Kaufmann Publishers, Inc., San Mateo, Calif., 1990). Hennessy's text, particularly Chapter 8, provides an excellent discussion of cache memory issues addressed by the present invention.
A cache is a relatively small high-speed memory that is used to hold the contents of most recently used blocks of main storage. It is used to bridge the gap between fast processor cycle time and slow memory access time. A cache is physically distinct from the main memory and is usually not addressable by programs. Conventional data processing systems having a cache memory system comprise a central processor unit (CPU or microprocessor), a cache memory, and a secondary memory (for example, a higher level cache memory or main memory). The processor generates an operand address which is compared, with or without translation, to the addresses in the cache memory. If there is a cache "hit", the addressed location in the cache memory is accessed. If there is a cache "miss", the address is translated and then passed to the secondary memory. The data is retrieved from secondary memory and put into the cache. This usually causes the CPU to pause until the data is available.
The success of cache memories comes from an empirical observation known as the "property of locality", which can be stated as: information in use in the near future is likely to consist of that information in current use (locality by time), and that information logically adjacent to that in current use (locality by space).
Not all objects referenced by a program need to reside in main memory. If the computer has "virtual memory", then some objects may reside on disk, or the like. The address space is usually broken into fixed-size blocks, called "pages". At any time, each page resides either in main memory or on disk. When the CPU references an item within a page that is not present in the cache or main memory, a "page fault" occurs, and the entire page is moved from the disk to main memory. The cache and main memory have the same relationship as the main memory and disk.
A memory hierarchy normally consists of many levels, but it is managed between two adjacent levels at a time. The "upper" level--the one closer to the CPU--is smaller and faster than the "lower" level. The minimum unit of information that can be either present or not present in the two-level hierarchy is called a "block". The size of that block may be either fixed or variable. If it is fixed, the memory size is a multiple of that block size. Success or failure of an access to the upper level is designated as a hit or a miss.
The memory address is divided into pieces that access each part of the hierarchy. The "block-frame address" is the higher-order piece of the address that identifies a block at that level of the hierarchy. The "block-offset address" is the lower-order piece of the address and identifies an item within a block. The size of the block-offset address is log.sub.2 (size of block). The size of the block-frame address is then the size of the full address at this level less the size of the block-offset address.
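The division described above can be illustrated with a short Python sketch. The function name split_address and the 32-byte block size used in the example are assumptions made for illustration only, and the block size is assumed to be a power of two.

```python
def split_address(address, block_size):
    """Split an address into (block-frame address, block-offset address).

    Assumes block_size is a power of two, so the offset occupies
    log2(block_size) low-order bits of the address.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    offset_bits = block_size.bit_length() - 1  # log2(block_size)
    block_frame = address >> offset_bits       # higher-order piece
    block_offset = address & (block_size - 1)  # lower-order piece
    return block_frame, block_offset


# Example: address 0x1234 with 32-byte blocks.
frame, offset = split_address(0x1234, 32)
```

With a 32-byte block the offset takes 5 bits, so the remaining higher-order bits of the address form the block-frame address.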
The CPU must have some mechanism to determine whether or not information is in the top level of the memory hierarchy. This check happens on every memory access and affects hit time. Maintaining acceptable performance usually requires the check to be implemented in hardware. The final implication of a memory hierarchy is that the computer must have a mechanism to transfer blocks between upper- and lower-level memory. If the block transfer requires tens of clock cycles, it is controlled by hardware. If it requires thousands of clock cycles, it can be controlled by software. For caches, the word "line" is often used instead of block.
Restrictions on where a block is placed create three categories of cache organization. If each block has only one place it can appear in the cache, the cache is said to be "direct mapped". The mapping is usually (block-frame address) modulo (number of blocks in cache). If a block can be placed anywhere in the cache, the cache is said to be "fully associative". If a block can be placed in a restricted set of places in the cache, the cache is said to be "set associative". A "set" is a group of two or more blocks or lines in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are "n" blocks in a set, the cache placement is called "n-way set associative".
The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity. Direct mapped is simply one-way set associative and a fully associative cache with n blocks could be called n-way set associative.
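The continuum from direct mapped to fully associative can be sketched in Python as follows. The function name candidate_positions and the eight-block cache in the example are hypothetical, and the layout in which the blocks of a set occupy consecutive positions is an assumption of this sketch rather than a requirement of any particular cache.

```python
def candidate_positions(block_frame_address, num_blocks, ways):
    """Return the cache positions in which a memory block may be placed.

    ways == 1 models a direct-mapped cache, ways == num_blocks models a
    fully associative cache, and values in between are set associative.
    """
    if ways <= 0 or num_blocks % ways != 0:
        raise ValueError("ways must evenly divide num_blocks")
    num_sets = num_blocks // ways
    set_index = block_frame_address % num_sets  # bit selection
    first = set_index * ways                    # sets stored contiguously (assumed)
    return list(range(first, first + ways))


direct = candidate_positions(12, num_blocks=8, ways=1)
two_way = candidate_positions(12, num_blocks=8, ways=2)
fully = candidate_positions(12, num_blocks=8, ways=8)
```

The same modulo rule covers all three organizations; only the number of blocks per set changes.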
Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Because speed is of the essence, all possible tags are searched in parallel. Serial searching would make set associativity counterproductive.
There must be a way to know that a cache block does not have valid information. The most common procedure is to add a "valid bit" to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address. One tag is required for each block. An advantage of increasing block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.
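A minimal Python sketch of the tag check with a valid bit follows. The names Line and lookup are assumptions for illustration. Hardware compares all candidate tags in parallel; the loop below is only a sequential stand-in for that comparison.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False  # valid bit: clear means the entry holds nothing usable
    tag: int = 0
    data: bytes = b""


def lookup(cache_set, tag):
    """Return the line in cache_set whose tag matches, or None on a miss.

    A line whose valid bit is clear can never match, whatever its tag holds.
    """
    for line in cache_set:
        if line.valid and line.tag == tag:
            return line
    return None


example_set = [Line(), Line(valid=True, tag=7, data=b"x")]
hit = lookup(example_set, 7)
miss = lookup(example_set, 0)  # first line has tag 0 but is not valid
```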
A CPU address is divided into three fields to find data in a set-associative cache. The "block-offset" field is used to select the desired data from the block. The "index" field is used to select the set. The "tag" field is used for the comparisons. While the comparison could be made on more of the address than the tag, there is no need. If total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag.
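The effect of associativity on the field widths can be checked with a small calculation. The following Python sketch is illustrative only; the function names and the example parameters (32-bit address, 8-KB cache, 32-byte blocks) are assumptions, and all sizes are assumed to be powers of two.

```python
def log2_exact(n):
    """Return log2(n) for a power-of-two n, and reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def field_widths(address_bits, cache_bytes, block_bytes, ways):
    """Return (tag bits, index bits, block-offset bits) for a cache."""
    offset_bits = log2_exact(block_bytes)
    num_sets = cache_bytes // (block_bytes * ways)
    index_bits = log2_exact(num_sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


direct_mapped = field_widths(32, 8192, 32, ways=1)
two_way = field_widths(32, 8192, 32, ways=2)
```

Doubling the associativity at constant cache size halves the number of sets, which moves one bit from the index to the tag.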
A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, they are so simple that there is no choice: only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace.
First, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudo-randomized manner to get consistent behavior.
The second approach is called the "least-recently used" (LRU) technique. To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used.
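The LRU technique for a single set can be sketched in Python as follows. The class name LRUSet is hypothetical, and the use of an ordered dictionary to record access order is a software convenience assumed for this sketch; hardware implementations record accesses differently.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with least-recently-used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag):
        """Return True on a hit; on a miss, load the tag, evicting the LRU line if full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # record the access
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # discard the least recently used line
        self.lines[tag] = None
        return False


lru = LRUSet(ways=2)
results = [lru.access(tag) for tag in (1, 2, 1, 3, 2)]
```

In this trace the access to tag 3 evicts tag 2, because tag 1 was touched more recently, so the final access to tag 2 misses.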
Because at any instant in time computers are running multiple processes, there must be a means of sharing a smaller amount of physical memory between many processes. One way to do this, called "virtual memory", divides physical memory into blocks and allocates them to different processes. With virtual memory, the CPU produces "virtual" addresses that are translated by a combination of hardware and software to "physical addresses", which can be used to access main memory. This process is called "memory mapping" or "address translation". Today, the two memory hierarchy levels controlled by virtual memory are DRAMs and magnetic disks.
There are further differences between caches and virtual memory. Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system; the longer miss penalty means the operating system can afford to get involved and spend more time deciding what to replace. The size of the CPU address determines the size of virtual memory, but the cache size is normally independent of the CPU address. In addition to acting as the lower-level memory for main memory in the hierarchy, secondary storage is also used for the file system that is not normally part of the address space. Most secondary storage is taken up by the file system.
Virtual memory encompasses several related techniques. Virtual memory systems can be categorized into two classes: those with fixed-size blocks, called "pages", and those with variable size blocks, called "segments". Pages are typically fixed at 512 to 8192 bytes, while segment sizes vary. The largest segment supported on any machine ranges from 2.sup.16 bytes up to 2.sup.52 bytes; the smallest segment is one byte.
The decision to use paged virtual memory versus segmented virtual memory affects the CPU. Paged addressing has a single, fixed-size address divided into page number and offset within a page, analogous to cache addressing. A single address does not work for segmented addresses. The variable size of segments requires one word for a segment number and one word for an offset within a segment, for a total of two words. An unsegmented address space is simpler for the compiler. Both paging and segmentation rely on a data structure that is indexed by the page or segment number. This data structure contains the physical address of the block. For paging, the offset is simply concatenated to this physical page address. For segmentation, the offset is added to the segment's physical address to obtain the final physical address.
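The two forms of translation can be contrasted in a short Python sketch. The function names, the 4096-byte page size, and the dictionary-based tables are assumptions for illustration; protection and bounds checks are omitted.

```python
PAGE_SIZE = 4096  # assumed page size, a power of two


def translate_paged(virtual_address, page_table):
    """Paging: look up the page number, then concatenate the unchanged offset."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = page_table[page_number]
    # Because offset < PAGE_SIZE, this arithmetic is a bitwise concatenation.
    return frame_number * PAGE_SIZE + offset


def translate_segmented(segment_number, offset, segment_table):
    """Segmentation: add the offset to the segment's base address."""
    base = segment_table[segment_number]
    return base + offset


paged = translate_paged(0x2ABC, {2: 7})
segmented = translate_segmented(1, 0x10, {1: 0x5004})
```

A segment base need not be aligned to any block size, which is why segmentation requires an addition while paging needs only a concatenation.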
The data structure containing the physical page addresses usually takes the form of a "page table". Indexed by the virtual page number, the size of the table is the number of pages in the virtual-address space. To reduce address translation time, computers use a cache dedicated to these address translations. This specialized cache is called a translation-lookaside buffer (TLB), or simply translation buffer.
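A TLB placed in front of a page table can be modeled in Python as follows. The class name TLB is hypothetical, and the oldest-entry-first eviction used here is an assumption of this sketch, since the replacement policy of a translation buffer is not specified above.

```python
class TLB:
    """A small cache of virtual-page to physical-frame translations."""

    def __init__(self, page_table, capacity):
        self.page_table = page_table  # virtual page number -> frame number
        self.capacity = capacity
        self.entries = {}             # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            return self.entries[vpn]
        self.misses += 1
        frame = self.page_table[vpn]          # slow path: consult the page table
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))  # assumed policy: evict oldest entry
            del self.entries[oldest]
        self.entries[vpn] = frame
        return frame


tlb = TLB({0: 5, 1: 9}, capacity=1)
frames = [tlb.translate(vpn) for vpn in (0, 0, 1, 0)]
```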
Almost all operating systems try to replace the LRU block, because that is the one least likely to be needed. To help the operating system estimate LRU, many machines provide a "use bit" or "reference bit", which is set whenever a page is accessed. The operating system periodically clears the use bits and later records them so it can determine which pages were touched during a particular time period. By keeping track in this way, the operating system can select a page that is among the least-recently referenced.
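The use-bit mechanism can be sketched in Python as follows. The class and method names are hypothetical; touch stands in for the hardware setting the bit on an access, while clear and record stand in for the periodic work of the operating system.

```python
class UseBitTracker:
    """Approximate LRU with one use bit per page."""

    def __init__(self, pages):
        self.use_bit = {page: False for page in pages}

    def touch(self, page):
        """Model of the hardware: set the use bit whenever the page is accessed."""
        self.use_bit[page] = True

    def clear(self):
        """Operating system: clear all use bits at the start of a period."""
        for page in self.use_bit:
            self.use_bit[page] = False

    def record(self):
        """Operating system: return the pages touched since the last clear."""
        return {page for page, used in self.use_bit.items() if used}

    def replacement_candidates(self):
        """Pages not touched during the period, and so among the least recently referenced."""
        return sorted(page for page, used in self.use_bit.items() if not used)


tracker = UseBitTracker(["a", "b", "c"])
tracker.clear()
tracker.touch("a")
tracker.touch("c")
touched = tracker.record()
candidates = tracker.replacement_candidates()
```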
A TLB entry is like a cache entry where the tag holds portions of the virtual address, and the data portion holds a physical page-frame number, protection field, use bit, and a dirty bit. A feature called the "dirty bit" is commonly used to reduce the frequency of writing back blocks on replacement. The dirty bit is a status bit that indicates whether or not the block was modified while in the cache. If it was not, the block is not written back, since the lower level has the same information as the cache. To change the physical page-frame number or protection of an entry in the page table, the operating system must make sure the old entry is not in the TLB. Otherwise, the system will not behave properly. Note that the dirty bit in a TLB entry means that the corresponding page is dirty, not that the address translation in the TLB is dirty, nor that a particular block in the data cache is dirty. Typically, if a page is not dirty it is just discarded and replaced with the page having the new line required by the CPU.
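The role of the dirty bit on replacement can be sketched in Python as follows. The names Block and evict, and the dictionary standing in for the lower level of the hierarchy, are assumptions for illustration.

```python
class Block:
    """A cached block with a dirty bit."""

    def __init__(self, data):
        self.data = data
        self.dirty = False  # clear: the lower level holds the same information

    def write(self, data):
        self.data = data
        self.dirty = True   # modified while in the upper level


def evict(block, address, lower_level):
    """Write the block back only if it is dirty; return True if a write occurred."""
    if block.dirty:
        lower_level[address] = block.data
        return True
    return False


lower = {0: "old"}
block = Block("old")
clean_writeback = evict(block, 0, lower)  # unmodified block: simply discarded
block.write("new")
dirty_writeback = evict(block, 0, lower)  # modified block: must be written back
```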
A conventional solution for lowering hit time is to use the physical part of the address (the page offset, which is identical in the virtual and physical addresses) to index the cache while sending the virtual address through the TLB. The limitation is that a direct-mapped cache can be no bigger than the page size. To allow large cache sizes with the 4-KB pages in the System/370, International Business Machines (IBM) Corporation (Armonk, N.Y.) uses high associativity so that it can still access the cache with a physical index. The IBM 3033, for example, is 16-way set associative with a 64 byte line size, even though studies (Smith, A. J., "Cache Memories," Computing Surveys, Vol. 14, No. 3, pp. 473-530, September 1982) show there is little benefit to miss rates above 8-way set associativity (IBM Corporation, Armonk, N.Y., "IBM 3033 Processor Complex: Theory of Operation/Diagrams Manual," IBM, Armonk, N.Y., 1978).
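The size limit follows from simple arithmetic: if the set index must be taken from the page offset, the number of sets is bounded by the page size divided by the line size, and the cache can grow only by adding ways. The following Python sketch works through the figures quoted above (4-KB pages, 64-byte lines, 16 ways); the function name is hypothetical.

```python
def physically_indexed_limit(page_bytes, line_bytes, ways):
    """Return (maximum number of sets, maximum cache size in bytes).

    Assumes the set index is taken entirely from the page offset, so it
    is available before address translation completes.
    """
    max_sets = page_bytes // line_bytes
    return max_sets, max_sets * line_bytes * ways


sixteen_way = physically_indexed_limit(4096, 64, ways=16)
direct_mapped = physically_indexed_limit(4096, 64, ways=1)
```

Under these assumptions a 16-way organization permits a cache sixteen times the page size, whereas a direct-mapped cache is limited to one page.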
One scheme for fast cache hits without the size restriction is to use a more heavily pipelined memory access where the TLB is just one step of the pipeline. The TLB is a distinct unit that is smaller than the cache to permit shorter access time. This scheme doesn't change memory latency, but relies on the efficiency of the CPU pipeline to achieve higher memory bandwidth.
Another alternative is to match on virtual addresses directly, i.e. to use the virtual addresses directly to index a cache. Such caches are termed "virtual caches". This eliminates the TLB translation time from a cache hit. Why doesn't everyone build virtually addressed caches? One reason is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the virtual cache to be flushed. One solution is to increase the width of the cache-address tag with a process-identifier tag (PID). If the operating system assigns these tags to processes, the virtual cache only needs to be flushed when a PID is recycled (the PID provides protection).
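A virtual cache whose tags are widened with a PID can be modeled in Python as follows. The class VirtualCache and its methods are hypothetical names, and a dictionary keyed by (PID, virtual block address) is assumed as a stand-in for the widened tag comparison.

```python
class VirtualCache:
    """A virtually addressed cache whose tags include a process identifier."""

    def __init__(self):
        self.lines = {}  # (pid, virtual block address) -> data

    def fill(self, pid, virtual_block, data):
        self.lines[(pid, virtual_block)] = data

    def read(self, pid, virtual_block):
        """Return the data on a hit, or None on a miss."""
        return self.lines.get((pid, virtual_block))

    def flush_pid(self, pid):
        """Remove one process's lines; needed only when its PID is recycled."""
        self.lines = {key: data for key, data in self.lines.items() if key[0] != pid}


cache = VirtualCache()
cache.fill(1, 0x40, "A")
same_process = cache.read(1, 0x40)
other_process = cache.read(2, 0x40)  # same virtual address, different PID
cache.flush_pid(1)
after_flush = cache.read(1, 0x40)
```

Because the PID takes part in the match, a process switch does not by itself require a flush, and one process cannot hit on another process's lines.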
Another reason why virtual caches are not more universally adopted has to do with operating systems and user programs that use two different virtual addresses for the same physical address. These duplicate addresses, called "synonyms" or "aliases", could result in two copies of the same data in a virtual cache. If one is modified, the other will have the wrong value. With a physical cache this wouldn't happen, since the accesses would first be translated to the same physical cache block. There are hardware schemes, called "anti-aliasing", that can guarantee every cache block a unique physical address, but software can make this much easier by forcing aliases to share some address bits.
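The synonym problem can be reproduced in a few lines of Python. In this sketch, which uses hypothetical names and a one-entry memory, two virtual pages map to the same physical page; the cache is filled through both aliases, written through one, and then read through both.

```python
def write_through_alias(cache_key):
    """Fill a cache via two aliases, write via the first, then read via both.

    cache_key(vpn, ppn) selects what the cache is indexed by: the virtual
    page number (virtual cache) or the physical page number (physical cache).
    """
    page_table = {0x1: 0x9, 0x5: 0x9}  # two virtual pages, one physical page
    memory = {0x9: "old"}
    cache = {}
    for vpn, ppn in page_table.items():
        cache[cache_key(vpn, ppn)] = memory[ppn]
    cache[cache_key(0x1, page_table[0x1])] = "new"  # write through the first alias
    return [cache[cache_key(vpn, ppn)] for vpn, ppn in page_table.items()]


virtual_result = write_through_alias(lambda vpn, ppn: vpn)
physical_result = write_through_alias(lambda vpn, ppn: ppn)
```

With virtual indexing the second alias still reads the stale value, while with physical indexing both aliases resolve to the same cache entry.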
Traditional caches use page offsets to select cache lines. Virtual-to-real address translation is typically performed at the same time using the TLB. Given a fixed page size and line size, the number of bits available for accessing the cache data array is fixed and is usually small. When technology improves, so does the size of the cache array available. This in turn forces the cache set-associativity to increase if address translation is to be overlapped.
Other systems, such as the IBM RS/6000 (Bakoglu, et al., "The IBM RISC System/6000 Processor: Hardware Overview," IBM J. Research and Development, Vol. 34, No. 1, pp. 12-22, January 1990), are forced to keep part of the virtual and real addresses identical in the data cache by software. As the cache size grows beyond 64 KB, however, keeping address bits equal in virtual and real addresses implies restricted data allocation in main memory. This would also eliminate possible locations in the cache, which can reduce the miss ratio advantage due to set-associative cache structures.
Virtual address tags in cache structures have also been proposed. Each address in a virtual address cache directory must be tagged with its process ID, or else the cache must be purged every time task switching occurs. Even so, virtual address caches have a serious problem with synonyms. Bederman (Bederman, S., "Cache Management System Using Virtual and Real Tags in the Cache Directory," IBM Technical Disclosure Bulletin Vol. 21, No. 11, p. 4541, April 1979) proposed a cache structure with both virtual and real address tags to avoid synonyms. Bederman uses the same bits to index both a real and a virtual cache. His approach is therefore limited in size and must be smaller than the page size times the associativity.
Bederman does a virtual address comparison first, but if a synonym is detected the line will be undetectable in the cache. During the next cycle, however, once the TLB has translated the virtual address to the real address (i.e., the real page number, because the offset is the same for both the real and virtual addresses; the offset is a total of 12 bits, and the least significant bits (LSBs) of the offset indicate the byte address within the line), then the real page number generated by the TLB can be compared with the real address tag in the cache to make sure that there is no synonym.
To efficiently detect synonyms and get the needed information, a closer look at cache structures is required. A similar two-level approach was proposed in Wang et al., "Organization and Performance of a Two-level Virtual-Real Cache Hierarchy," Proc. of 16th Intl. Symposium on Computer Architecture, pp. 140-148, May 1989, with a complicated L1/L2 interface, which assumes a complex L1 that is able to move cache lines within L1. Wang's cache system has an L1 cache that is capable of transferring lines from one location to another within itself. The L1 cache therefore requires rather complicated hardware.