1. Field of the Invention
The invention relates to computer architectures, and more particularly to computer architectures which employ a cache RAM.
2. Description of Related Art
Modern day computer designs frequently include a very large main memory address space which interfaces with a CPU via a cache memory. Good descriptions of the various uses of and methods of employing caches appear in the following articles: Kaplan, "Cache-based Computer Systems," Computer, 3/73 at 30-36; Rhodes, "Caches Keep Main Memories From Slowing Down Fast CPUs," Electronic Design, Jan. 21, 1982, at 179; Strecker, "Cache Memories for PDP-11 Family Computers," in Bell "Computer Engineering" (Digital Press), at 263-67.
In one form, a cache memory comprises a high speed data RAM and a parallel high speed tag RAM. The location address of each entry in the cache is the same as the low order portion of the main memory address to which the entry corresponds, the high order portion of the main memory address being stored in the tag RAM. Thus, if main memory is thought of as 2^m blocks of 2^n words each, the i'th word in the cache data RAM will be a copy of the i'th word of one of the 2^m blocks in main memory. The identity of that block is stored in the i'th location in the tag RAM. When the CPU requests data from memory, the low order portion of the address is supplied as an address to both the cache data and tag RAMs. The tag for the selected cache entry is compared with the high order portion of the CPU's address and, if it matches, the data from the cache data RAM is enabled onto the data bus. If the tag does not match the high order portion of the CPU's address, then the data is fetched from main memory. It is also placed in the cache for potential future use, overwriting the previous entry. On a data write from the CPU, either the cache RAM or main memory or both may be updated, it being understood that flags may be necessary to indicate to one that a write has occurred in the other. The use of a small, high speed cache in the computer design permits the use of relatively slow but inexpensive RAM for the large main memory space, by taking advantage of the "property of temporal locality," i.e., the property inherent in most computer programs wherein a memory location referenced at one point in time is very likely to be referenced again soon thereafter.
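The read path just described can be illustrated with a minimal software sketch. It is a toy model under stated assumptions, not a description of any particular hardware: the class name DirectMappedCache, the parameter n_bits, and the use of a Python list as main memory are all hypothetical, and writes and validity flags are omitted.

```python
class DirectMappedCache:
    """Toy model of a direct-mapped cache with parallel data and tag RAMs."""

    def __init__(self, n_bits, main_memory):
        self.n_bits = n_bits                  # number of low-order address bits
        self.size = 1 << n_bits               # 2^n cache entries
        self.data_ram = [None] * self.size    # copies of main memory words
        self.tag_ram = [None] * self.size     # None marks an unfilled entry
        self.main_memory = main_memory        # hypothetical: indexable by address

    def read(self, address):
        """Return (word, hit) for a CPU read of the given address."""
        index = address & (self.size - 1)     # low-order portion selects the entry
        tag = address >> self.n_bits          # high-order portion identifies the block
        if self.tag_ram[index] == tag:
            return self.data_ram[index], True # tag matches: cache hit
        word = self.main_memory[address]      # miss: fetch from main memory
        self.data_ram[index] = word           # overwrite the previous entry
        self.tag_ram[index] = tag
        return word, False


memory = list(range(100, 164))                # 64 words of made-up contents
cache = DirectMappedCache(3, memory)          # 8-entry cache
print(cache.read(5))    # (105, False): first reference misses
print(cache.read(5))    # (105, True): repeated reference hits
print(cache.read(13))   # (113, False): same entry, different block
print(cache.read(5))    # (105, False): the earlier entry was displaced
```

The last two reads show the cost of the direct-mapped organization: addresses 5 and 13 share the same low-order bits and therefore compete for a single cache entry.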
A cache memory architecture can be thought of as comprising three basic building blocks or modules: a unit for generating addresses (which may comprise an entire CPU), cache data and tag RAMs for storing the recently used information, and tag comparator logic for determining whether a hit or miss has occurred. In older architectures, these three modules were typically disposed on separate chips or even separate boards. This posed several problems. First, a speed penalty was incurred due to the length of the wires connecting the various chips together. This penalty is becoming more important as semiconductor memory and logic speeds increase. Second, whenever a signal is sent off-chip, the drivers are limited in their switching speed because very high currents will create too much inductive switching noise in the power supply for the remainder of the circuits to tolerate. Third, the need for many chips increases costs both because board space is expensive, and also because the total cost of many devices is greater than the total cost of a few highly integrated devices. Additionally, these older architectures were often designed to require a cache hit signal before cache data was enabled onto the data bus. Data would therefore not appear on the data bus until three delay periods were exhausted serially: the time required to read the tag RAM, the time required to compare it to the high-order portion of the address, and the time required to enable data from the data cache RAM onto the data bus.
More recently, Texas Instruments began manufacturing a chip, called the TMS2150, which includes both the cache tag RAM and the tag comparator logic together on the same chip. This chip is described in Rhodes, "Cache-Memory Functions Surface on VLSI Chip," Electronic Design, Feb. 18, 1982, at 159. The TMS2150 reduces some of the chip boundary crossings in the prior implementation, but not enough. The full memory address must still be sent out to the 2150, requiring a potentially disruptive driver for each bit. Additionally, the architecture shown as FIG. 5 of the above article continues to show data from the cache data RAMs being enabled onto the data bus only after a match is detected by the 2150.
It has also been suggested that all three of the modules described above be integrated onto the same chip. See, for example, Goodman, "Using Cache Memory to Reduce Processor-Memory Traffic," Proceedings of the 10th Annual Symposium on Computer Architecture, 6/83, pp. 124-131, at 125; VanAken, "Match Cache Architecture to the Computer System," Electronic Design, Mar. 4, 1982, at 93. Although this would eliminate all chip boundary crossings, it is not very practical for two reasons. First, the size of the cache RAMs would have to be too small to yield a reasonable hit rate. Second, it prevents the designer from taking advantage of advances in memory technology that occur during the computer design cycle. Regardless of what technology is chosen at the beginning of the design cycle, it will be outdated when the computer reaches the production stage. If the tag and data RAMs are implemented off-chip, whatever products were initially expected to fill those sockets could simply be replaced by the faster, denser, cheaper and cooler-running chips likely to be available when the computer reaches the production stage. This cannot be done if the RAMs are incorporated into the CPU chip.
Fairchild's "Clipper" chip set implements a similar type of organization. See Sachs, "A High Performance 846,000 Transistor UNIX Engine--The Fairchild Clipper," Proceedings of IEEE International Conference on Computer Design, 10/85, at 342-46 for a description. The Clipper chip set includes three chips: a CPU, an Instruction Cache And Memory Management Unit (ICAMMU) and a Data Cache And Memory Management Unit (DCAMMU). The ICAMMU integrates cache RAMs, a tag comparator and a translation lookaside buffer (discussed below) on one chip. It also integrates a copy of the CPU's program counter, so that instruction address information need be transmitted to the ICAMMU only on program branches. The Clipper implementation is similar to the fully integrated approach in that the address generating unit (the copy of the program counter) is on the same chip as the cache RAMs and tag comparator. But in order to make the cache RAMs as large as they are, the CPU had to be moved off-chip. Full virtual addresses must therefore cross a chip boundary from the CPU to ICAMMU whenever a branch takes place. Additionally, as with the fully integrated approach, a designer using the Clipper chip set cannot take advantage of the advances that occur in memory technology during the computer design cycle.
Read/write cycle times are further increased if the computer has virtual memory capability. In such computers, each of a number of different tasks addresses memory as if the other tasks were not present. In order to accomplish this, main memory is divided into blocks or "pages," one or more of which can be assigned to each task at any given time. When a task references a "virtual" memory address, the address must be translated into a real address in the proper page of main memory. Only the high order bits of the virtual address must be translated, however, since the low order bits are the same for each page. Thus, since a typical page size is 4k bytes, all but the low order 12 bits of the virtual address must usually be translated for each main memory access.
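The split between translated and untranslated address bits can be shown with a short sketch, assuming the 4k byte page size mentioned above. The function name translate and the dictionary-based page tables are hypothetical stand-ins for whatever translation mechanism a given computer uses.

```python
PAGE_SIZE = 4096                            # 4k bytes, as in the text
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 12 low-order bits pass through
OFFSET_MASK = PAGE_SIZE - 1


def translate(virtual_address, page_table):
    """Translate a virtual address using a hypothetical per-task page table."""
    virtual_page = virtual_address >> OFFSET_BITS   # high-order bits: translated
    offset = virtual_address & OFFSET_MASK          # low-order bits: unchanged
    real_page = page_table[virtual_page]            # KeyError if page is unmapped
    return (real_page << OFFSET_BITS) | offset


# Two tasks use the same virtual address but own different real pages.
task_a_pages = {0x12: 0x7}
task_b_pages = {0x12: 0x9}
print(hex(translate(0x12ABC, task_a_pages)))   # 0x7abc
print(hex(translate(0x12ABC, task_b_pages)))   # 0x9abc
```

In both results the low-order 12 bits (0xABC) are identical to those of the virtual address; only the page number changes.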
Virtual memory capability can be implemented in a cache system in any of several configurations, none of which has been altogether satisfactory. In one configuration, an address translation unit (ATU) is placed between the address generating unit and the cache memory. This configuration introduces significant overhead because every access to the cache is delayed by the time needed to go through the address translation unit. A second possibility is to set the cache length equal to or less than the virtual page length, such that only the untranslated low order address bits are needed to address it. The Clipper chip set, described above, uses this configuration. However, this usually limits cache length to a size too small to provide a reasonable hit rate. The cache size limitation can be overcome by adding set associativity (two or more tag/data RAM pairs in parallel) to the cache RAM organization, but this requires that set selection logic be added to the tag comparator logic to determine which cache data RAM to enable onto the data bus once a match is detected. This additional layer of logic further degrades performance.
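The second configuration with set associativity can be sketched as follows. This is an illustrative model only: the class name SetAssociativeCache, the round-robin replacement policy, and the fill method are assumptions not taken from the text, and addresses are treated as word addresses for simplicity. Following the usage above, a "set" here means one tag/data RAM pair. The loop in lookup plays the role of the set selection logic.

```python
class SetAssociativeCache:
    """Toy model: several tag/data RAM pairs indexed by the same low-order bits."""

    def __init__(self, index_bits, num_sets):
        self.index_bits = index_bits
        self.length = 1 << index_bits            # length of each RAM pair
        self.tag_rams = [[None] * self.length for _ in range(num_sets)]
        self.data_rams = [[None] * self.length for _ in range(num_sets)]
        self.next_victim = [0] * self.length     # round-robin replacement (assumption)

    def lookup(self, address):
        """Return the cached word, or None on a miss."""
        index = address & (self.length - 1)      # untranslated low-order bits
        tag = address >> self.index_bits
        # Set selection: find which pair, if any, holds a matching tag.
        for s, tag_ram in enumerate(self.tag_rams):
            if tag_ram[index] == tag:
                return self.data_rams[s][index]
        return None

    def fill(self, address, word):
        """Store a word after a miss, displacing one entry at that index."""
        index = address & (self.length - 1)
        s = self.next_victim[index]
        self.tag_rams[s][index] = address >> self.index_bits
        self.data_rams[s][index] = word
        self.next_victim[index] = (s + 1) % len(self.tag_rams)


cache = SetAssociativeCache(index_bits=12, num_sets=2)   # each pair spans one 4k page
cache.fill(0x1ABC, "a")
cache.fill(0x2ABC, "b")
print(cache.lookup(0x1ABC), cache.lookup(0x2ABC))   # a b
```

With two pairs, two addresses sharing the same low-order 12 bits can reside in the cache together, so total capacity is twice the page length, at the cost of the extra selection step on every lookup.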
A third possible configuration involves using virtual addresses to address a long, direct mapped (single set) cache, and translating the addresses to real addresses only when it becomes necessary to access main memory. But this has other problems which reduce its overall efficiency. First, in a multitasking environment, all the tasks usually address an overlapping group of virtual addresses, though these usually correspond to different real addresses for each task. The principle of locality continues to apply to each task individually, but it no longer applies to all tasks running together. A cache entry addressed and updated by one task is likely to be addressed and updated by a second task before the first can benefit from its presence nearby. Set associativity can be used to offset this problem, but many sets may be necessary to match the performance of a computer with a direct mapped real addressed cache. The necessary set selection logic also degrades performance.
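The loss of locality across tasks can be illustrated with a small simulation of a direct mapped, virtually addressed cache. Everything in it is a hypothetical sketch: the function name count_hits, the access traces, and in particular the choice to keep a task identifier in each tag so that one task never hits on another task's entry, which the text does not specify.

```python
def count_hits(accesses, index_bits):
    """Count hits in a direct mapped cache addressed by virtual addresses.

    accesses is a sequence of (task_id, virtual_address) pairs.
    """
    size = 1 << index_bits
    tags = [None] * size
    hits = 0
    for task_id, vaddr in accesses:
        index = vaddr & (size - 1)
        tag = (task_id, vaddr >> index_bits)   # task id in the tag (assumption)
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag                  # miss: displace the previous entry
    return hits


loop = list(range(8))                          # eight virtual addresses
pass_a = [("A", v) for v in loop]
pass_b = [("B", v) for v in loop]

alone = pass_a + pass_a                        # task A runs by itself
shared = pass_a + pass_b + pass_a + pass_b     # tasks A and B alternate

print(count_hits(alone, index_bits=4))         # 8
print(count_hits(shared, index_bits=4))        # 0
```

Running alone, task A hits on every reference in its second pass. When task B touches the same virtual addresses in between, each of A's entries has been displaced before A returns to it, so neither task ever hits.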
Second, in situations where I/O is performed in the form of direct modification of main memory contents, it is necessary to flag the corresponding cache entry, if one exists, to indicate that it no longer contains valid data. However, since the cache is accessed by virtual addresses, the real address of the memory location modified must be reverse translated to determine which cache entry, if any, corresponds to it. The schemes employed to overcome this problem add significantly to the complexity of the computer memory control logic and the software overhead