The large performance gap between CPU and main memory has made the use of a cache an important factor for any future high performance processor, whether a microprocessor, a mid-sized system, or a large system. However, a cache is valuable only if the total average time to access the cache is much less than that to access main memory for the same data. This total average time includes the usual access time when data is resident in the cache plus a weighted average time for reload on a cache "miss".
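The total average time described above can be illustrated with a short calculation. The following is a minimal sketch only; the function name and the sample figures (hit time, miss ratio, and reload time, all in processor cycles) are assumptions chosen for illustration and do not come from the text.

```python
def average_access_time(hit_time, miss_ratio, reload_time):
    """Usual access time plus the miss-weighted reload time, in cycles."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must lie between 0 and 1")
    return hit_time + miss_ratio * reload_time


# Hypothetical figures: 1-cycle hit, 5 percent misses, 20-cycle reload.
with_cache = average_access_time(hit_time=1.0, miss_ratio=0.05, reload_time=20.0)

# Halving the reload time lowers the average without touching the hit path.
faster_reload = average_access_time(hit_time=1.0, miss_ratio=0.05, reload_time=10.0)
```

Under these assumed numbers the reload term contributes as much to the average as the hit path itself, which is why the discussion concentrates on reload.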
Caches are typically designed from ordinary memory arrays and any other necessary or desired functions are done elsewhere, with complex logic as needed. Since the required functions are usually not well integrated, extra path delays, chip crossings, and additional circuits are required. The present invention teaches how to build cache array chips with simple, well integrated functions which greatly enhance the performance of the overall memory system with minimal amounts of additional hardware. The major improvement provided by this approach is the minimization of average system performance degradation due to a cache "miss" while still providing the means for fast access during a normal read/rewrite cycle.
The general approach taken in the industry to improve cache performance has traditionally been to use ordinary fast RAM chips and interface these close to the processor for speed. However, this is far from the ideal solution. The stringent and often conflicting requirements on the cache bandwidth for servicing the processor and minimizing reload time can severely limit the attainable performance.
There are a number of conflicting requirements which a cache array must fulfill in order to provide the necessary functions with high performance. Achieving these with standard array designs typically leads to a rather complex system. The complexity and resulting high costs can be substantially reduced by understanding the functions which are required and properly integrating these functions into the array chips. In order to understand this, fundamental accessing problems and typical methods of implementation must be considered. To do this, a very common type of cache organization will be used as an example, implemented with relatively simple, single-port array chips. Then a slightly more complex array chip, still single ported, will be discussed, followed by a similar structure using a true two-port array (one which can support two simultaneous accesses to different addresses). This will clearly show the external complexity required to implement the full cache array structure. It will then be shown how various parts of this complexity can be simplified by more judicious design of the cache chip. From this discussion, it will be seen how the present invention is actually an ultimate simplification of the presented approaches which can be used for micro, mini, or large computers. It will be seen that not only is a two-port array not needed, but a two-port array without other function added is inadequate as well as costly and, therefore, a poor design/performance trade-off. It is assumed in the following discussion that the path between cache and the CPU is one logical CPU word per cycle. For complex architectures where multiple arguments are fetched simultaneously on the same cycle, a two-port array may be useful, but this does not change the design issues with respect to the integrated functions for reload and performance as discussed below.
In order to achieve a low-cost, high-speed cache, designers typically use a late-select, set-associative cache organization. This provides a means of starting several possible logical functions in parallel and then deciding later in the cycle which one is correct. The correct one is then used in subsequent stages of the pipeline.
A late-select cache organized to be addressable as four-way set-associative will first be considered. Referring to FIG. 1, during a normal Read access, part of the virtual address is used to select the four possible words that could be correct, namely, the congruence class which consists of one word from each set A, B, C, D. Simultaneously, the total virtual address is translated via the translation lookaside buffer (TLB) and the cache Directory to see which set, if any, is the correct target of the access. If one is chosen, then "late" in the cycle, the correct word is enabled in one of these sets by an appropriate late-select signal and placed on the CPU-cache data bus.
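The late-select Read access just described can be sketched in software as follows. This is only a behavioral illustration: the data structures, the function name `late_select_read`, and the use of a tag comparison to stand in for the TLB and Directory lookup are assumptions, and the two steps that run in parallel in hardware are necessarily written one after the other here.

```python
SETS = ("A", "B", "C", "D")


def late_select_read(arrays, directory, congruence_class, tag):
    """Return the selected word, or None on a cache miss."""
    # Early in the cycle: fetch one candidate word from each set.
    candidates = {s: arrays[s][congruence_class] for s in SETS}

    # In parallel (in hardware): translate and search the directory.
    hit_set = None
    for s in SETS:
        if directory[s].get(congruence_class) == tag:
            hit_set = s
            break

    # Late in the cycle: enable exactly one candidate onto the data bus.
    if hit_set is None:
        return None
    return candidates[hit_set]


# Hypothetical contents: two congruence classes per set.
arrays = {s: [s.lower() + "0", s.lower() + "1"] for s in SETS}
directory = {"A": {0: 0x10}, "B": {0: 0x20, 1: 0x21}, "C": {}, "D": {}}

word = late_select_read(arrays, directory, congruence_class=0, tag=0x20)
miss = late_select_read(arrays, directory, congruence_class=1, tag=0x99)
```

The point of the organization is that the four candidate words are already waiting at the chip boundary when the directory result arrives, so the selection adds little delay.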
Typically, the data output ports of any chip are implemented with tri-state drivers so the four words can be dot-ORed together. These drivers have a data enable signal so that exactly one word is enabled and placed on the CPU-cache data bus. During a write access, a problem is encountered. Typical high-speed static FET memory chips require that the data be valid at the chip boundary before the chip access is initiated. For a late-select cache design, this presents a problem since it is desired to start the chip access in parallel with the translation, so the data cannot possibly be valid until the translation is complete. Typical caches use a read-modify-write operation which requires two cache cycles--one to read, one to modify and write back--which reduces system performance accordingly. Functionally, it is desirable to have a cache which can perform a late-write operation. This is possible to do without impacting the cache performance but requires special design of the chip.
Whenever the translation indicates a cache "miss", the cache block must be fetched from main memory and loaded into the cache. For complex instruction set computers (CISC) which require a relatively large number of processor cycles per instruction executed, there are often enough free cycles to allow this reload process to be relatively slow with a tolerable degradation on system performance. However, the trend in processor design has been to reduce the number of processor cycles per instruction executed, and such designs place severe demands on the overall memory subsystem bandwidth. For instance, published reports [see FIG. 6 of REF. 2] have shown that for a high performance processor pipeline designed to achieve an average of approximately 1.25 cycles per instruction executed, assuming an ideal memory system (e.g. infinite cache), the reload penalty for a finite cache at a typical design point can be an average of 5 percent reduction in Million Instructions Per Second (MIPS) executed by the processor for each additional cycle of reload. Thus high-performance systems typically require that the reload take place as quickly as possible. Since the memory access time is generally some fixed value, additional performance is obtained by reloading multiple words on each cycle, once the first main memory access has started.
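The benefit of a wider reload path can be made concrete with a small calculation. The sketch below is illustrative only: the block size of eight words is a hypothetical value, and applying the cited 5 percent figure linearly to more than one cycle is an assumption that should be trusted only for small differences near the cited design point.

```python
def reload_transfer_cycles(block_words, words_per_cycle):
    """Cycles needed to move one block once the memory access has started."""
    if block_words <= 0 or words_per_cycle <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division.
    return (block_words + words_per_cycle - 1) // words_per_cycle


def approx_mips_loss_percent(extra_reload_cycles, loss_per_cycle_percent=5.0):
    """Assumed linear reading of the 5 percent-per-cycle figure."""
    return extra_reload_cycles * loss_per_cycle_percent


BLOCK_WORDS = 8  # hypothetical block size, not taken from the text

narrow = reload_transfer_cycles(BLOCK_WORDS, words_per_cycle=2)
wide = reload_transfer_cycles(BLOCK_WORDS, words_per_cycle=4)
extra_cycles = narrow - wide
approx_loss = approx_mips_loss_percent(extra_cycles)
```

With these assumed values, the two-word path needs two more transfer cycles than the four-word path, which the linear reading would translate into roughly a 10 percent MIPS difference.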
However, loading multiple words into the cache on a "miss" presents several problems. First, the reload requires that all the words be placed into contiguous logical locations in the "same set", i.e. all the words go to one of the sets A, B, C, or D. For instance, suppose N words are reloaded on each cache cycle, as in FIG. 2; then all N word I/O ports must be connected somehow to one of the sets. There are some additional complicating requirements, e.g. on a "miss", the reload should start on the word that caused the "miss" so that it can be "loaded-through" to the CPU for processing in parallel with the reload. All of these requirements are very different from the normal access shown in FIG. 1 where one word from "each set" is accessed, on a word boundary. These conflicting accessing requirements create problems in cache design. A number of examples of how these requirements can be or have been met, starting with simple arrays and progressing to more sophisticated designs, will now be set forth.
First assume that a cache array chip is available which allows an array design of one CPU logical word per chip unit, where the chip unit making up the word can be one chip or several chips. The number of chips in the partitioning used to obtain a logical word is a function of the cache size and chip modularity which is not important in this discussion. In the following description, for simplicity, a "chip unit" will be bused to supply one word to or from the CPU; but it should be understood that more than one chip can be implied, and in various configurations as will be well understood by those skilled in the art.
With such a chip unit, it is possible to build a two, four, or more-way set-associative, late-select cache which can also reload multiple words on each cache cycle (e.g. two, four, or more words per cycle). The manner in which this is achieved is a function of the relation of the required overall cache capacity and number of bits per chip that the technology can provide, i.e. the modularity of the chips and system. In order to understand the problems and trade offs, assume that a single port chip unit with a one word I/O port is being considered. Further, assume that the desired total cache capacity and available chip unit density are such that a total of eight chip units are required. If the set associativity is four-way then these eight chip units will map to two chip units per set of the associativity, as shown in FIG. 3. In such a case, it is possible to reload a maximum of two words per reload cycle since each set has two chip units and hence two independent I/O ports available to main memory. On a normal CPU access, the word address bit will access one of the two rows of chip units and one word from each of these four chip units will be accessed and held at the edge of the chip unit, one word for each of the four sets, i.e. the congruence class. The late-select signal will select one of the four, and it will be placed on the CPU data bus via a CPU multiplexer. This multiplexer is obviously necessary since the I/O lines out of each chip cannot be dot-ORed except as shown, even though they are from tri-state drivers. The reason for this is because, for reload, two separate words traverse between main memory and the array, one word for each row. Thus the off-chip multiplexer is necessary.
Of course, one serious drawback is that this structure cannot support a simultaneous reload and a CPU access because of the one-port design of the chip units. This results in access interferences and degradation of performance which can be reduced by a separate interface to main memory for reload. However, another limitation which is not a result of the one-port design is that this configuration cannot support any more than two words of reload per cycle. For instance, if a reload of four words per reload cycle were desirable, such a configuration could not support this, even if the chip units were two-ported arrays with one of the ports used for a separate bus to main memory for reload. This results from the fact that each set A, B, C, or D is contained on only two chip units with one word I/O per chip unit or two words maximum per set for reload. Thus a simple way to increase the reload path width would be to add additional chip units. The use of sixteen chip units with four per set would provide a four-word reload path as desired. However, the cache capacity has been doubled, which increases the cost, package size and delay, and is not typically acceptable.
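The relation between chip-unit count and reload width in this conventional organization reduces to a simple division, sketched below. The function name and parameter names are assumptions introduced for illustration.

```python
def max_reload_words_per_cycle(total_chip_units, associativity, words_per_chip_io=1):
    """Reload width when each set is confined to its own chip units."""
    if total_chip_units % associativity != 0:
        raise ValueError("chip units must divide evenly among the sets")
    chip_units_per_set = total_chip_units // associativity
    return chip_units_per_set * words_per_chip_io


# Eight chip units, four-way set-associative: two words per reload cycle.
eight_units = max_reload_words_per_cycle(8, 4)

# Sixteen chip units give the four-word path, at twice the cache capacity.
sixteen_units = max_reload_words_per_cycle(16, 4)
```

Note that the number of array ports does not appear in the calculation: a second port removes the CPU/reload interference but leaves the per-set width unchanged.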
The fundamental design problem is that technology improvements increase the array bit density per chip faster than the system requirements increase the cache capacity. The net result is that the number of chip units per system has greatly decreased with time, and this trend will probably continue in the industry. Thus the designs of the "next" system typically have fewer chip units available and a potentially smaller reload path. For instance, suppose that for the next generation design of the previously described cache, the chip density is increased by a factor of four while the cache capacity is increased by a factor of two. Hence, only four chip units are required instead of eight, so the organization would only provide a one word reload path which is very undesirable. Obviously, if the I/O path of each chip unit were increased from one to two words, a two-word reload path would be possible. However, this is achieved at considerable expense since only a one-word path to the CPU is desired, thus wasting the most important parameter, namely, cache bandwidth. Other solutions are possible as will be seen below.
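The next-generation example above can be worked through numerically. This is a sketch under the growth factors stated in the text (density up by four, capacity up by two); the function names are hypothetical.

```python
def next_generation_chip_units(current_units, capacity_growth, density_growth):
    """Chip units needed after capacity and per-chip density both grow."""
    scaled = current_units * capacity_growth
    if scaled % density_growth != 0:
        raise ValueError("result is not a whole number of chip units")
    return scaled // density_growth


def reload_width(chip_units, associativity, words_per_chip_io=1):
    """Reload words per cycle when each set occupies its own chip units."""
    return (chip_units // associativity) * words_per_chip_io


units_now = 8
units_next = next_generation_chip_units(units_now, capacity_growth=2, density_growth=4)

width_now = reload_width(units_now, associativity=4)
width_next = reload_width(units_next, associativity=4)

# Widening each chip unit's I/O to two words restores the two-word path.
width_next_wide_io = reload_width(units_next, associativity=4, words_per_chip_io=2)
```

The reload path shrinks from two words to one even though the cache has become larger, which is the trend the paragraph identifies.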
In the previously described design, there are a total of eight chip units so that during a reload, there are six potential I/O ports which are sitting idle. These could be used for increasing the reload path width if it were possible to spread the words of each set over each chip unit. Then during reload, one word can be reloaded to each chip unit for a maximum of eight possible words reloaded per cycle in this case. However, this improvement requires a special type of mapping of the logical cache blocks (the replaceable unit) to the physical array structure, sometimes referred to as "Latin Square mapping" [references 3 and 4]. This mapping, using the cache chip unit described previously for a four-way set-associative, late-select cache design would require only four chip units. (Additional groups of four could be added above these with appropriate interfaces). The need for this rather complex mapping arises from the lack of an adequate reload interface coupled with the small number of chip units needed for a typical cache.
Stated generally, "Latin Square mapping" would be utilized in the following manner to achieve multi word cache reload. During normal accesses, the same address is applied to all chip units in order to access the congruence class, composed of the corresponding word from each set, e.g. word one from each set A, B, C, and D. Thus these four words must be at the same address on each chip unit. Likewise with words 2, 3, 4, etc. However, during reload, only one word can be written into each chip unit and it is desired to reload, for example word A0, A1, A2, and A3 on the same cycle. Obviously this can be done only if each of these words is on a different chip unit, as shown schematically in FIG. 4. Likewise with all other groups of four words.
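One mapping with the two properties just described is a cyclic Latin square, sketched below. The particular rule used here, chip = (set + word) mod 4, is an assumption for illustration; the arrangement actually shown in FIG. 4 may differ.

```python
N = 4  # four sets (A..D) and four chip units


def chip_for(set_index, word_index, n=N):
    """Assumed cyclic mapping of (set, word) to a chip unit."""
    return (set_index + word_index) % n


def is_latin_square(n=N):
    """Check that every set and every congruence class spans all chips."""
    all_chips = list(range(n))
    # Reload: the n contiguous words of one set land on n different chips.
    sets_ok = all(
        sorted(chip_for(s, w, n) for w in range(n)) == all_chips
        for s in range(n)
    )
    # Normal access: word w of each set lands on n different chips.
    classes_ok = all(
        sorted(chip_for(s, w, n) for s in range(n)) == all_chips
        for w in range(n)
    )
    return sets_ok and classes_ok


# Chips holding words A0..A3 and B0..B3 under the assumed mapping.
set_a_chips = [chip_for(0, w) for w in range(N)]
set_b_chips = [chip_for(1, w) for w in range(N)]
```

Because each row and each column of the table uses every chip exactly once, both the congruence-class access and the four-word reload can proceed with one word per chip unit.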
The Latin Square mapping provides the proper distribution of words on the chip units, but two problems are encountered during reload. First, contiguous words of any given block are stored at a different address on each chip unit, hence each chip unit must receive a different (partially different) address. Since the starting address will depend on which set is being reloaded, the addressing logic and bus for reload are much more complex. A further complication arises in that a given word from main memory, e.g. word one, can reside on any of the chip units, depending on the set being reloaded, hence a ring-shift data aligner is required between the cache and the main memory. Additional complexity is introduced in the late-select logic. Since the words of any set can be on any chip unit, the late-select signal must choose not only the set, but must also match the appropriate word and set according to the Latin Square mapping, in order to enable the correct chip unit. A final complication arises from the single word data path I/O of each chip unit. During normal access, the four words must merge into one word to/from the CPU. On reload, the four words must be separate, allowing one word I/O to each chip unit. For the assumed chip unit described previously, this would require an off-chip multiplexer of some sort. While there are many ways and places for providing this function, it must be somewhere. If this function is placed on a separate chip as in FIG. 4, the extra chip crossing and multiplexer logic delay are added in the most critical access path, which is extremely undesirable. Ideally, this multiplexer function should be done on the existing chips and the delays should be overlapped with one another. The functionally integrated chip architecture proposed by the present invention totally eliminates such multiplexer and delay, as will be seen subsequently.
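The per-chip addressing and the ring-shift alignment required on reload can be sketched as follows. The cyclic mapping chip = (set + word) mod 4 and the function name `reload_plan` are assumptions made for illustration; only the need for distinct addresses and a rotation of the data comes from the discussion above.

```python
def reload_plan(set_index, base_address, words, n=4):
    """Return, for each chip unit, the (address, data) pair to write."""
    if len(words) != n:
        raise ValueError("expected exactly n words per reload cycle")
    plan = [None] * n
    for word_index, data in enumerate(words):
        # Assumed cyclic Latin square: the set index rotates the chips.
        chip = (set_index + word_index) % n
        # Word w of every set shares one address, so each chip unit
        # receives a different address during a reload.
        plan[chip] = (base_address + word_index, data)
    return plan


incoming = ["w0", "w1", "w2", "w3"]  # words in main-memory order

# Reloading set A (index 0): no rotation is needed.
plan_a = reload_plan(0, base_address=0, words=incoming)

# Reloading set B (index 1): data is ring-shifted by one chip unit.
plan_b = reload_plan(1, base_address=0, words=incoming)
```

In the set B case the word arriving first from main memory is steered to the second chip unit, and the last word wraps around to the first, which is the job of the ring-shift data aligner.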
The additional circuitry required by this organization for accessing is only one aspect of the total problem. Another problem is that even though the reload bandwidth has been improved, it is still far from ideal. Since there is only one I/O port on each chip, only one access, either for a normal CPU cycle or for a reload cycle, is possible on each system clock. A "miss" and subsequent reload typically starts at the word causing the "miss"; this word is immediately loaded-through to the CPU and the CPU resumes processing. If the next CPU cycle requires a cache access, either the CPU or the reloading must wait, with appropriate logic controls for sensing and restarting. Regardless of which alternative is chosen, either CPU-wait or reload-wait, the overall system performance is degraded. Further degradation is encountered from the same access interference problem if a store-in cache is used. The latter means that the cache contains the latest copy of the correct data so that if any changes have been made to a block, it must first be written back to main memory before it can be removed from the cache. With high performance systems, a store-in cache is a better cost/performance design; hence this produces many more opportunities for access interference and degradation. There are a number of ways in which the Latin Squares mapped chip may be improved. The first involves simple additions to the cache chip.
In order to minimize critical path electrical delays as well as simplify the overall busing, a few simple functions can be added to the cache chips without disturbing the array design, i.e., the functions are added entirely on the periphery of the array boundary so that the one-port array structure remains intact. The external boundary of the chip unit is assumed to have a separate one-word bus to the CPU and another one-word bus to the main memory, but there is only one path into the internal array. This is achieved by placing the multiplexer function (MUX boxes in FIG. 4) on the chip unit. Additional improvements are obtained by also including the "load-through" path, all of which help remove some of the limitations of the above organization. Since the storage array itself is still only a one-port design, this will minimize cost and maximize bit density. The small amount of multiplexing necessary to achieve this may be included on-chip without much difficulty. The functional structure of such a chip unit is illustrated in FIG. 5 (ignore the Store Back Buffer for this discussion). It should be noted that the ring shift aligner described above required for providing the proper addresses to the chips is not on the chip but is part of the main memory interface. In addition, if a Load-Thru Buffer is also added as shown in FIG. 5, additional improvement can be attained in some limited cases. For instance, if a "read" access to word A0 causes a "miss" and each chip has a Load-Thru Buffer, then the four words A0, A1, A2, and A3 are loaded both into the array and the buffer. Thus these words are available from these buffers on subsequent cycles. Note that since four words are reloaded each cycle, only the first group of four would be loaded into the Load-Thru Buffers, the subsequent reloading words go only to the array. If sufficient logic is included to identify these first four words, any of them could be loaded-through to the CPU via the load-through path as needed. 
On the next cycle after loading-through word A0 if the CPU accesses word A1, A2, or A3, it can be fetched from the load-through buffer on the appropriate chip, without interfering with the reload of the second group of four words from main memory into the cache. Of course, if the CPU writes to any of these words, or if the access is to a word other than one of any of the Load-Thru Buffers, then an interference is encountered. Such Load-Thru Buffers can be valuable for instruction fetches, which tend to be sequential; since data fetches tend to be more random, the Load-Thru Buffer will be of some, but limited value. Even for sequential instruction fetches, the Load-Thru Buffer does not necessarily eliminate interference between reloading and the CPU accesses. For instance, suppose the word causing the "miss" was A3. A subsequent reload will put A0, A1, A2, and A3 into the Load-Thru Buffer. A sequential instruction fetch will next access A4 which is not in any Load-Thru Buffer and must wait for the next reload cycle and access to the array itself. The next instruction fetch toward A5 will not have this word in the Load-Thru Buffer, hence an interference. Of course, complex logic could be used to latch the second set of four reloading words into the Load-Thru Buffers in this case, but the cost is high and the reward is small. The functionally integrated cache of the present invention is a significantly better design, as will be seen. In all cases, if access is permitted to partially loaded blocks, this will require word-valid flags in logic in the CPU to know which words are accessible.
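The two sequential-fetch cases above can be compared with a small model. This is a simplified sketch: it assumes aligned groups of four words, that only the first reloaded group is latched into the Load-Thru Buffers, and that the CPU only reads; the function names are hypothetical and timing is not modeled.

```python
def buffered_words(miss_word, group=4):
    """Word indices latched into the Load-Thru Buffers on the first reload."""
    start = (miss_word // group) * group
    return set(range(start, start + group))


def sequential_fetch_conflicts(miss_word, fetches, group=4):
    """For each following sequential read, report (word, interferes)."""
    latched = buffered_words(miss_word, group)
    first = miss_word + 1
    # A read outside the latched group must go to the one-port array,
    # which is busy with the reload, so it counts as an interference.
    return [(w, w not in latched) for w in range(first, first + fetches)]


# Miss on A0: A1, A2, and A3 are served from the buffers.
after_a0 = sequential_fetch_conflicts(miss_word=0, fetches=3)

# Miss on A3: A4 and A5 are not in any buffer.
after_a3 = sequential_fetch_conflicts(miss_word=3, fetches=2)
```

The model shows that the value of the buffers depends entirely on where in the group the miss falls, which is why they help only in limited cases.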
It can be seen from the above discussions that the interface between cache and main memory is quite different from that between cache and CPU. Since these two interfaces are best satisfied with two buses operating with different addresses, it would seem appropriate to use an array which is truly two-ported, allowing two simultaneous random accesses to the array. While this is possible, it will be seen that this provides more than what is required for some of the problems, and not enough to solve all the problems; in other words, it is not the ideal solution insofar as cost for improved performance is concerned.
The cache organized using a two-port chip with one port interfaced to main memory for reload and the other interfaced to the CPU for normal accesses, as shown in FIG. 6, will now be considered. For reload, a separate ring-shift data aligner and shift logic, plus a separate address bus per chip unit with address logic, are still required much as before. Since each I/O port has a separate address input, the CPU address can be separate, with one bus to all chip units. The two ports eliminate the need for separate multiplexers on the data bus; they are built into the additional complexity of the cells and separate word/bit lines and decoders. The Load-Thru Buffers are no longer needed since once a word has been loaded, random access is available via the CPU port, if logic is retained in the CPU for specifying which words have been reloaded. However, the load-through path may still be needed, even though a two-ported array is used. The reason is that a two-port cell design is considerably simpler if a write and simultaneous read are not permitted to the same cell, i.e., the same word is not read and written simultaneously. If this is the case, then either a separate load-through path is required, or an extra cycle of delay is encountered before the CPU can restart. In addition, the logic for comparing the addresses for the two ports and granting access must be done in the CPU; the cache is a slave and will produce errors if used improperly. The store-back of modified blocks, which added to the reload time as described previously, is no different for the two-port organization. Considering that a two-port cell/array design itself, without including the additional drivers, decoders, and other logic which is necessary, consumes approximately thirty to fifty percent more area than a one-port design and will be slower, an enormous price is paid and very little is gained in return. Thus this type of two-port cache design is definitely not a good choice.
It is clear from the above discussion that all of the currently known cache architectures which attempted to solve various cache delay/interference problems suffer in a number of respects as outlined above. What is provided by the present invention is an improved cache architecture wherein significant functional capability can be mounted directly on chip with the cache storage units without significantly interfering with the storage cell design itself. Such a design should have significantly improved reload capabilities as well as improved store-back capabilities.