In general, random access memory (RAM) integrated circuits (ICs) operate far more slowly than advanced microprocessors. Designers of microprocessors have discovered that slow RAM access times present a significant impediment to increasing processor throughput and the speed of program execution. For example, advanced reduced instruction set computer (RISC) microprocessors may use a clock speed of 20 MHz or more so that theoretically 20 million instructions per second are executed. However, typical RAM cycle time (response time) is 150 nanoseconds. Therefore, when a microprocessor needs to retrieve a new data value or instruction from RAM, many microprocessor cycles can be wasted while the microprocessor waits for the RAM to respond.
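The figures above can be turned into a rough count of wasted cycles. The following is a minimal sketch, assuming one instruction per clock cycle and a processor that stalls completely during a RAM access; the function name `stall_cycles` is hypothetical and used only for illustration.

```python
import math


def stall_cycles(clock_hz: float, ram_cycle_ns: float) -> int:
    """Processor clock cycles spanned by one RAM access, rounded up.

    Assumption: the processor does no useful work while it waits.
    """
    cpu_cycle_ns = 1e9 / clock_hz  # 20 MHz gives a 50 ns clock period
    return math.ceil(ram_cycle_ns / cpu_cycle_ns)


# Figures from the text: a 20 MHz clock and a 150 ns RAM cycle time.
print(stall_cycles(20e6, 150.0))  # prints 3
```

Under these assumptions, each RAM access spans three clock periods in which up to three instructions could otherwise have executed.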
To overcome this bottleneck, microcomputers and microprocessors use cache memories to improve memory access time. Cache memory is similar to virtual memory in that a high-speed cache memory stores a duplicate copy of an active portion of a low-speed RAM. Typically cache memory is located on the microprocessor chip so that access time is four to twenty times faster than main memory.
In operation, when a memory request is generated by the microprocessor, the request is presented to the cache memory, and if the cache cannot respond, the request is then presented to main memory. When the microprocessor attempts to access an item that is not in the cache, but is resident in main memory, a "cache miss" occurs. In response to a cache miss, the cache is updated by loading the needed data from main memory into the cache. The data is then fed from the cache to the microprocessor. The line in the cache that does not contain the desired data or instruction, and that is replaced to make room for the needed data, is called a "victim" line.
The time available for updating the status of a cache during a cache miss is minuscule. Therefore, caches are controlled by hardware that can process cache misses automatically within the required time. Unfortunately, in prior art circuits every cache miss requires updating the cache by accessing main memory, which significantly slows microprocessor throughput. Therefore, one goal of the present invention is to improve throughput by reducing the number of main memory accesses occurring after a cache miss.
Caches have been constructed in three principal types: direct-mapped, set-associative, and fully-associative. Details of the three cache types are described in the following prior art references, the contents of which are hereby incorporated by reference: De Blasi, "Computer Architecture," ISBN 0-201-41603-4 (Addison-Wesley, 1990), pp. 273-291; Stone, "High Performance Computer Architecture," ISBN 0-201-51377-3 (Addison-Wesley, 2d Ed. 1990), pp. 29-39; Tabak, "Advanced Microprocessors," ISBN 0-07-062807-6 (McGraw-Hill, 1991) pp. 244-248. These references are well known to those skilled in the art.
In all three types of caches, an input address is applied to comparison logic. Typically a subset of the address, called the tag bits, is extracted from the input address and compared to the tag bits of each cache entry. If the tag bits match, corresponding data is extracted from the cache. The general structure and processing of a direct-mapped cache are shown in FIG. 1. The direct-mapped cache includes a cache store 10 which can be implemented as a table comprising a plurality of tags 12 and data elements 14. Tags and data are accessed as a pair. An input address 20 is fed from a microprocessor to an address decode circuit 30 which separates tag bits from the input address 20. The tag bits are fed as a first input 42 to a comparator 40. The comparator also receives a second input 44 which comprises tag bits from the cache store 10 at a location pointed to by the low order bits of the input address. Thus, the low order input address bits point to a unique tag in the cache store. If a match is found by the comparator 40, then the comparator activates (or "asserts") its hit output 60, causing a data select circuit 70 to read a data element 14 from the cache store. Since the tags and data elements are arranged in pairs, the data select circuit receives the data element corresponding to the matched tag. The selected data is fed as output 80 from the cache memory to the microprocessor for further processing.
If no match is found between the first input 42 and the location in the cache store pointed to by the low order bits of the input address, the comparator asserts its miss output 50. This triggers miss processing (represented by block 55) which, in most prior art devices, requires accessing main memory.
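The hit and miss paths described for FIG. 1 can be modeled in software. The following is a minimal sketch, assuming a word-addressed cache that holds one word per line and a main memory modeled as a dictionary; the class `DirectMappedCache` and its attribute names are hypothetical and do not correspond to the reference numerals of FIG. 1.

```python
class DirectMappedCache:
    """Software model of a direct-mapped cache (illustrative only)."""

    def __init__(self, num_lines, main_memory):
        # Assumption: the line count is a power of two, so the low order
        # address bits select a line and the remaining bits form the tag.
        assert num_lines > 0 and num_lines & (num_lines - 1) == 0
        self.num_lines = num_lines
        self.index_bits = num_lines.bit_length() - 1
        self.tags = [None] * num_lines   # tags and data are stored as pairs
        self.data = [None] * num_lines
        self.main_memory = main_memory
        self.hits = 0
        self.misses = 0

    def read(self, address):
        index = address & (self.num_lines - 1)  # low order bits pick the line
        tag = address >> self.index_bits        # high order bits are the tag
        if self.tags[index] == tag:             # the single tag comparison
            self.hits += 1
            return self.data[index]
        # Miss processing: fetch from main memory and replace the line.
        self.misses += 1
        value = self.main_memory[address]
        self.tags[index] = tag
        self.data[index] = value
        return value


memory = {address: address * 10 for address in range(64)}
cache = DirectMappedCache(8, memory)
cache.read(5)    # miss: line 5 is empty
cache.read(5)    # hit: tag matches
cache.read(13)   # miss: address 13 also maps to line 5 and replaces it
print(cache.hits, cache.misses)  # prints 1 2
```

In this model every miss reaches main memory, which mirrors the behavior attributed above to most prior art devices.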
In general, direct-mapped caches provide the fastest access but require the most time for comparing tag bits. Fully-associative caches provide fast comparison but consume more power and require more complex circuitry.
In the prior art, caches have been susceptible to "thrashing." Thrashing results when the microprocessor repeatedly seeks a desired data item, fails to find it, updates the cache with the item, and then later replaces the line with a different item. This causes a cycle of repeated cache misses relating to the same data items.
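Thrashing in a direct-mapped cache can be reproduced with a short trace. The following is a minimal sketch, assuming a word-addressed cache with one word per line; the function `count_misses` and the example addresses are hypothetical.

```python
def count_misses(trace, num_lines):
    """Count misses for a sequence of addresses in a direct-mapped cache."""
    tags = [None] * num_lines
    misses = 0
    for address in trace:
        index = address % num_lines   # line selected by the low order bits
        tag = address // num_lines    # remaining bits identify the item
        if tags[index] != tag:
            misses += 1
            tags[index] = tag         # the new item replaces the old line
    return misses


# Addresses 0 and 8 both map to line 0 of an 8-line cache, so each access
# replaces the item that the next access needs.
print(count_misses([0, 8] * 4, 8))  # prints 8
# Addresses 0 and 1 map to different lines and miss only on first use.
print(count_misses([0, 1] * 4, 8))  # prints 2
```

The first trace misses on all eight accesses even though it touches only two items, which is the repeated-miss cycle described above.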
Avoidance of thrashing is usually accomplished by using a large set-associative cache. However, such caches have several disadvantages. First, reading and writing to the cache occur in a separate machine cycle executed after all cache tag compare operations, which increases cache access time. This also reduces microprocessor performance by lengthening machine cycle time or requiring delay cycles. Second, power dissipation in the cache increases significantly, because in an n-way set-associative cache, n times as many words have to be read before a selection is made for a given word based on tag compares. Third, cache operation is highly susceptible to circuit switching noise when multiple sense amplifiers trigger simultaneously. This susceptibility is acute in machines with word sizes of 32 bits or more, as well as in machines with multi-port caches in which multiple words can be simultaneously read through a plurality of cache ports.
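The trade-off between thrash avoidance and the number of words read can be illustrated with a small model. The following is a minimal sketch, assuming least-recently-used replacement within each set (the text does not specify a replacement policy) and assuming that all n ways of the selected set are read on every access; the function `simulate_set_associative` is hypothetical.

```python
def simulate_set_associative(trace, num_sets, ways):
    """Return (misses, words_read) for an n-way set-associative cache."""
    # Each set holds up to `ways` tags, least recently used first.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    words_read = 0
    for address in trace:
        index = address % num_sets
        tag = address // num_sets
        lines = sets[index]
        # Assumption: every way is read before the tag compares select one.
        words_read += ways
        if tag in lines:
            lines.remove(tag)       # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)        # evict the least recently used tag
        lines.append(tag)
    return misses, words_read


trace = [0, 8] * 4
# Both configurations hold 8 lines in total.
print(simulate_set_associative(trace, num_sets=8, ways=1))  # prints (8, 8)
print(simulate_set_associative(trace, num_sets=4, ways=2))  # prints (2, 16)
```

In this model the two-way organization removes the repeated misses but reads twice as many words for the same trace, consistent with the second disadvantage noted above.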
Thus, one goal of the present invention is to provide a cache with thrash avoidance without using a set-associative implementation.
A prior researcher has discussed use of a victim cache or miss cache as a backup to a primary direct-mapped cache. An architectural definition of secondary caches (termed "victim/miss caches") to avoid thrashing and improve hit ratio is disclosed in N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proceedings of the 17th Annual Int'l Symposium on Computer Architecture, IEEE Computer Society Press, May 1990, pp. 364-373, the contents of which are hereby incorporated by reference. However, Jouppi fails to teach a VLSI implementation of secondary caches, and also fails to teach integration of primary and secondary caches using shared bitlines, sense amplifiers, and bus interface logic.
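The general idea of backing a direct-mapped cache with a small victim cache can be modeled in software. The following is a minimal sketch loosely following the architectural concept described by Jouppi, not the circuit of the present invention. It assumes first-in, first-out replacement in the victim cache and stores the full address in place of a tag for simplicity; the class `VictimBackedCache` and its attribute names are hypothetical.

```python
from collections import OrderedDict


class VictimBackedCache:
    """Direct-mapped cache with a small fully-associative victim cache."""

    def __init__(self, num_lines, victim_entries, main_memory):
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # each entry: (address, value)
        self.victims = OrderedDict()      # address -> value, oldest first
        self.victim_entries = victim_entries
        self.main_memory = main_memory
        self.memory_accesses = 0

    def read(self, address):
        index = address % self.num_lines
        resident = self.lines[index]
        if resident is not None and resident[0] == address:
            return resident[1]                  # hit in the primary cache
        if address in self.victims:
            value = self.victims.pop(address)   # hit in the victim cache
        else:
            value = self.main_memory[address]   # miss in both caches
            self.memory_accesses += 1
        if resident is not None:
            # The displaced (victim) line moves into the victim cache.
            self.victims[resident[0]] = resident[1]
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)  # drop the oldest entry
        self.lines[index] = (address, value)
        return value


memory = {address: address * 10 for address in range(64)}
cache = VictimBackedCache(8, 2, memory)
for address in [0, 8] * 4:   # both addresses map to line 0
    cache.read(address)
print(cache.memory_accesses)  # prints 2
```

For the alternating trace, only the first access to each address reaches main memory; later conflicts are served by swapping lines between the two caches.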
In the past, cache memory has also been used for caches of instructions rather than data. In the prior art, instruction caches, instruction buffers, and branch target buffers have been implemented either on separate chips, or as separate modules in a single chip. Unfortunately, these prior art approaches require a large amount of bus circuitry to interconnect the chips or modules. This requires more area (either board space or silicon area) which increases system cost. Moreover, cache access time is increased significantly as a result of the need to propagate signals through additional gates and buffers.