A preferred embodiment of the present invention is incorporated in a superscalar processor identified as "R10000," which was developed by Silicon Graphics, Inc., of Mountain View, California. Various aspects of the R10000 are described in U.S. Ser. Nos. 08/324,128, 08/324,129 and 08/324,127, all incorporated herein by reference for all purposes. The R10000 is also described in J. Heinrich, MIPS R10000 Microprocessor User's Manual, MIPS Technologies, Inc. (1994).
A microfiche appendix containing one sheet and forty-eight frames is included as Appendices I and II to this application and is hereby incorporated by reference in its entirety for all purposes. The microfiche appendix is directed to Chapters 16 and 17 of the design notes describing the R10000 processor.
This invention relates in general to computers and, in particular, to cache memory.
CPU designers, since the inception of computers, have been driven to design faster and better processors in a cost-effective manner. For example, as faster versions of a particular CPU become available, designers will often increase the CPU's clock frequency as a simple and cost-effective means of improving the CPU's throughput.
After a certain point, the speed of the system's main memory (input/output) becomes a limiting factor as to how fast the CPU can operate. When the CPU's operating speed exceeds the main memory's operating speed, the CPU must issue one or more wait states to allow memory to catch up. Wait states, however, have a deleterious effect on the CPU's performance. In some instances, one wait state can decrease the CPU's performance by about 20-30%.
Although wait states can be eliminated by employing faster memory, faster memory is very expensive and may be impractical.
Typically, the price difference between a memory chip of one speed grade and the next faster grade ranges from 50% to 100%.
Thus, the cost can be quite prohibitive, especially for a system requiring a large memory.
A cost-effective solution has been to provide the CPU with a hierarchical memory consisting of multiple levels of memory with different speeds and sizes. Since the fastest memories are more expensive per bit than slower memories, they are usually smaller in size. This smaller memory, referred to as a "cache", is located close to the microprocessor or even integrated into the same chip as the microprocessor.
Conceptually, the memory controller retrieves instructions and data that are currently used by the processor and stores them in the cache. When the processor fetches instructions or data, it first checks the cache. The control logic determines whether the required information is stored in the cache (a cache hit). If a cache hit occurs, the CPU does not need to access main memory. The control logic uses valuable cycles to determine whether the requested data is in the cache. However, this cost is acceptable since accesses to main memory are much slower.
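The trade-off described above can be illustrated with a small cycle-counting model. This is only a sketch under stated assumptions: the cycle costs `CHECK_CYCLES` and `MEMORY_CYCLES`, the function names, and the least-recently-used replacement policy are hypothetical choices made for illustration and are not taken from the text.

```python
from collections import OrderedDict

CHECK_CYCLES = 1     # assumed cost of checking the cache on every access
MEMORY_CYCLES = 10   # assumed cost of one main-memory access


def cycles_with_cache(addresses, capacity):
    """Return (total cycles, hit count) for a run of memory accesses."""
    cache = OrderedDict()  # stand-in for the cache; keys are addresses
    cycles = 0
    hits = 0
    for addr in addresses:
        cycles += CHECK_CYCLES           # the control logic always checks first
        if addr in cache:
            hits += 1                    # cache hit: no main-memory access
            cache.move_to_end(addr)      # mark as most recently used
        else:
            cycles += MEMORY_CYCLES      # cache miss: go to main memory
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used (assumed policy)
    return cycles, hits


def cycles_without_cache(addresses):
    """Every access pays the full main-memory cost."""
    return len(addresses) * MEMORY_CYCLES
```

With these assumed costs, the access pattern `[0, 1, 0, 1, 0, 1, 2, 0]` and a two-entry cache gives 48 cycles and 4 hits, against 80 cycles with no cache. A pattern that never hits, such as `[0, 1, 2, 3]` with a one-entry cache, costs 44 cycles against 40, which is the overhead of the check itself.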
As can be seen, the higher the cache "hit" rate, the faster the CPU can perform its duties. Obviously, the larger the cache, the more data it can store and, thus, the higher the probability of a hit. However, in the real world, microprocessor designers are always faced with size constraints because there is limited available space on a die. Using a larger die, although effective, is not practical since cost increases as die size increases. Further, reducing the size of the cache without reducing its performance frees die area that the designer can use to improve the performance of other functional units of the CPU.
Thus, there is a need for a cache that can determine whether a hit has occurred in a minimum number of cycles and that achieves a high hit rate while reducing the space needed on the chip.
The present invention offers a highly efficient mechanism for implementing cache memory in a computer system. This mechanism enables the cache memory to have a high "hit" rate, fast access time, low latency, and reduced physical size.
In one embodiment, the present invention provides a cache which operates in parallel with the translation lookaside buffer to reduce its latency. The cache contains two 2-way set-associative arrays that are interleaved together. Each 2-way set-associative array includes two arrays, one each for the tag and data. Because the four cache arrays operate independently, up to four instructions can operate simultaneously. The bits in each data array are interleaved to allow two distinct access patterns. For example, when the cache is loaded or copied back, two doublewords in the same block are accessed simultaneously. When the cache is read, the same doubleword location is simultaneously read from both blocks within the set. Further, by using a multiplexer, the number of sense amplifiers for reading and writing is reduced, thereby saving a significant amount of valuable space on the die.
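The read path summarized above can be sketched in software as follows. This is a simplified model and not the circuit of the embodiment: the block size, number of sets, page size, the rule that the low block-address bit selects the bank, and all names (`index_fields`, `translate`, `Cache`) are assumptions chosen for illustration. The model reads the same doubleword from both ways of a set using untranslated address bits, then selects a way by comparing tags against the translated address. The load and copy-back access pattern is not modeled.

```python
BLOCK_BYTES = 32   # assumed bytes per cache block
DWORD_BYTES = 8    # one doubleword
NUM_BANKS = 2      # two interleaved 2-way set-associative arrays
NUM_SETS = 4       # assumed sets per bank (kept tiny for illustration)
NUM_WAYS = 2
PAGE_BYTES = 4096  # assumed page size
SPAN = BLOCK_BYTES * NUM_BANKS * NUM_SETS  # bytes covered by the index bits


def index_fields(addr):
    """Return (bank, set index, doubleword offset) from page-offset bits.

    These bits are the same in the virtual and physical address as long
    as SPAN <= PAGE_BYTES, so they are usable before translation.
    """
    block = addr // BLOCK_BYTES
    bank = block % NUM_BANKS               # assumed interleaving rule
    set_index = (block // NUM_BANKS) % NUM_SETS
    dword = (addr % BLOCK_BYTES) // DWORD_BYTES
    return bank, set_index, dword


def translate(tlb, vaddr):
    """Hypothetical TLB: a dict from virtual page number to physical frame."""
    pfn = tlb[vaddr // PAGE_BYTES]
    return pfn * PAGE_BYTES + vaddr % PAGE_BYTES


class Cache:
    def __init__(self):
        # banks[bank][set][way] is None or (tag, list of doublewords)
        self.banks = [[[None] * NUM_WAYS for _ in range(NUM_SETS)]
                      for _ in range(NUM_BANKS)]

    def fill(self, paddr, way, dwords):
        """Load one block into the given way of the set selected by paddr."""
        bank, set_index, _ = index_fields(paddr)
        self.banks[bank][set_index][way] = (paddr // SPAN, list(dwords))

    def read(self, tlb, vaddr):
        """Return the addressed doubleword on a hit, or None on a miss."""
        bank, set_index, dword = index_fields(vaddr)
        ways = self.banks[bank][set_index]
        # Read the same doubleword location from both blocks in the set.
        candidates = [w[1][dword] if w else None for w in ways]
        # In hardware the translation proceeds at the same time as the
        # array read; here it simply runs next.
        tag = translate(tlb, vaddr) // SPAN
        for way, entry in enumerate(ways):
            if entry is not None and entry[0] == tag:
                return candidates[way]   # tag match selects the way
        return None
```

For example, after `cache.fill(9 * PAGE_BYTES + 0x48, 1, [10, 11, 12, 13])`, a read of virtual address `5 * PAGE_BYTES + 0x48` through a TLB mapping page 5 to frame 9 returns 11, the second doubleword of the block. The same read through a TLB mapping page 5 to frame 10 misses, because the tags differ.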
A better understanding of the nature and advantages of the present invention may be had with reference to the detailed description and the drawings below.