1. Field of the Invention
The invention relates to microcomputer systems employing cache memory, and more specifically relates to a method of and apparatus for using cache way prediction in a multiple-way cache to increase the performance of such cache memories.
2. Description of the Related Art
The field of microcomputer design has seen tremendous advances in the past decade. Microcomputer systems have undergone revolutionary changes since the introduction of the original Personal Computer by International Business Machines using an 8088 microprocessor by Intel Corporation. Systems today include multiple buses, extremely wide data paths, huge addressing spaces, and processor speeds of over 50 MHz.
In the course of improving these microcomputer systems, designers of both the chips and the systems themselves resorted to cache technology to more efficiently couple the higher speed processors with lower speed memory. These caches have evolved into multiple level caches, with both data and code caches in the microprocessor, and a second level (L2) cache coupling the microprocessor host bus to slower speed system memory.
The purpose of a cache is to eliminate wait states. A processor running at 50 MHz has only 20 nanoseconds per clock cycle, and only more expensive static random access memory (SRAM) can respond to memory operations in that amount of time without wait states. Main memory is usually implemented using less expensive dynamic random access memory (DRAM), which is slower and requires the processor to wait additional clock cycles for a response. A cache provides an intermediate store of much higher speed memory, relatively smaller than main memory, that permits a processor to run with fewer wait states when the needed data is stored in the cache memory. When the data is not stored in the cache memory, a cache controller then retrieves that data from main memory.
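By way of illustration only, the timing arithmetic above can be sketched as follows. The model is deliberately simplified: it assumes a memory access must complete within a whole number of clock cycles and ignores setup and propagation delays. The function names and the 15 nanosecond and 70 nanosecond access times are hypothetical values chosen for the example and do not come from the text.

```python
import math

def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz

def wait_states(access_ns: float, freq_mhz: float) -> int:
    """Extra clock cycles, beyond the first, needed for a memory that
    takes access_ns to respond (simplified model, see lead-in)."""
    cycles_needed = math.ceil(access_ns / clock_period_ns(freq_mhz))
    return max(0, cycles_needed - 1)

period = clock_period_ns(50)        # 50 MHz processor clock from the text
sram_waits = wait_states(15, 50)    # hypothetical 15 ns SRAM
dram_waits = wait_states(70, 50)    # hypothetical 70 ns DRAM
```

Under these assumed access times, the faster memory responds within one clock period while the slower memory forces the processor to wait several additional cycles.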
Cache design is well known in the microcomputer art and includes a number of configurations. Two popular configurations are a direct mapped cache and a multiple-way cache. In a direct mapped cache, a certain portion of the memory address is stored in a cache tag RAM as a tag. This tag indicates to what block of main memory the data stored in the cache data RAM corresponds. A second portion of the processor-provided memory address, known as the index, addresses both the cache tag RAM and the cache data RAM; the indexed cache data RAM location is used when the tag stored at that index is valid and matches the tag portion of the address on the processor address bus. For example, assume a simple memory system using 16 bit addressing, providing 64 kilobytes of memory. Assume that the high speed cache data RAM provides 256 bytes of high speed cache memory, one byte per indexed location. The lower 8 bits of the memory address then act as an index into both the cache data RAM and the cache tag RAM, while the top 8 address bits are stored in the indexed location in the tag RAM. When the microprocessor seeks to access a certain memory location, the top 8 address bits are compared to the data stored in the indexed location of the tag RAM. If the tag matches those bits, then the indexed cache data memory location contains the data from main memory corresponding to the processor-provided memory address.
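The lookup just described can be sketched, for illustration only, as a small model with an 8 bit index and an 8 bit tag taken from a 16 bit address. The sketch assumes one byte per indexed location (the text does not state a line size), and the names `split_address` and `DirectMappedCache` are hypothetical.

```python
INDEX_BITS = 8  # lower 8 address bits select the cache location
TAG_BITS = 8    # upper 8 address bits are stored in the tag RAM

def split_address(addr: int):
    """Split a 16 bit address into (tag, index)."""
    index = addr & ((1 << INDEX_BITS) - 1)
    tag = (addr >> INDEX_BITS) & ((1 << TAG_BITS) - 1)
    return tag, index

class DirectMappedCache:
    def __init__(self):
        entries = 1 << INDEX_BITS
        self.tag_ram = [None] * entries  # None marks an invalid entry
        self.data_ram = [0] * entries    # assumed: one byte per location

    def read(self, addr: int, main_memory):
        """Return (data, hit). On a miss, fill the location from main memory."""
        tag, index = split_address(addr)
        if self.tag_ram[index] == tag:
            return self.data_ram[index], True
        self.tag_ram[index] = tag
        self.data_ram[index] = main_memory[addr]
        return self.data_ram[index], False

# Usage: a 64 kilobyte main memory holding an arbitrary test pattern.
main_memory = [a & 0xFF for a in range(1 << 16)]
cache = DirectMappedCache()
first = cache.read(0x1234, main_memory)   # miss, then filled
second = cache.read(0x1234, main_memory)  # hit on the same address
```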
This example illustrates a direct mapped cache. A problem with such a direct mapped cache is that if two memory locations are accessed which are one cache page apart, that is, they have the same index, they can only occupy the same location in the cache data RAM. Thus, on alternating accesses to such locations, each access results in a cache miss, and the cache must be updated from or to main memory every time.
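The conflict can be illustrated with a short miss counter. This is a sketch under the same assumed 8 bit index; the function name and the sample addresses are hypothetical.

```python
def direct_mapped_misses(addresses, index_bits=8):
    """Count misses in a direct mapped cache that tracks tags only."""
    mask = (1 << index_bits) - 1
    tag_ram = {}  # index -> tag currently stored at that index
    misses = 0
    for addr in addresses:
        index, tag = addr & mask, addr >> index_bits
        if tag_ram.get(index) != tag:
            misses += 1
            tag_ram[index] = tag  # the new line displaces the old one
    return misses

# 0x0034 and 0x0134 share index 0x34 but have different tags.
conflicting = direct_mapped_misses([0x0034, 0x0134] * 4)
# 0x0034 and 0x0135 use different indexes and do not collide.
non_conflicting = direct_mapped_misses([0x0034, 0x0135] * 4)
```

In the conflicting pattern every access displaces the other location, so all eight accesses miss; in the non-conflicting pattern only the first access to each location misses.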
To provide for a greater hit percentage, designers implemented multi-way set-associative caches. A typical example is a two-way set-associative cache. In such a cache, each cache index has two corresponding cache data RAM storage locations and two corresponding tags in the cache tag RAM. If one indexed way is already occupied by data corresponding to another tag, then the second way can be used to store and retrieve the main memory data. When yet a third location of main memory needs to be stored in the cache, various algorithms are used to determine in which way of the cache to replace data, such as a least recently used algorithm.
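A two-way set-associative cache with least recently used replacement might be modeled as below. This is an illustrative sketch only; the function name, the 8 bit index, and the sample addresses are assumptions, and a hardware implementation would typically track recency with a single bit per index rather than an ordered list.

```python
def two_way_lru_misses(addresses, index_bits=8):
    """Count misses in a two-way set-associative cache with LRU replacement."""
    mask = (1 << index_bits) - 1
    sets = {}  # index -> list of up to two tags, most recently used last
    misses = 0
    for addr in addresses:
        index, tag = addr & mask, addr >> index_bits
        ways = sets.setdefault(index, [])
        if tag in ways:
            ways.remove(tag)   # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)    # both ways occupied: evict least recently used
        ways.append(tag)
    return misses

# Two locations with the same index now coexist, one in each way.
two_locations = two_way_lru_misses([0x0034, 0x0134] * 4)
# A third location with the same index forces replacements.
three_locations = two_way_lru_misses([0x0034, 0x0134, 0x0234] * 2)
```

With two conflicting locations only the initial fills miss. With three locations accessed in rotation, least recently used replacement evicts each line just before it is needed again, so every access misses.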
Both of these types of cache designs are well known in the art of microcomputer design. There are implementation details of such caches, however, that can affect system speed. Advances have become incremental in recent years, with the savings of even a single wait state becoming important to improving system performance.
In a two-way set-associative cache, one of the ways must be chosen during a microprocessor read operation. The index portion of the asserted address is latched to the cache data and tag RAMs, and then when the cache tag RAMs return signals indicating a hit or miss to a first or second way, the output enables of the appropriate bank of cache data RAM corresponding to that way are asserted. But when running at 50 MHz, even the act of enabling the output buffers of the appropriate way of the cache can require a wait state. Cache designers have therefore implemented way prediction algorithms based on which way of the cache was most recently used. Then, when an address is strobed onto the processor host bus by the processor for a read operation, this most recently used way prediction logic predicts which way will be needed and asserts that way's cache output enable. This occurs even before other more complicated circuitry determines which way, if any, is actually a hit, saving precious time in enabling the output of the cache data RAM.
This causes a problem, however, should the predicted way be a miss. In such a case, the output enables of the bank that forms the first way must be disabled, and only then can the other way's output enables be asserted. This will typically require extra wait states. But, because the way prediction logic will be correct at least half of the time, this generally improves system performance, allowing no wait state operation in at least half of the cases. In high speed systems, however, the cache data RAM's output enables cannot wait on a read of the tag RAM to generate the most recently used bit. That is, the most recently used bit will not become available from the most recently used logic and the cache tag RAMs in time to assert the cache output enables in anticipation of a cache data RAM read. In such a case, only the use of extremely high speed SRAMs would allow no wait state operation.
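The effect of most recently used way prediction on wait states can be sketched as follows. This is an illustrative model only: it considers read hits, assumes a one wait state penalty for a mispredicted way, and assumes way 0 is predicted for an index that has not yet been accessed. None of these values, nor the function name, come from the text. The model also does not capture the timing problem described above, in which the most recently used bit itself arrives too late.

```python
def mru_prediction_wait_states(read_hits, mispredict_penalty=1):
    """Total wait states for a sequence of (index, way) read hits when the
    most recently used way at each index is predicted before the tag
    lookup completes."""
    mru_way = {}  # index -> way most recently used at that index
    total = 0
    for index, actual_way in read_hits:
        predicted_way = mru_way.get(index, 0)  # assumed default: way 0
        if predicted_way != actual_way:
            # Wrong bank enabled: disable it, then enable the other way.
            total += mispredict_penalty
        mru_way[index] = actual_way
    return total

# Five reads to index 5: ways 0, 0, 1, 1, 0.
reads = [(5, 0), (5, 0), (5, 1), (5, 1), (5, 0)]
waits = mru_prediction_wait_states(reads)
```

Here the prediction is wrong only on the two reads where the way changes, so three of the five reads proceed without a wait state.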
Therefore, it would be desirable if other efficient way prediction algorithms could be developed that would allow no wait state operation in a majority of cases without using prohibitively expensive cache data SRAMs.
As noted above, a multi-way set-associative cache has historically been implemented using a separate physical bank of cache data RAM for each way, with the output enables being used for way selection.
In a two-way set-associative cache, at least two banks of memory would be required. Assuming a 64-bit-wide data bus and 32-bit-wide burst SRAMs, this would mean four physical chips. It would be greatly desirable to allow implementation of multi-way caches without these minimum requirements of memory banks, allowing less expensive cache memory systems to be designed. Generally, the only alternative in the past has been using a direct mapped cache memory subsystem.
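The chip count above follows from simple arithmetic, sketched here for illustration. The function name is hypothetical, and the sketch assumes one physical bank per way, as described in the text.

```python
def sram_chips_required(bus_width_bits, chip_width_bits, ways):
    """Minimum number of SRAM chips when each way is a separate bank
    that spans the full data bus width."""
    # Ceiling division: chips needed to cover the bus in one bank.
    chips_per_bank = -(-bus_width_bits // chip_width_bits)
    return chips_per_bank * ways

two_way_chips = sram_chips_required(64, 32, ways=2)  # two banks of two chips
one_bank_chips = sram_chips_required(64, 32, ways=1) # a single bank
```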
To maintain coherency between the cache memory and main memory, cache tag RAMs typically have a number of bits of information associated with the state of the corresponding data in the cache data RAM. For example, a valid bit is typically used to indicate whether a particular line of a cache contains valid data. Further, in a writeback cache, in which data written from the processor to the cache is not updated to main memory until necessary, a modified bit indicates that the processor has written to a particular line of the cache data RAM, but that line has not been subsequently written back to main memory.
These bits typically provide another source of system delays. When the processor writes to a line of cache memory in the writeback cache that has not been modified, the modified bit must be updated to reflect the modified status of the corresponding line in the cache data RAM. Except in complicated and expensive multi-ported cache memory systems, this update typically takes an extra clock cycle. That is, in a cache memory system where the memory is accessed through only one bus, a write to an unmodified valid line in the cache data RAM typically requires extra clock cycles to update the modified bit in the cache tag RAM.
When the processor writes to the cache and the indexed line in both ways of the cache data RAM is not valid, or if both ways result in cache tag misses, then the cache does not contain the line of memory to which the processor is attempting to write. This condition instigates an allocation cycle for writeback caches. If the line of data in the chosen way of the cache data RAM is valid but a tag miss, that data must first be written back to main memory before the appropriate line of main memory is read into the cache, whereas, if the line of data in the cache data RAM is invalid, no writeback is needed. After such a writeback, if needed, the cache memory controller allocates a line in the cache data RAM for the addressed memory location. This line of data is read from main memory, and the corresponding index location of the tag RAM is updated to include the tag and the various valid and modified bits. A read allocation cycle is typically just a cache linefill, so the corresponding valid bits are set to true and the modified bits would be set to unmodified. The processor write operation that instigated the allocate is then permitted to restart, whereupon the appropriate data is written into the cache and the modified bit for that line of the cache is then asserted.
But as noted above, the assertion of this bit requires an extra clock cycle. Therefore, in single-ported cache memory systems, a write allocation cycle typically requires an extra wait state for the cache memory controller to set the appropriate modified bits on the subsequent write from the microprocessor. It would be desirable to eliminate such wait states.
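The write allocation sequence and the extra cycle for the modified bit can be sketched as a small model. This is illustrative only: it tracks a single way at one index, the names `Line` and `write_steps` are hypothetical, and the sketch writes back a displaced line only when that line is both valid and modified, which is an assumption about when a writeback is "needed".

```python
from dataclasses import dataclass

@dataclass
class Line:
    valid: bool = False
    modified: bool = False
    tag: int = 0

def write_steps(line: Line, tag: int):
    """Return the ordered steps for a processor write, updating line state."""
    steps = []
    if not (line.valid and line.tag == tag):
        # Tag miss or invalid line: allocation cycle.
        if line.valid and line.modified:
            steps.append("writeback")      # assumed: only modified lines
        steps.append("linefill")
        line.valid, line.modified, line.tag = True, False, tag
    steps.append("write data")
    if not line.modified:
        # Single-ported tag RAM: updating the bit costs an extra cycle.
        steps.append("set modified bit (extra cycle)")
        line.modified = True
    return steps

line = Line()
first_write = write_steps(line, tag=3)     # allocate, then write
second_write = write_steps(line, tag=3)    # line already modified
conflict_write = write_steps(line, tag=4)  # displaces a modified line
```

In this model the extra cycle appears on every write that follows a linefill, because the fill leaves the line unmodified, which is the wait state the text identifies as desirable to eliminate.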