Not applicable.
The present invention generally relates to a microprocessor. More particularly, the invention relates to prefetching instructions. Still more particularly, the invention relates to prefetching data into designated ways of a set-associative cache.
Most modern computer systems include at least one central processing unit (xe2x80x9cCPUxe2x80x9d) and a main memory. Multiprocessor systems include more than one processor and each processor typically has its own memory which may or may not be shared by other processors. The speed at which the CPU can decode and execute instructions and operands depends upon the rate at which the instructions and operands can be transferred from main memory to the CPU. In an attempt to reduce the time required for the CPU to obtain instructions and operands from main memory, many computer systems include a cache memory coupled between the CPU and main memory.
A cache memory is a relatively small, high-speed memory (compared to main memory) buffer which is used to hold temporarily those portions of the contents of main memory which it is believed will be used in the near future by the CPU. The main purpose of a cache is to shorten the time necessary to perform memory accesses, both for data and instructions. Cache memory typically has access times that are significantly faster than a system""s main memory. The use of cache memory can significantly improve system performance by reducing data access time, therefore permitting the CPU to spend far less time waiting for instructions and operands to be fetched and/or stored.
A cache memory, typically comprising some form of random access memory (xe2x80x9cRAMxe2x80x9d) includes many blocks (also called lines) of one or more words of data. Associated with each cache block in the cache is a tag. The tag provides information for mapping the cache line data to its main memory address. Each time the processor makes a memory reference (i.e., read or write), a tag value from the memory address is compared to the tags in the cache to see if a copy of the requested data resides in the cache. If the desired memory block resides in the cache, then the cache""s copy of the data is used in the memory transaction, instead of the main memory""s copy of the same data block. However, if the desired data block is not in the cache, the block must be retrieved from the main memory and supplied to the processor. A copy of the data also is stored in the cache. Commonly used mapping functions include direct mapping and associative mapping techniques.
In a modern computer system, latency to move instructions and operands into the cache is a significant performance issue. One of the key optimizations to deal with this issue is xe2x80x9cprefetch.xe2x80x9d The data to be retrieved is moved into the cache before it is actually needed. This is called a prefetch. In a modem computer system, specific instruction(s) are provided to do the prefetch. The time at which to do the prefetch depends on a number parameters of the hardware, and, indeed, the program itself. Deciding when to do the prefetch is not the subject of this disclosure.
In a multi-way set-associative cache the cache comprises a plurality of buffers referred to as xe2x80x9cwaysxe2x80x9d (also called xe2x80x9csetsxe2x80x9d). Each way comprises a plurality of cache blocks and associated tags. A portion of a memory address is an xe2x80x9cindexxe2x80x9d value. The index value from the memory address references corresponding cache blocks and their tags in the cache. For example, an index of 3 may reference the third cache block and associated tag in each of the ways. A xe2x80x9cmulti-wayxe2x80x9d cache has more than one way. A cache with two ways is called a two-way, set associative cache. Similarly, a cache with four ways is called a four-way, set associative cache.
When a multi-way access is made, a comparison is made of the tags in each array corresponding to the address""s index value to a tag value in the address itself. A match means that the desired data block is currently in a particular way in the cache and the operation is performed on or with the data from that particular way.
Because the cache stores only a subset of the memory data, for each memory reference the requested data may or may not be currently stored in the cache. When the data is currently in the cache, there is a xe2x80x9chitxe2x80x9d in the cache, as described above. When the data is not already in the cache, the requested data must be fetched from memory. The data fetched from memory is used in the transaction by the CPU and a copy of that data is placed in cache for subsequent use. This condition in which the needed data is not in cache is called a cache xe2x80x9cmiss.xe2x80x9d Upon a cache miss, if the cache is currently full, space must be allocated in the cache to provide storage for the data received from main memory. Of course, if the cache is full which typically occurs shortly after system initialization as the system begins normal operation, a block of data in cache must be xe2x80x9cevictedxe2x80x9d to make room for the new data block. The process of determining which cache block to xe2x80x9cevictxe2x80x9d is called xe2x80x9callocation.xe2x80x9d
One type of allocation scheme is the Least Recently Used (xe2x80x9cLRUxe2x80x9d) technique of allocation. In LRU allocation, a temporal indicator is stored for each cache block when data is written into the cache location. This temporal indicator serves as a so-called xe2x80x9ctime-stampxe2x80x9d because it provides a temporal characteristic to the stored data. In general, the time-stamp is used to determine which data in the cache has been used least recently. The LRU allocation technique evicts the least recently used cache block to make room for the new block. The LRU technique is based on the assumption that data that has not been used recently is less likely to be needed again compared to data that has been used more recently. Often, this assumption proves to be accurate, and other times even the least recently used cache block will be needed again after eviction. Performance testing has demonstrated, however, that on balance the LRU allocation technique achieves a better latency reduction than many other allocation techniques.
In an LRU-type allocation scheme, the most recently used cache block will remain in cache until it becomes the oldest (i.e., least recently used) block in the cache. This means that, in the interim, while that block remains in cache, other cache blocks necessarily will be evicted instead. However, it may be that a block that is not evicted because it is not the least recently used block will never again be used, while other blocks that may be needed again. Unfortunately, LRU-type allocation schemes keep more recently used blocks in the cache even if they will never again be needed or at least not needed any time soon. Retaining recently used, but unneeded, blocks in cache forces the allocation logic to evict other blocks that actually might be needed, thereby reducing the performance of the system.
Accordingly, it would be desirable to have a cache system and allocation scheme that addresses this deficiency. To date, no such system is known to exist.
The problems noted above are solved in large part by a computer system having a set-associative, multi-way cache system in which at least one way is designated as a fast lane and the remaining way(s) are designated slow lanes. Any data that needs to be loaded into cache, but is not likely to be needed again in the near future preferably is loaded into the fast lane. Data loaded into the fast lane is earmarked for immediate replacement. Data loaded into the slow lanes preferably is data that may be needed again in the near future. Slow data is kept in cache to permit it to be reused if necessary. Prefetch command data is one type of data that may not be needed again and can be placed into the fast lane. By loading xe2x80x9cnot likely to be needed againxe2x80x9d data into the fast lane and designating such data for immediate replacement, data in other cache blocks in the other ways may not be evicted and overall system performance is increased.
The cache system of the present invention preferably implements two cache allocation schemes. One scheme is used to allocate the slow ways and a different scheme is used for the fast way. Further, two or more fast lanes can be provided if desired.