The present invention relates generally to a cache in a microprocessor system, and more particularly to a method and apparatus for an assist cache.
As processor speed continues to increase at a faster rate than memory speed, memory speed has become increasingly critical. A cache is a type of buffer that is smaller and faster than the main memory. The cache is disposed between the processor and the main memory. To reduce the effective memory access time seen by the processor, the cache stores a copy of instructions and data from the main memory that are likely to be needed next by the processor.
Caches can use a single buffer or multiple buffers with each buffer having a different speed or latency. Latency is the number of clock cycles required to access the data or instructions stored in the memory or cache.
In the past, microprocessors were constructed with single-cycle on-chip caches to reduce memory latency. In many current high-performance microprocessors, instruction and data memory references now require two clock cycles instead of a single clock cycle. Consequently, the execution unit of the processor requires additional stages to access the memory, which increases both the amount of hardware and the branch penalty. Increasing the amount of hardware increases power and cost. Increasing the branch penalty reduces performance.
Since performance is critical in any processor application, branch prediction and branch target buffers are used to reduce the branch penalty, which in turn requires yet more hardware. However, embedded processors are designed to improve performance (speed) per dollar and performance per watt rather than raw performance. Adding more pipeline stages to the execution unit and increasing the amount of hardware is not a satisfactory solution for meeting the requirements of embedded processors.
The cache stores the instructions that were copied from the main memory in cache lines. A cache line may store one or many consecutive instructions. Each cache line also has a tag that is used to identify the memory address of the copied instructions. In its simplest form, a tag is the entire address. When a cache line stores multiple instructions, the entire address does not need to be stored. For example, if a cache line stores eight bytes, the three least significant bits of the address need not be stored in the tag.
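By way of illustration only, the eight-byte example above can be sketched as follows. The names LINE_SIZE, OFFSET_BITS and split_address are hypothetical and do not appear in the disclosure, and the sketch assumes the simple case described above in which the tag is the address with only the byte-offset bits removed.

```python
# Illustrative sketch only; names are hypothetical, not from the disclosure.
LINE_SIZE = 8                               # bytes per cache line, as in the example
OFFSET_BITS = LINE_SIZE.bit_length() - 1    # 3 bits select a byte within the line


def split_address(addr):
    """Split a byte address into (tag, offset) for an 8-byte cache line."""
    tag = addr >> OFFSET_BITS          # the three least significant bits are dropped
    offset = addr & (LINE_SIZE - 1)    # position of the byte inside the line
    return tag, offset


# All eight bytes of one line share a single tag.
tags = {split_address(a)[0] for a in range(0x1230, 0x1238)}
```

In this sketch every address from 0x1230 through 0x1237 yields the same tag, which is why the three low-order bits need not be stored.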
A cache hit occurs when a requested instruction is already stored in the cache. A cache miss occurs when the requested instruction is not stored in the cache. Typically, when a cache miss occurs, the execution unit must wait or stall until the requested instruction is retrieved from the main memory before continuing the execution of the program, which degrades processor performance. The number of cache hits and misses is used as a measure of computer system performance.
Multi-level caches typically have two buffers, which will be referred to as the L0 and L1 caches, that have different speeds, that is, different memory access latencies. Typically, the L1 cache is slower than the L0 cache. The L1 cache receives instructions and data from the main memory. The L0 cache typically receives instructions and data from the L1 cache to supply to the execution unit.
The cache lines of the caches can also be direct mapped, fully associative or set associative with respect to the memory addresses. A fully associative cache does not associate a memory address with any particular cache line. Instructions and data can be placed in any cache line. A direct mapped cache associates a particular cache line with each memory address and places the instructions or data stored at a particular address only in that particular cache line. A set associative cache directly maps sets or groups of consecutive cache lines to particular memory locations. However, within a set of cache lines, the cache is fully associative.
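The three mappings can be illustrated with a single hypothetical helper, candidate_lines, which returns the cache lines in which a given memory line may be placed. The sketch assumes that the set is selected by the line address modulo the number of sets, which is a common choice but is not stated in the disclosure.

```python
# Illustrative sketch only; candidate_lines is a hypothetical helper.
def candidate_lines(line_addr, num_lines, ways):
    """Return the indices of the cache lines that may hold line_addr.

    ways == 1          -> direct mapped (exactly one candidate line)
    ways == num_lines  -> fully associative (any line is a candidate)
    otherwise          -> set associative (any line within one set)
    """
    num_sets = num_lines // ways
    set_index = line_addr % num_sets      # assumed mapping: modulo the set count
    first = set_index * ways              # sets are groups of consecutive lines
    return list(range(first, first + ways))


direct = candidate_lines(13, num_lines=8, ways=1)
two_way = candidate_lines(13, num_lines=8, ways=2)
fully = candidate_lines(13, num_lines=8, ways=8)
```

For an eight-line cache, memory line 13 has one candidate line when direct mapped, two consecutive candidates when two-way set associative, and all eight when fully associative.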
Direct mapped caches have the fastest access time but tend to develop hot spots. A hot spot is a repeated miss to the same cache line. Fully associative caches have a better hit rate than direct mapped caches, but have a slower access time than direct mapped caches. With respect to access time, set associative caches are between direct mapped caches and fully associative caches.
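A hot spot can be reproduced with a small simulation. The sketch below is illustrative only: the function names are hypothetical, the direct mapped index is assumed to be the line address modulo the number of lines, and the fully associative cache is assumed to use least recently used replacement.

```python
# Illustrative sketch only; names and replacement policy are assumptions.
from collections import OrderedDict


def count_misses_direct(trace, num_lines):
    """Count misses in a direct mapped cache (index = address mod num_lines)."""
    lines = [None] * num_lines
    misses = 0
    for line_addr in trace:
        idx = line_addr % num_lines
        if lines[idx] != line_addr:
            misses += 1
            lines[idx] = line_addr        # the previous occupant is displaced
    return misses


def count_misses_fully_assoc(trace, num_lines):
    """Count misses in a fully associative cache with LRU replacement."""
    lines = OrderedDict()                 # ordered from least to most recently used
    misses = 0
    for line_addr in trace:
        if line_addr in lines:
            lines.move_to_end(line_addr)
        else:
            misses += 1
            if len(lines) >= num_lines:
                lines.popitem(last=False) # evict the least recently used line
            lines[line_addr] = True
    return misses


# Lines 0 and 4 both map to index 0 of a four-line direct mapped cache.
trace = [0, 4] * 4
direct_misses = count_misses_direct(trace, 4)
assoc_misses = count_misses_fully_assoc(trace, 4)
```

With this trace the direct mapped cache misses on every access because the two lines keep displacing each other, whereas the fully associative cache misses only on the first reference to each line.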
A victim cache is a fully associative cache that stores cache lines that were displaced from the L0 cache. A victim cache line is a cache line in the L0 cache that is being replaced. In one cache system, on every cache miss, the victim cache line is copied to the victim cache. If the victim cache is full, then the new victim cache line replaces the least recently used cache line in the victim cache. When a cache miss occurs in the L0 and L1 caches, the cache determines if the requested instruction or data is stored in the victim cache, and if so the cache provides that instruction or data from the victim cache to the execution unit.
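The victim cache behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions: the class name L0WithVictimCache is hypothetical, the L0 cache is modeled as direct mapped, the L1 cache is omitted so that any access missing both the L0 cache and the victim cache is treated as served from memory, and the victim cache is assumed to hold at least one line.

```python
# Illustrative sketch only; the class and its simplifications are assumptions.
from collections import OrderedDict


class L0WithVictimCache:
    def __init__(self, l0_lines, victim_lines):
        self.l0 = [None] * l0_lines       # direct mapped L0 (assumed)
        self.victim = OrderedDict()       # fully associative, oldest entry first
        self.victim_lines = victim_lines  # assumed to be >= 1

    def access(self, line_addr):
        """Return where the line was found: 'l0', 'victim' or 'memory'."""
        idx = line_addr % len(self.l0)
        if self.l0[idx] == line_addr:
            return "l0"
        if line_addr in self.victim:
            del self.victim[line_addr]    # the line moves back into L0
            source = "victim"
        else:
            source = "memory"
        displaced = self.l0[idx]          # the victim cache line
        self.l0[idx] = line_addr
        if displaced is not None:
            self._store_victim(displaced)
        return source

    def _store_victim(self, line_addr):
        self.victim.pop(line_addr, None)
        if len(self.victim) >= self.victim_lines:
            self.victim.popitem(last=False)   # replace the least recently used line
        self.victim[line_addr] = True


cache = L0WithVictimCache(l0_lines=4, victim_lines=2)
results = [cache.access(a) for a in [0, 4, 0, 4]]
```

In this sketch lines 0 and 4 conflict in the L0 cache, so after the first two accesses each displaced line is recovered from the victim cache rather than from memory.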
The victim cache improves performance by storing frequently accessed cache lines after at least one cache miss has occurred. Another technique is used to prevent cache misses. A prefetch technique uses a stream buffer to fetch instructions and data into the cache before successive cache misses can occur. When a miss occurs, the stream buffer prefetches successive instructions beginning with the missed instruction into the stream buffer. Subsequent cache accesses compare the requested address not only to the addresses or tags of the L1 and L0 caches but also to the tags of the stream buffer. If the requested instruction is not in the L1 or L0 caches but is in the stream buffer, the cache line containing the requested instruction is moved from the stream buffer into the L1 or L0 caches. However, the stream buffer uses additional space on the chip and increases power consumption.
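The stream buffer technique can be sketched in the same manner. The sketch is illustrative only and makes several simplifying assumptions not found in the disclosure: the class name StreamBufferCache is hypothetical, the L0 and L1 caches are collapsed into a single cache of unbounded capacity, the missed line is placed directly in the cache while the lines that follow it are placed in the stream buffer, and the stream buffer is not refilled after a hit.

```python
# Illustrative sketch only; the class and its simplifications are assumptions.
class StreamBufferCache:
    def __init__(self, depth):
        self.cache = set()    # stands in for the L0/L1 caches (unbounded here)
        self.stream = []      # line addresses currently held in the stream buffer
        self.depth = depth    # number of successive lines prefetched on a miss

    def access(self, line_addr):
        """Return where the line was found: 'cache', 'stream' or 'memory'."""
        if line_addr in self.cache:
            return "cache"
        if line_addr in self.stream:
            # The line is moved from the stream buffer into the cache.
            self.stream.remove(line_addr)
            self.cache.add(line_addr)
            return "stream"
        # Miss everywhere: fetch the line and prefetch the lines that follow it.
        self.cache.add(line_addr)
        self.stream = [line_addr + i for i in range(1, self.depth + 1)]
        return "memory"


sb = StreamBufferCache(depth=2)
results = [sb.access(a) for a in [10, 11, 12, 13, 10]]
```

For a sequential access pattern, only the first line of each run is fetched on demand in this sketch; the lines that follow are found in the stream buffer.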
Energy efficiency is also important. Although the victim cache and the stream buffer both improve performance, they also increase hardware complexity, cost and power consumption.
Thus, a cache configuration that solves the problem of providing an acceptable cache hit rate while maintaining single-cycle access latency is desirable. Furthermore, it is also desirable that the proposed cache configuration reduce power consumption and hardware requirements, which are critical design constraints for embedded microprocessors.
Therefore, it is an object of the present invention to provide an improved apparatus for and method of operating a cache.
It is a related objective of the invention to create an improved method and apparatus for reducing power consumption of the cache.
These and other objectives and advantages of the present invention are achieved by utilizing a multilevel cache with an assist cache and assist filter. The assist cache stores both displaced cache lines (victims) from the L0 cache and prefetched cache lines from the L1 cache. In one embodiment, a particular mix of victim cache lines and prefetch cache lines in the assist cache is hard-wired. In an alternate embodiment, the particular mix of cache lines is dynamically allocated.
More particularly, an L1 cache receives instructions from an external memory. An L0 cache has a first predetermined number L0 of cache lines for receiving instructions from the L1 cache. An assist cache has a victim cache and a prefetch cache. The victim cache stores a second predetermined number VC of cache lines, and the prefetch cache stores a third predetermined number PC of cache lines. The victim cache receives instructions from the L0 cache. The prefetch cache receives instructions from the L1 cache. A victim filter stores a fourth predetermined number VF of addresses wherein VF is a function of L0 and a number of cache writes. The number of cache writes to the L0 and the victim cache is reduced relative to using the L0 cache without the assist cache.
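As a rough illustration of the partitioning described above, the following sketch models an assist cache with a fixed split between VC victim lines and PC prefetch lines. It is not the claimed implementation: the class name AssistCache and its methods are hypothetical, each partition is assumed to replace its oldest entry when full and to hold at least one line, and the victim filter is omitted because its function is not specified in this summary.

```python
# Illustrative sketch only; names and replacement policy are assumptions.
from collections import OrderedDict


class AssistCache:
    def __init__(self, vc, pc):
        self.vc = vc                      # number of victim cache lines (>= 1)
        self.pc = pc                      # number of prefetch cache lines (>= 1)
        self.victims = OrderedDict()      # lines displaced from the L0 cache
        self.prefetched = OrderedDict()   # lines prefetched from the L1 cache

    @staticmethod
    def _insert(partition, capacity, line_addr):
        partition.pop(line_addr, None)
        if len(partition) >= capacity:
            partition.popitem(last=False) # assumed policy: replace the oldest entry
        partition[line_addr] = True

    def add_victim(self, line_addr):
        self._insert(self.victims, self.vc, line_addr)

    def add_prefetch(self, line_addr):
        self._insert(self.prefetched, self.pc, line_addr)

    def lookup(self, line_addr):
        """Return 'victim', 'prefetch', or None if the line is not present."""
        if line_addr in self.victims:
            return "victim"
        if line_addr in self.prefetched:
            return "prefetch"
        return None


ac = AssistCache(vc=2, pc=1)
for line in (1, 2, 3):
    ac.add_victim(line)
for line in (7, 8):
    ac.add_prefetch(line)
found = [ac.lookup(a) for a in (1, 3, 7, 8)]
```

Because the two partitions have separate capacities in this sketch, a burst of prefetches cannot evict victim lines, and vice versa; a dynamically allocated mix would relax that separation.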
Other features and advantages of the present invention will become apparent to a person of skill in the art who studies the disclosure of the present invention. Therefore, a more detailed description of a preferred embodiment of the invention is given with respect to the following drawings.