1. Field of the Invention
The present invention generally relates to computer architectures and, more specifically, to multi-level instruction cache prefetching for multithreaded processors.
2. Description of the Related Art
A common practice in high-speed computing systems such as multithreaded processors is to utilize a multi-level cache system to reduce latency during instruction fetching. The first cache level is called the level one (L1) cache and is typically a small, high-speed memory closely associated with the processor. The L1 cache usually has the lowest memory access latency of the various cache levels and contains instructions that the processor accesses frequently or is likely to access in the near future. Performance increases when instructions are stored in the L1 cache at or before the time the instructions are accessed by the processor. A level two (L2) cache is typically a memory that is larger and slower than the L1 cache, but faster than system memory. Some cache systems may employ an intermediate level (L1.5) cache between the L1 and L2 caches, with latency and size somewhere between those of the L1 and the L2 caches.
Conventionally, when the processor accesses a new instruction, the fetch unit within the processing system first looks for the instruction in the L1 cache. If there is an L1 cache hit (i.e., the instruction is, in fact, present in the L1 cache), then the instruction is transferred, and the memory access operation is executed. If the instruction is not in the L1 cache, then there is an L1 cache miss, and the fetch unit attempts to find the instruction in the L1.5 cache. If there is a miss in the L1.5 cache, then the fetch unit next looks for the instruction in the L2 cache, and, in the event of an L2 cache miss, system memory is searched last.
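The lookup order described above may be illustrated with a minimal sketch. The function name `fetch_level`, the example addresses, and the representation of each cache level as a set of addresses are assumptions made purely for illustration; they do not describe any particular hardware implementation.

```python
def fetch_level(address, levels):
    """Return the name of the first level that holds `address`.

    `levels` is an ordered list of (name, set_of_addresses) pairs,
    fastest level first. System memory is the fallback when every
    cache level misses.
    """
    for name, contents in levels:
        if address in contents:  # cache hit at this level
            return name
    return "system memory"       # every cache level missed


# Hypothetical contents: each slower level holds more addresses.
hierarchy = [
    ("L1", {0x100}),
    ("L1.5", {0x100, 0x140}),
    ("L2", {0x100, 0x140, 0x180}),
]

print(fetch_level(0x140, hierarchy))
```

In this sketch, an address found only in the slower levels is reported at the first level that holds it, mirroring the order in which the fetch unit searches.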
When instruction accesses proceed in a predictable manner, the L1 cache hit rate may be improved by prefetching cache lines from the L1.5 cache and transferring those cache lines to the L1 cache before the processor attempts to access the corresponding instructions from the L1 cache. A processor predictably accesses instructions at successive memory address locations unless a branch occurs to a non-sequential memory location. Therefore, if the processor is accessing locations within a particular L1 cache line, then the fetch unit typically prefetches from the L1.5 cache a cache line containing the memory locations immediately following the current L1 cache line. This next cache line may be called a prefetch target and is located within the L1.5 cache immediately following the L1.5 cache addresses corresponding to the current L1 cache line. If the prefetch operation is successful, then by the time the processor reaches the memory address locations immediately following the current L1 cache line, that next L1 cache line has already been prefetched from the L1.5 cache and stored within the faster L1 cache. In this fashion, successful prefetching increases the hit rate within the L1 cache, and sequential memory accesses typically result in a cache hit. A similar technique may be employed at any level within the cache hierarchy. For example, the L1.5 cache may prefetch lines from the L2 cache, and the L2 cache may prefetch lines from system memory.
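The effect of next-line prefetching on sequential accesses may be illustrated with a simplified sketch. The model below assumes an L1 cache of unlimited capacity, prefetches that complete instantly, and a hypothetical line size of eight instructions; the name `count_l1_misses` is likewise an assumption for illustration.

```python
LINE_SIZE = 8  # instructions per cache line (assumed for illustration)


def count_l1_misses(addresses, prefetch=True):
    """Count L1 misses for a sequence of instruction addresses.

    Idealized model: the L1 cache never evicts, and a prefetch of the
    next line completes before the processor needs it.
    """
    l1 = set()   # line numbers currently held in L1
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in l1:
            misses += 1
            l1.add(line)       # demand fill on a miss
        if prefetch:
            l1.add(line + 1)   # prefetch target: the next sequential line
    return misses


sequential = range(32)  # four consecutive cache lines
print(count_l1_misses(sequential, prefetch=False))
print(count_l1_misses(sequential, prefetch=True))
```

Under these assumptions, the sequential run misses once per cache line without prefetching, and only on the very first line when the next line is always prefetched.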
In one prefetch approach, a processor may access two instructions at a time from the L1 cache, where each L1 cache line contains eight instructions. Such a technique is referred to as a “sectored” access, where each pair of instructions represents a “sector” within the L1 cache line, and each L1 cache line has four sectors. The fetch unit monitors which sector the processor accesses at any given time and uses this information to prefetch the next L1 cache line. Again, if the prefetch operation is successful, then by the time the processor consumes the last sector in the current L1 cache line, the next L1 cache line has already been prefetched from the L1.5 cache and stored within the L1 cache.
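The sector arithmetic in this example may be sketched as follows. The line and sector sizes come from the example above; the function names and the choice of which sector triggers the prefetch (`trigger_sector`) are assumptions, since the description does not state when within a line the prefetch is issued.

```python
INSTRUCTIONS_PER_LINE = 8
INSTRUCTIONS_PER_SECTOR = 2
SECTORS_PER_LINE = INSTRUCTIONS_PER_LINE // INSTRUCTIONS_PER_SECTOR  # four sectors


def locate(instruction_index):
    """Map an instruction index to its (cache line, sector) pair."""
    line = instruction_index // INSTRUCTIONS_PER_LINE
    sector = (instruction_index % INSTRUCTIONS_PER_LINE) // INSTRUCTIONS_PER_SECTOR
    return line, sector


def prefetch_target(instruction_index, trigger_sector=0):
    """Return the line to prefetch, or None if no prefetch is issued.

    Assumption: the prefetch of the next line is issued when the
    processor accesses `trigger_sector` of the current line.
    """
    line, sector = locate(instruction_index)
    return line + 1 if sector == trigger_sector else None


print(locate(10))
print(prefetch_target(8))
```

Choosing an earlier trigger sector gives the prefetch more time to complete before the processor consumes the last sector of the current line.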
One drawback of this conventional approach to prefetching instructions is that a faster processor may transfer an entire L1 cache line (containing eight instructions in this example) at one time. In such a case, the fetch unit is not able to monitor processor accesses by sector in order to prefetch additional cache lines from the L1.5 cache. Another drawback of this approach is that a faster processor may consume instructions at such a high rate that the fetch unit is not able to prefetch L1 cache lines quickly enough, causing an increase in cache misses. To counter this second problem, the fetch unit may prefetch two L1 cache lines ahead from the L1.5 cache in an attempt to fill the L1 cache lines before the processor accesses those lines. However, in the event of a branch to a non-sequential location, a processor typically ends up incurring two or more cache misses for every branch executed, one for each of the first two cache lines at the branch target, rather than just one cache miss. Consequently, the number of cache misses increases, thereby decreasing overall performance.
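The trade-off described above may be illustrated with a toy timing model. The model is an assumption for illustration only: the processor consumes one whole cache line per time step, a prefetch issued at step t becomes usable at step t + `latency`, and the stall time of a demand miss is ignored. The function name and the parameters `distance` and `latency` are hypothetical.

```python
def count_misses(lines, distance, latency=2):
    """Count L1 misses for a sequence of cache-line numbers.

    `distance` is how many lines ahead the fetch unit prefetches;
    `latency` is the number of steps a prefetch takes to arrive.
    """
    ready_at = {}  # line number -> step at which it is usable in L1
    misses = 0
    for step, line in enumerate(lines):
        if ready_at.get(line, float("inf")) > step:
            misses += 1
            ready_at[line] = step  # the demand fetch fills the line
        for ahead in range(1, distance + 1):
            # Issue a prefetch unless the line is present or in flight.
            ready_at.setdefault(line + ahead, step + latency)
    return misses


# Four sequential lines, then a branch to a non-sequential target.
trace = [0, 1, 2, 3, 100, 101, 102, 103]
print(count_misses(trace, distance=1, latency=1))  # slower processor
print(count_misses(trace, distance=1, latency=2))  # faster processor, one line ahead
print(count_misses(trace, distance=2, latency=2))  # faster processor, two lines ahead
```

In this model, prefetching one line ahead keeps up only while the prefetch latency is a single step. When the processor is fast enough that the latency spans two steps, prefetching two lines ahead restores hits on sequential code, but the first two lines at each new target both miss.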
As the foregoing illustrates, what is needed in the art is a more effective way to prefetch instructions in a system having a multi-level instruction cache hierarchy.