Many different types of computing systems have attained widespread use around the world. These computing systems include personal computers, servers, mainframes and a wide variety of stand-alone and embedded computing devices. Sprawling client-server systems exist, with applications and information spread across many PC networks, mainframes and minicomputers. In a distributed system connected by networks, a user may access many application programs, databases, network systems, operating systems and mainframe applications. Computers provide individuals and businesses with a host of software applications including word processing, spreadsheet, accounting, e-mail, voice over Internet protocol telecommunications, and facsimile.
Users of digital processors such as computers continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. In addition, processing speed has increased much more quickly than that of main memory accesses. As a result, cache memories, or caches, are often used in many such systems to increase performance in a relatively cost-effective manner. Many modern computers also support “multi-tasking” or “multi-threading” in which two or more programs, or threads of programs, are run in alternation in the execution pipeline of the digital processor. Thus, multiple program actions can be processed concurrently using multi-threading.
At present, every general purpose computer, from servers to low-power embedded processors, includes at least a first level cache L1 and often a second level cache L2. This dual cache memory system enables storing frequently accessed data and instructions close to the execution units of the processor to minimize the time required to transmit data to and from memory. L1 cache is typically on the same chip as the execution units. L2 cache is typically external to the processor chip but physically close to it. Accessing the L1 cache is faster than accessing the more distant system memory. Ideally, as the time for execution of an instruction nears, instructions and data are moved to the L2 cache from a more distant memory. When the time for executing the instruction is near imminent, the instruction and its data, if any, is advanced to the L1 cache. Moreover, instructions that are repeatedly executed may be stored in the L1 cache for a long duration. This reduces the occurrence of long latency system memory accesses.
As the processor operates in response to a clock, an instruction fetcher accesses data and instructions from the L1 cache and controls the transfer of instructions from more distant memory to the L1 cache. A cache miss occurs if the data or instructions sought are not in the cache when needed. The processor would then seek the data or instructions in the L2 cache. A cache miss may occur at this level as well. The processor would then seek the data or instructions from other memory located further away. Thus, each time a memory reference occurs which is not present within the first level of cache, the processor attempts to obtain that memory reference from a second or higher level of memory. When a data cache miss occurs, the processor suspends execution of the instruction calling for the missing data while awaiting retrieval of the data. While awaiting the data, the processor execution units could be operating on another thread of instructions. In a multi-threading system the processor would switch to another thread and execute its instructions while operation on the first thread is suspended. Thus, thread selection logic is provided to determine which thread is to be next executed by the processor.
A common architecture for high performance, single-chip microprocessors is the reduced instruction set computer (RISC) architecture characterized by a small simplified set of frequently used instructions for rapid execution. Thus, in a RISC architecture, a complex instruction comprises a small set of simple instructions that are executed in steps very rapidly. As semiconductor technology has advanced, the goal of RISC architecture has been to develop processors capable of executing one or more instructions on each clock cycle of the machine. Execution units of modern processors therefore have multiple stages forming an execution pipeline. On each cycle of processor operation, each stage performs a step in the execution of an instruction. Thus, as a processor cycles, an instruction advances through the stages of the pipeline. As it advances, the steps in the execution of the instruction are performed. Moreover, in a superscalar architecture, the processor comprises multiple execution units operating in parallel to execute different instructions in parallel.
The L1 cache of a processor stores copies of recently executed, and soon-to-be-executed, instructions. These copies are obtained from “cache lines” of system memory. A cache line is a unit of system memory from which an instruction to be stored in the cache is obtained. The address or index of a cache entry may be determined from the lower order bits of the system memory address of the cache line to be stored at that entry. Multiple system memory addresses therefore map into the same cache index. The higher order bits of the system memory address form a tag. The tag is stored with the instruction in the cache entry corresponding to the lower order bits. The tag uniquely identifies the instruction with which it is stored.
A collision occurs when an instruction is called from system memory that maps into a cache entry where a valid instruction is already stored. When this occurs the processor makes a decision whether to overwrite the stored instruction. A way to reduce collisions is to increase the associativity of the cache. An n-way associative cache maps each one of a multiple of system memory addresses to one of n cache memory locations within a cache entry. The tag bits at a cache entry are compared to the tag derived from the program counter that points to the system memory address of the instruction. In this way, the correct instruction to be retrieved from the cache is identified. A 1-way associative cache is called a direct mapped cache and results in a high probability of collisions. In an n-way associative cache, a collision can be avoided by selectively placing the instruction from system memory into an empty one of the n locations within the cache entry. Thus, an n-way associative cache reduces collisions. Indeed, a fully associative cache might avoid collisions altogether. However, the cost and space of implementing associativity increases with increased associativity. Therefore, an improved method for reducing collisions without the cost of increased associativity is needed.