1. Field of the Invention
This invention relates generally to the field of data processing systems, and, more particularly, to cache memory used in data processing systems. Specifically, the present invention relates to a cache memory architecture with way prediction.
2. Description of the Related Art
The demand for quicker and more powerful personal computers has led to many technological advances in the computer industry, including the development of faster memories. Historically, the performance of a personal computer has been directly linked to the efficiency with which data can be accessed from memory, often expressed as the memory access time. Generally, the performance of a central processing unit (CPU or microprocessor), which functions at a high speed, has been hindered by slow memory access times. Therefore, to expedite access to main memory data, cache memories have been developed for storing frequently used information.
A cache is a relatively small high-speed memory that is used to hold the contents of the most recently utilized blocks of main storage. A cache bridges the gap between fast processor cycle time and slow memory access time. Using this very fast memory, the microprocessor can reduce the number of wait states that are interposed during memory accesses. When the processor issues a load instruction to the cache, the cache checks its contents to determine whether the data is present. If the data is already present in the cache (termed a "hit"), the data is forwarded to the CPU with practically no wait. If, however, the data is not present (termed a "miss"), the cache must retrieve the data from a slower, secondary memory source, which may be the main memory or, in a multi-level cache memory system, another cache. In addition, the retrieved information is also copied (i.e., stored) into the cache memory so that it is readily available to the microprocessor for future use.
Most cache memories have a similar physical structure. Caches generally have two major subsystems, a tag subsystem (also referred to as a cache tag array) and a memory subsystem (also known as a cache data array). The tag subsystem holds the addresses and determines whether there is a match for a requested datum, and the memory subsystem stores and delivers the data upon request. Thus, typically, each tag entry is associated with a data array entry, where each tag entry stores index information relating to its data array entry. Some data processing systems have several cache memories (i.e., a multi-level cache system), in which case each data array will have a corresponding tag array to store addresses.
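By way of illustration only, the hit and miss behavior and the division into a tag array and a data array described above may be sketched as follows. The sketch assumes a direct-mapped organization with one datum per line and a backing store modeled as a dictionary; the class name DirectMappedCache and its parameters are hypothetical and are not taken from any particular cache design.

```python
class DirectMappedCache:
    """Toy direct-mapped cache with separate tag and data arrays (illustrative only)."""

    def __init__(self, num_lines, backing):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # tag array: identifies the block held in each line
        self.data = [None] * num_lines   # data array: the cached contents of each line
        self.backing = backing           # slower memory, modeled here as a dict

    def load(self, block_addr):
        """Return (value, hit) for the requested block address."""
        index = block_addr % self.num_lines    # selects one line in both arrays
        tag = block_addr // self.num_lines     # distinguishes blocks that share a line
        if self.tags[index] == tag:
            return self.data[index], True      # hit: data array delivers the value
        # Miss: fetch from the slower memory and keep a copy for future use.
        value = self.backing[block_addr]
        self.tags[index] = tag
        self.data[index] = value
        return value, False
```

In this sketch a second load of the same address hits, while a load of a different address that maps to the same line evicts the earlier entry.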
Utilizing a multi-level cache memory system can generally improve the performance of a central processing unit. In a multi-level cache infrastructure, a series of caches can be linked together, where each cache is accessed serially by the microprocessor. For example, in a three-level cache system, the microprocessor will first access the L0 cache for data, and in case of a miss, it will access the L1 cache. If the L1 cache does not contain the data, it will access the L2 cache before accessing the main memory. Since caches are typically smaller and faster than the main memory, the general trend is to design modern-day computers using a multi-level cache system.
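The serial search through L0, L1, and L2 described above may be sketched as follows. This is a minimal sketch under stated assumptions: each cache is modeled as an unbounded dictionary, and on a hit or a main memory access the value is copied into every level that missed. The fill policy and the function name load are assumptions made for illustration and are not drawn from any specific processor.

```python
def load(address, levels, main_memory):
    """Search the caches in order; return (value, name of the level that supplied it).

    levels is a list of (name, dict) pairs ordered from fastest to slowest.
    """
    for i, (name, cache) in enumerate(levels):
        if address in cache:
            value, source = cache[address], name
            break
    else:
        # Every cache level missed, so fall back to main memory.
        value, source = main_memory[address], "main memory"
        i = len(levels)
    # Assumed fill policy: copy the value into each level that missed.
    for _, cache in levels[:i]:
        cache[address] = value
    return value, source
```

For example, an address present only in the L2 dictionary is first supplied by "L2", and a repeated request for the same address is then supplied by "L0".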
To further improve the performance of a central processing unit, computer architects developed the concept of pipelines for parallel processing. The first step in achieving parallel processing is to decompose the process at hand into stages. Typically, a computer executes all the stages of the process serially. This means that the execution of all the stages of the process must be complete before the next process is begun. A computer often executes the same staged process many times in succession. Rather than simply executing each staged process serially, the microprocessor can speed up the processing through pipelining, in which the stages of the repeating process are overlapped.
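The benefit of overlapping stages can be made concrete with a small cycle count. The sketch below assumes an idealized pipeline in which every stage takes one cycle, a new process enters the pipeline every cycle, and no stalls occur; the function names are hypothetical.

```python
def serial_cycles(num_processes, num_stages):
    """Cycles needed when each process must finish all stages before the next begins."""
    return num_processes * num_stages


def pipelined_cycles(num_processes, num_stages):
    """Cycles needed when stages overlap and one process enters per cycle (no stalls)."""
    if num_processes == 0:
        return 0
    # The first process takes num_stages cycles; each later one finishes one cycle after.
    return num_stages + (num_processes - 1)
```

Under these assumptions, four repetitions of a three-stage process take twelve cycles serially and six cycles when pipelined.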
The concept of pipelining has now extended to memory caches as well. Pipelines can enhance the throughput of a cache memory system, where the throughput is defined as the number of cache memory access operations that can be performed in any one time period. Because caches are typically accessed serially, and can be decomposed into stages, it is possible to use pipelines to speed up the accessing process. In fact, modern data processing systems achieve even greater efficiency by applying the art of pipelining to multi-level cache memory systems.
An example of a two-level pipelined cache system is illustrated in FIG. 1, which stylistically depicts the L1 and L2 cache stages 5-30 of the Intel Pentium® Pro System Architecture. It takes three stages 5, 10, and 15 to complete an access of the L1 cache (not shown), and three additional stages 20, 25, and 30 to complete an access of the L2 cache (not shown). Each stage takes one cycle to complete. In the first stage 5, when a request for a load or store is issued, the address is provided to the L1 cache (not shown). During the second and the third stages 10, 15, the lookup takes place and, in case of a hit, the data transfer occurs. If the access is a miss in the L1 cache (not shown), then the request enters the fourth stage 20, where the address is submitted to the L2 cache (not shown). During the fifth stage 25, the lookup takes place and, if a hit, the data is transferred during the sixth stage 30. In summary, a load request that hits the L1 cache (not shown) completes in three clocks, while one that misses the L1 cache (not shown) but hits the L2 cache (not shown) completes in six clocks. If the load request misses the L2 cache (not shown), then the request is forwarded to the main memory (not shown).
FIG. 2 is a timing diagram illustrating an example of the Intel Pentium® Pro Architecture's two-level pipelined cache being accessed by the microprocessor (not shown). As illustrated in the figure, the microprocessor (not shown) makes four different cache accesses (i.e., requests) 32-35. The first access 32 results in an L1 cache hit and, as a result, the request is completed within three stages. The second access 33, however, misses in the L1 cache (not shown), and the request is then forwarded to the L2 cache (not shown). Thus, it takes six stages to retrieve data from the L2 cache (not shown). Because the L1 and L2 caches (not shown) are pipelined, the first and the second accesses 32 and 33 complete in a total of seven clock cycles. However, in a non-pipelined cache system (not shown), this process would require nine clock cycles, because the first access would have to complete before the second access initiates. That is, the earliest the second access can initiate is during the fourth clock cycle, and not during the second clock cycle, as it does in a pipelined cache system. The third and fourth accesses 34 and 35 are shown only to further illustrate how pipelined caches can improve the throughput of cache memories by processing multiple requests simultaneously.
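The seven-cycle and nine-cycle figures above follow from simple arithmetic, which may be checked with the sketch below. It assumes one cycle per stage, clock cycles numbered from one, and, in the pipelined case, that a new access issues on every cycle without contention; the constant and function names are hypothetical.

```python
L1_HIT_STAGES = 3   # access that hits the L1 cache
L2_HIT_STAGES = 6   # access that misses the L1 cache but hits the L2 cache


def completion_cycles(latencies, pipelined):
    """Return the clock cycle (starting at 1) in which each access completes."""
    done = []
    start = 1
    for latency in latencies:
        finish = start + latency - 1
        done.append(finish)
        # Pipelined: the next access issues one cycle later.
        # Non-pipelined: the next access waits for this one to finish.
        start = start + 1 if pipelined else finish + 1
    return done


accesses = [L1_HIT_STAGES, L2_HIT_STAGES]        # first access 32, second access 33
print(completion_cycles(accesses, pipelined=True))    # [3, 7]
print(completion_cycles(accesses, pipelined=False))   # [3, 9]
```

In the pipelined case the second access starts in cycle two and completes in cycle seven; in the non-pipelined case it cannot start until cycle four and completes in cycle nine.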
As the number of levels in a multi-level pipelined cache memory system has increased, so has the number of pipeline stages required to support the added levels. Generally, the number of pipeline stages required to support a cache memory is proportional to the number of clock cycles required to access that memory. For a given frequency, a pipeline with more stages requires more circuitry, which not only adds to the expense of implementing pipelines, but also hinders performance and consumes additional power. It is therefore desirable to have a cache memory architecture that reduces the required number of pipeline stages, yet achieves equal or better performance.
In a multi-level cache system, it is not uncommon to find level-one, or even level-two, caches on the same silicon die as the microprocessor core. To enhance system performance, it is often desirable to fit as much cache memory as possible on the CPU core itself. When the cache is on the CPU core, the microprocessor can directly access the cache without the additional step of accessing an external bus. However, because the CPU core is of a limited size, and because cache memories require large amounts of space, it is impractical to include more than one or two caches on the CPU core. Thus, there is a need for an improved cache architecture that offers faster access to the cache, yet does not demand a large amount of real estate on the CPU core.
One solution the prior art has to offer to the above problem is the use of a dedicated bus, which couples a cache on the CPU core to one that resides off the core. In the Intel Pentium® Pro Processor, for example, the level-one cache, L1, resides on the microprocessor core, while the level-two cache, L2, resides on a separate die. The L1 cache has a dedicated bus, sometimes referred to as the backside bus, directly coupled to the L2 cache for quick access. But even a dedicated bus has disadvantages in certain circumstances. First, accessing the remote cache will take longer because the information has to first be placed on, and later retrieved from, the backside bus. Second, controlling the input and output pins of the external bus consumes additional power.
The present invention is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above.