1. Field of the Invention
This application is related to commonly-owned co-pending U.S. patent application Ser. No. 12/031,006, entitled “A 3-DIMENSIONAL L2/L3 CACHE ARRAY TO HIDE TRANSLATION (TLB) DELAYS” filed on the same day as the present application, which is herein incorporated by reference.
2. Description of the Related Art
Modern computer systems typically contain several integrated circuits (ICs), including a processor which may be used to process information in the computer system. The data processed by a processor may include computer instructions which are executed by the processor as well as data which is manipulated by the processor using the computer instructions. The computer instructions and data are typically stored in a main memory in the computer system.
Processors typically process instructions by executing each instruction in a series of small steps. In some cases, to increase the number of instructions being processed by the processor (and therefore increase the speed of the processor), the processor may be pipelined. Pipelining refers to providing separate stages in a processor where each stage performs one or more of the small steps necessary to execute an instruction. In some cases, the pipeline (in addition to other circuitry) may be placed in a portion of the processor referred to as the processor core. Some processors may have multiple processor cores.
As an example of executing instructions in a pipeline, when a first instruction is received, a first pipeline stage may process a small part of the instruction. When the first pipeline stage has finished processing the small part of the instruction, a second pipeline stage may begin processing another small part of the first instruction while the first pipeline stage receives and begins processing a small part of a second instruction. Thus, the processor may process two or more instructions at the same time.
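By way of illustration only, the overlap described above may be sketched as a simple timing model. The sketch assumes that every stage takes exactly one cycle and that no stalls occur; the function name `pipeline_schedule` and its parameters are hypothetical and do not correspond to any particular processor.

```python
def pipeline_schedule(num_instructions, num_stages):
    """Return one row per cycle; each row lists the instruction index
    occupying each stage in that cycle, or None if the stage is empty.

    Assumption: each stage takes one cycle and the pipeline never stalls.
    """
    total_cycles = num_instructions + num_stages - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            # Instruction i enters stage s in cycle i + s.
            instr = cycle - stage
            row.append(instr if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(2, 2)):
        print(cycle, row)
```

With two instructions and two stages, the middle cycle holds instruction 1 in the first stage and instruction 0 in the second stage, which corresponds to the case in which two instructions are processed at the same time.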
To provide for faster access to data and instructions as well as better utilization of the processor, the processor may have several caches. A cache is a memory which is typically smaller than the main memory and is typically manufactured on the same die (i.e., chip) as the processor. Modern processors typically have several levels of caches. The fastest cache which is located closest to the core of the processor is referred to as the Level 1 cache (L1 cache). In addition to the L1 cache, the processor typically has a second, larger cache, referred to as the Level 2 cache (L2 cache). It is not uncommon for modern processors to have other, additional cache levels, for example, an L3 cache and an L4 cache.
To provide the processor with enough instructions to fill each stage of the processor's pipeline, the processor may retrieve instructions from the L2 cache in a group containing multiple instructions, referred to as an instruction line (I-line). The retrieved I-line may be placed in the L1 instruction cache (I-cache) where the core of the processor may access instructions in the I-line. Blocks of data to be processed by the processor may similarly be retrieved from the L2 cache and placed in the L1 data cache (D-cache).
The process of retrieving information from higher cache levels and placing the information in lower cache levels may be referred to as fetching, and typically requires a certain amount of time (latency). For instance, if the processor core requests information and the information is not in the L1 cache (referred to as a cache miss), the information may be fetched from the L2 cache. Each cache miss results in additional latency as the next cache/memory level is searched for the requested information. For example, if the requested information is not in the L2 cache, the processor may look for the information in an L3 cache or in main memory.
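The accumulation of latency across cache levels may be sketched as follows. This is a minimal model under stated assumptions: each level is searched in order, the latency of each searched level is added to the total, and the latency figures and cache contents are hypothetical values chosen only for the example.

```python
def fetch_latency(address, levels, memory_latency):
    """Search each cache level in order and return (where_found, total_cycles).

    levels: list of (name, contents, latency_cycles), ordered from the
    L1 cache outward. All latencies are hypothetical illustration values.
    """
    total = 0
    for name, contents, latency in levels:
        total += latency            # cost of searching this level
        if address in contents:
            return name, total      # hit: stop searching
        # miss: fall through to the next level
    # Not found in any cache level, so main memory is accessed.
    return "memory", total + memory_latency


if __name__ == "__main__":
    levels = [
        ("L1", {0x10}, 2),
        ("L2", {0x10, 0x20}, 10),
        ("L3", {0x10, 0x20, 0x30}, 40),
    ]
    for addr in (0x10, 0x30, 0x90):
        print(hex(addr), fetch_latency(addr, levels, memory_latency=200))
```

In this sketch, a request that misses in the L1 and L2 caches and hits in the L3 cache costs the sum of all three latencies, which reflects the additional latency incurred by each cache miss.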
In some cases, a processor may process instructions and data faster than the instructions and data are retrieved from the caches and/or memory. For example, after an I-line has been processed, it may take time to access the next I-line to be processed (e.g., if there is a cache miss when the L1 cache is searched for the I-line containing the next instruction). While the processor is retrieving the next I-line from higher levels of cache or memory, pipeline stages may finish processing previous instructions and have no instructions left to process (referred to as a pipeline stall). When the pipeline stalls, the processor is underutilized and loses the benefit that a pipelined processor core provides.
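The relationship between fetch latency and pipeline stalls may be approximated with a simple calculation. The sketch below assumes that the pipeline has a known number of cycles of useful work remaining when the fetch begins and that it sits idle for the remainder of the fetch; the function name and the example figures are hypothetical.

```python
def stall_cycles(fetch_latency_cycles, cycles_of_work_left):
    """Estimate the cycles during which the pipeline has nothing to process.

    Assumption: the pipeline keeps working on earlier instructions for
    cycles_of_work_left cycles, then idles until the fetch completes.
    """
    return max(0, fetch_latency_cycles - cycles_of_work_left)


if __name__ == "__main__":
    # Hypothetical figures: 5 cycles of work remain in both cases.
    print(stall_cycles(12, 5))   # shorter fetch
    print(stall_cycles(52, 5))   # longer fetch
```

Under these assumptions, any increase in fetch latency beyond the work remaining in the pipeline translates directly into additional stall cycles.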
L3 and higher caches are generally required to be relatively large in order to have sufficient storage to service a plurality of processors. For example, an L3 cache may be shared by 8 or 16 processor cores. The large size of L3 and higher caches results in much higher access latency for the higher level caches, thereby increasing the number of pipeline stall cycles.
Furthermore, to conserve chip space, L3 and higher caches are typically designed as Dynamic Random Access Memory (DRAM) devices because DRAM devices are significantly smaller than comparable Static Random Access Memory (SRAM) devices. However, one problem with using DRAM devices is the relatively longer access time in comparison to SRAM devices. The longer access time to retrieve data from a DRAM-based L3 cache after a cache miss in the L2 cache may result in a further increase in the number of pipeline stall cycles during which the processors are unable to process instructions. Therefore, overall performance and efficiency may be adversely affected.
Accordingly, there is a need for improved methods of retrieving data from an L3 cache.