The speed at which processors can execute instructions has typically outpaced the speed at which memory systems can supply the instructions and data to the processors. Due to this discrepancy in the operating speeds of the processors and system memory, the system memory architecture plays a major role in determining the actual performance of the system. Most current memory hierarchies utilize cache memory in an attempt to minimize memory access latencies.
Cache memory is used to provide faster access to frequently used instructions and data, which helps improve the overall performance of the system. Cache memory is able to provide faster access for two primary reasons. First, cache memory is generally implemented with static random access memory (“SRAM”), which is substantially faster than dynamic random access memory (“DRAM”) that is normally used as system memory. Second, cache memory is normally coupled to the processor directly through a processor bus and thus has a hierarchy that places it closer to the processor. In memory hierarchy, the closer to the processor that the memory resides, the higher the performance of the memory and the overall system. Cache memory is effective to increase the speed at which programs can be executed because programs frequently reuse the same instructions and data. When data or instructions are read from main memory, a copy is usually saved in the cache memory (a cache tag is usually updated as well). The cache then monitors subsequent requests for data and instructions to see if the requested information has already been stored in the cache. If the data has been stored in the cache, which is known as a “cache hit,” it is delivered with low latency to the processor. If, on the other hand, the information is not in the cache, which is known as a “cache miss,” it must be fetched at a much higher latency from the system memory.
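By way of illustration only, the hit and miss behavior described above can be sketched with a toy direct-mapped cache model. The class name, the line count, and the line size are hypothetical and are not taken from any particular processor; the sketch tracks only tags, not the cached data itself.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: each address maps to exactly one cache line."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # one tag per line; None means empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit, False on a cache miss."""
        block = address // self.line_size   # which memory block is requested
        index = block % self.num_lines      # which cache line it maps to
        tag = block // self.num_lines       # identifies the block within that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: the block would be fetched from system memory here,
        # and the cache tag is updated to record the new contents.
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache(num_lines=4, line_size=16)
results = [cache.access(a) for a in (0, 4, 64, 0, 16, 20)]
print(results, cache.hits, cache.misses)
```

In this sketch, addresses 0 and 64 map to the same line, so the second access to address 0 misses even though it was cached earlier; addresses 4 and 20 hit because they fall in a line that was just filled.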
In more advanced processor-based systems, there are multiple levels (usually two levels) of cache memory. The first cache level, or level one (L1) cache, is typically the fastest memory in the system and is usually integrated on the same chip as the processor. The L1 cache is faster because it is integrated with the processor and thus has a higher level of hierarchy. This higher level of hierarchy avoids delays associated with transmitting information to, and receiving information from, an external chip. The L1 cache also generally operates at the speed of the processor, which is usually faster than the speed of external components. However, since it resides on the same die as the processor, the L1 cache must be relatively small (e.g., 32 KB in the Intel® Pentium® III processor, 128 KB in the AMD Athlon™ processor).
A second cache level, or level two (L2) cache, is typically located on a different chip than the processor and has a larger capacity than the L1 cache (e.g., 512 KB in the Intel® Pentium® III and AMD Athlon™ processors). The L2 cache is slower than the L1 cache, but because it is relatively close to the processor, it is still many times faster than the system memory, which has an even lower level of memory hierarchy. Recently, small L2 cache memories have been placed on the same chip as the processor to speed up the performance of L2 cache memory accesses.
When data is not found in the highest level of the memory hierarchy and a cache miss occurs, the data must be accessed from a lower level of the memory hierarchy. Since each level contains increased amounts of storage, the probability increases that the data will be found. However, each level typically increases the latency or number of cycles it takes to transfer the data to the processor.
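The trade-off between hit probability and latency can be illustrated with a short calculation of the expected access time. The following is a minimal sketch that assumes each level is searched only after the level above it has missed, so that latencies accumulate; the function name, hit rates, and cycle counts are hypothetical values chosen for illustration.

```python
def average_access_cycles(levels, memory_cycles):
    """Expected cycles per access for a serially searched hierarchy.

    levels: list of (hit_rate, cycles) tuples, highest level first.
    memory_cycles: cycles to access system memory after every level misses.
    """
    expected = 0.0
    reach = 1.0       # probability that a request reaches the current level
    cumulative = 0    # cycles spent so far on the way down the hierarchy
    for hit_rate, cycles in levels:
        cumulative += cycles
        expected += reach * hit_rate * cumulative
        reach *= 1.0 - hit_rate
    # Requests that miss every cache level go to system memory.
    expected += reach * (cumulative + memory_cycles)
    return expected


# Hypothetical L1 (90% hits, 2 cycles) and L2 (80% hits, 10 cycles),
# backed by a 100-cycle system memory.
avg = average_access_cycles([(0.9, 2), (0.8, 10)], memory_cycles=100)
print(avg)
```

With these assumed figures the expected cost is about 5 cycles per access, even though a system memory access alone costs 100 cycles, because only about 2% of requests miss both cache levels.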
FIG. 1 illustrates a typical processor-based system 10 having two levels of cache memory hierarchy. The system 10 includes a processor 20 having an on-board L1 cache 22 that is fabricated on the same chip as the processor 20. The processor 20 is coupled to an off-chip or external L2 cache 24. The system 10 includes a system chipset comprising a system controller 60 (also known as a “north bridge”) and a bus bridge 80 (also known as a “south bridge”). As known in the art, the chipset is the functional core of the system 10. As will be described below, the system controller 60 and bus bridge 80 are used to connect two or more busses and are responsible for routing information to and from the processor 20 and the other devices in the system 10 over the busses to which they are connected.
The system controller 60 contains an accelerated graphics port (“AGP”) interface 62, a PCI interface 64 and a host interface 66. Typically, the processor 20 is referred to as the host and is connected to the host interface 66 of the system controller 60 via a host bus 30. The system 10 includes a system memory 50 connected to a memory controller 67 in the system controller 60 via a memory bus 34. The typical system 10 may also include an AGP device 52, such as a graphics card, connected to the AGP interface 62 of the system controller 60 via an AGP bus 32. Furthermore, the typical system 10 may include a PCI device 56 connected to the PCI interface 64 of the system controller 60 via a PCI bus 36.
The PCI interface 64 is also typically connected to the bus bridge 80 via the PCI bus 36. A single PCI bus 36 may be used, as shown in FIG. 1, or, alternatively, individual PCI busses may be used if so desired. The bus bridge 80 may be coupled through an expansion bus, such as an industry standard architecture (“ISA”) bus 42, to a real-time clock (“RTC”) 82, a power management component 84 and various legacy components 86 (e.g., a floppy disk controller and certain direct memory access (“DMA”) and complementary metal-oxide semiconductor (“CMOS”) memory registers) of the system 10. A basic input/output system (“BIOS”) read only memory (“ROM”) 96 and a low pin count (“LPC”) device 94 are also connected to the bus bridge 80 via the ISA bus 42. Examples of LPC devices 94 include various controllers and recording devices. The BIOS ROM 96 contains, among other things, the set of instructions that initialize the processor 20 and other components in the system 10. Although not illustrated, the bus bridge 80 may also contain interrupt controllers, such as the input/output (“I/O”) advanced programmable interrupt controller (“APIC”). The bus bridge 80 may also be connected to a universal serial bus (“USB”) device 92 via a USB bus 38, and to an integrated drive electronics (“IDE”) device 90 via an IDE bus 40. Examples of a USB device 92 include a scanner or a printer. Examples of an IDE device 90 include a floppy disk or a hard drive. It should be appreciated that the type of device connected to the bus bridge 80 is system dependent.
As can be seen from FIG. 1, when the processor 20 cannot access information from one of the two caches 22, 24, it is forced to access the information from the system memory 50. As a result, at least two buses 30, 34 and the components of the system controller 60 must be involved to access the information from the system memory 50, thereby increasing the latency of the access. Increased latency reduces the system bandwidth and overall performance. Memory access times are further compounded when other devices, e.g., the AGP device 52 or the PCI device 56, are competing with the processor 20 by simultaneously requesting information from the cache and system memories.
Attempts have been made to solve or at least alleviate the above-described problems by integrating a third level of cache, known as “L3 cache” 68, in the system controller 60, and preferably as part of the memory controller 67. This L3 cache is also known as “eDRAM” because it is normally implemented with dynamic random access memory (“DRAM”) embedded in the same integrated circuit in which the system controller 60 is fabricated. Since the L3 cache 68 is closer to the processor 20 than the system memory 50, it has a higher hierarchy and thus a lower latency than the system memory 50. More specifically, the processor 20 can access the L3 cache 68 without having to send or receive information over the memory bus 34. Instead, the processor 20 need only receive data or instructions over the host bus 30. As a result, instructions and data can be read from the L3 cache 68 significantly faster than instructions and data can be read from the system memory 50. Furthermore, since the L3 cache 68 can be implemented with eDRAM, it is economically and technically feasible to make the L3 cache 68 much larger than the L1 and L2 caches 22, 24, respectively, thus reducing the likelihood of a cache miss. The use of an eDRAM L3 cache 68 can therefore increase the system bandwidth and overall performance of the processor based system 10.
Although an L3 cache 68 can increase system bandwidth, the latency of the L3 cache 68 is less than optimum because of delays in initiating an access to the L3 cache 68. More specifically, the processor 20 or other memory access device does not attempt to access the L3 cache 68 until a tag array (not shown in FIG. 1) in the L3 cache 68 has been accessed to determine if the requested data or instructions are stored in the L3 cache 68. In the event of a cache hit, the requested data or instructions are transferred from the eDRAM array (not shown in FIG. 1) to the processor 20 or other memory requester. Thus, in the event of a cache hit, the requested data or instructions are not transferred until two memory accesses have occurred, i.e., an access to the tag array and an access to the eDRAM array. As a result, the access to the L3 cache 68 is not completed for a considerable period after the processor 20 has initially attempted to access data or instructions from the L1 cache 22.
The presence of the L3 cache 68 can also cause an increase in the access latency to the system memory 50 in the event data or instructions are not stored in any of the L1, L2 or L3 caches 22, 24, 68, respectively. The primary reason that the presence of the L3 cache 68 can increase the access latency of the system memory 50 is that the processor 20 or other memory access device does not initiate an access to the system memory 50 until the processor 20 has attempted to access the L3 cache 68 and detected a cache miss. As a result, the access to the system memory 50 is not started for a considerable period after the processor 20 has attempted to access data or instructions from the L3 cache 68.
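The two latency penalties described above can be illustrated with a simple cycle-count sketch of the serial access scheme. All cycle counts and names below are hypothetical values chosen only for illustration; they are not measurements of any actual system controller.

```python
# Hypothetical cycle counts, assumed for illustration only.
TAG_LOOKUP = 4             # access to the L3 tag array
EDRAM_ACCESS = 12          # access to the L3 eDRAM array
SYSTEM_MEMORY_ACCESS = 60  # access to system memory over the memory bus


def l3_request_cycles(hit):
    """Cycles to satisfy a request that has reached the L3 cache when
    the tag lookup must finish before any data access is started."""
    if hit:
        # Hit: tag array access, then eDRAM array access, one after the other.
        return TAG_LOOKUP + EDRAM_ACCESS
    # Miss: the system memory access starts only after the miss is detected.
    return TAG_LOOKUP + SYSTEM_MEMORY_ACCESS


hit_cycles = l3_request_cycles(hit=True)
miss_cycles = l3_request_cycles(hit=False)
miss_penalty = miss_cycles - SYSTEM_MEMORY_ACCESS  # cost added by the L3 cache
print(hit_cycles, miss_cycles, miss_penalty)
```

Under these assumptions, a hit costs the sum of both accesses, and a miss costs more than it would in a system with no L3 cache at all, by exactly the duration of the tag lookup.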
There is therefore a need for an embedded L3 cache and a method of operating an embedded L3 cache that has reduced latency and that does not increase the latency of accesses to system memory.