The present invention relates generally to cache memory structures for a processor based system and, more particularly, to an apparatus that utilizes embedded dynamic random access memory (eDRAM) as a level three (L3) cache in the system chipset of a processor based system.
The ability of processors to execute instructions has typically outpaced the ability of memory systems to supply the instructions and data to the processors. Due to the discrepancy in the operating speeds of the processors and system memory, the processor system's memory hierarchy plays a major role in determining the actual performance of the system. Most of today's memory hierarchies utilize cache memory in an attempt to minimize memory access latencies.
Cache memory is used to provide faster access to frequently used instructions and data, which helps improve the overall performance of the system. Cache technology is based on the premise that programs frequently reuse the same instructions and data. When data is read from main memory, a copy is usually saved in the cache memory (a cache tag is usually updated as well). The cache then monitors subsequent requests for data (and instructions) to see if the requested information has already been stored in the cache. If the data has been stored in the cache, it is delivered with low latency to the processor. If, on the other hand, the information is not in the cache, it must be fetched at a much higher latency from the system main memory.
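By way of illustration only, the lookup behavior described above can be sketched in a few lines of Python. The sketch assumes a direct-mapped organization, which the text does not specify; the class name, the line count, and the line size are hypothetical and chosen only to keep the example small.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: one tag per line, data payload not modeled."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines      # hypothetical number of cache lines
        self.line_size = line_size      # hypothetical bytes per line
        self.tags = [None] * num_lines  # None marks a line that holds nothing yet
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and update its tag."""
        block = address // self.line_size   # which memory block holds the address
        index = block % self.num_lines      # the single line that block may occupy
        tag = block // self.num_lines       # identifies the block within that line
        if self.tags[index] == tag:
            self.hits += 1                  # low-latency path: data already cached
            return True
        self.tags[index] = tag              # copy saved in the cache, tag updated
        self.misses += 1                    # high-latency path: fetch from main memory
        return False


cache = DirectMappedCache(num_lines=4, line_size=16)
# Address 4 shares a block with address 0, so it hits.
# Address 64 maps to the same line as address 0 and evicts it.
results = [cache.access(a) for a in (0, 4, 64, 0)]
```

The second access hits because the program reuses a block that was just fetched, which is the premise on which cache technology rests. The final access misses again because a conflicting block displaced the first one in the meantime.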
In more advanced processor based systems, there are multiple levels (usually two levels) of cache memory. The levels are organized such that a small amount of very high speed memory is placed close to the processor while denser, slower memory is placed further away. In the memory hierarchy, the closer to the processor that the data resides, the higher the performance of the memory and the overall system. When data is not found in the highest level of the hierarchy and a miss occurs, the data must be accessed from a lower level of the memory hierarchy. Since each level contains increased amounts of storage, the probability increases that the data will be found. However, each level typically increases the latency or number of cycles it takes to transfer the data to the processor.
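The tradeoff between hit probability and latency described above can be made concrete with a short average access time calculation. The following Python sketch is illustrative only: it assumes the levels are probed one after another, and the latencies (in cycles) and hit rates are hypothetical values, not measurements of any particular system.

```python
def average_access_time(levels, memory_latency):
    """Expected cycles per access for a serially probed memory hierarchy.

    levels: list of (latency_cycles, hit_rate) pairs, ordered from the level
            closest to the processor to the level furthest away.
    memory_latency: cycles to fetch from main memory after every level misses.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as the current level
    for latency, hit_rate in levels:
        total += reach * latency   # every access reaching this level pays its latency
        reach *= 1.0 - hit_rate    # only the misses continue to the next level
    return total + reach * memory_latency


# Hypothetical figures: 2-cycle L1 with a 90% hit rate, 10-cycle L2 with an
# 80% hit rate on L1 misses, and a 100-cycle main memory.
one_level = average_access_time([(2, 0.9)], memory_latency=100)              # about 12 cycles
two_levels = average_access_time([(2, 0.9), (10, 0.8)], memory_latency=100)  # about 5 cycles
```

Under these assumed numbers, the second level adds latency to every first-level miss, yet it lowers the average because far fewer accesses go all the way to main memory.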
The first cache level, or level one (L1) cache, is typically the fastest memory in the system and is usually integrated on the same chip as the processor. The L1 cache is faster because it is integrated with the processor, which avoids delays associated with transmitting information to, and receiving information from, an external chip. The lone caveat is that the L1 cache must be small (e.g., 32 KB in the Intel® Pentium® III processor, 128 KB in the AMD Athlon™ processor) since it resides on the same die as the processor.
A second cache level, or level two (L2) cache, is typically located on a different chip than the processor and has a larger capacity than the L1 cache (e.g., 512 KB in the Intel® Pentium® III and AMD Athlon™ processors). The L2 cache is slower than the L1 cache, but because it is relatively close to the processor, it is still many times faster than the main system memory. Recently, small L2 cache memories have been placed on the same chip as the processor to speed up the performance of L2 cache memory accesses.
Many current processor systems consist of a processor with an on-chip L1 static random access memory (SRAM) cache and a separate off-chip L2 SRAM cache. In some systems, a small L2 SRAM cache has been moved onto the same chip as the processor and L1 cache, in which case a smaller L2 cache size is traded for reduced latency. In other systems, the size of the L1 cache has been increased by moving it onto a separate chip, thus trading off a larger L1 cache for the increased latency and reduced bandwidth that result from off-chip accesses. These options are attempts to achieve the highest system performance by optimizing the memory hierarchy. In each case, various tradeoffs between size, latency, and bandwidth are made in an attempt to deal with the conflicting requirements of obtaining more, faster, and closer memory.
FIG. 1 illustrates a typical processor based system 10 having a memory hierarchy with two levels of cache memory. The system 10 includes a processor 20 having an on-board L1 cache 22. The processor 20 is coupled to an off-chip or external L2 cache 24. The system 10 includes a system chipset comprised of a north bridge 60 and a south bridge 80. As known in the art, the chipset is the functional core of the system 10. As will be described below, the bridges 60, 80 are used to connect two or more busses and are responsible for routing information to and from the processor 20 and the other devices in the system 10 over the busses they are connected to.
The north bridge 60 contains a PCI (peripheral component interconnect) to AGP (accelerated graphics port) interface 62, a PCI to PCI interface 64 and a host to PCI interface 66. Typically, the processor 20 is referred to as the host and is connected to the north bridge 60 via a host bus 30. The system 10 includes a system memory 50 connected to the north bridge 60 via a memory bus 34. The typical system 10 may also include an AGP device 52, such as a graphics card, connected to the north bridge 60 via an AGP bus 32. Furthermore, the typical system 10 may include a PCI device 56 connected to the north bridge 60 via a PCI bus 36a.
The north bridge 60 is typically connected to the south bridge 80 via a PCI bus 36b. The PCI busses 36a, 36b may be individual busses or may be part of the same bus if so desired. The south bridge 80 usually contains a real-time clock (RTC) 82, a power management component 84 and the legacy components 86 (e.g., floppy disk controller and certain DMA (direct memory access) and CMOS (complementary metal-oxide semiconductor) memory registers) of the system 10. Although not illustrated, the south bridge 80 may also contain interrupt controllers, such as the input/output (I/O) APIC (advanced programmable interrupt controller).
The south bridge 80 may be connected to a USB (universal serial bus) device 92 via a USB bus 38, an IDE (integrated drive electronics) device 90 via an IDE bus 40, and/or an LPC (low pin count) device 94 via an LPC/ISA (industry standard architecture) bus 42. The system's BIOS (basic input/output system) ROM (read only memory) 96 is also connected to the south bridge 80 via the LPC/ISA bus 42. The BIOS ROM 96 contains, among other things, the set of instructions that initialize the processor 20 and other components in the system 10. Examples of a USB device 92 include a scanner or a printer. Examples of an IDE device 90 include floppy disk or hard drives, and examples of LPC devices 94 include various controllers and recording devices. It should be appreciated that the type of device connected to the south bridge 80 is system dependent.
As can be seen from FIG. 1, when the processor 20 cannot access information from one of the two caches 22, 24, it is forced to access the information from the system memory 50. This means that at least two buses 30, 34 and the components of the north bridge 60 must be involved to access the information from the system memory 50, which increases the latency of the access. Increased latency reduces the system bandwidth and overall performance. Accordingly, there is a desire and need for a third level of high speed cache memory ("L3 cache") that is closer to the processor 20 than the system memory 50 is. Moreover, it is desirable that the L3 cache be much larger than the L1 and L2 caches 22, 24, yet not substantially increase the size of the system 10.
Additionally, it should be noted that memory access times are further lengthened when other devices, e.g., the AGP device 52 or the PCI device 56, compete with the processor 20 by simultaneously requesting information from the cache and system memories. Accordingly, there is a desire and need for an L3 cache that allows several requesting devices to access its contents simultaneously.
The present invention provides a third level of high speed cache memory (L3 cache) for a processor based system that is closer to the system processor than the system memory is, which reduces average memory latency and thus increases system bandwidth and overall performance.
The present invention also provides an L3 cache for a processor based system that is much larger than the L1 and L2 caches, yet does not substantially increase the size of the system.
The present invention further provides an L3 cache for a processor based system that allows several requesting devices of the system to simultaneously access the contents of the L3 cache.
The above and other features and advantages are achieved by a large L3 cache that is integrated within the system chipset. The L3 cache is comprised of multiple embedded memory cache arrays. Each array is accessible independently of the others, providing parallel access to the L3 cache. By placing the L3 cache within the chipset, it is closer to the system processor than the system memory is. By using independent arrays, the L3 cache can handle numerous simultaneous requests. This reduces average memory latency and thus increases system bandwidth and overall performance. By using embedded memory, the L3 cache can be implemented on the chipset and be much larger than the L1 and L2 caches without substantially increasing the size of the chipset and system.
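By way of illustration only, the benefit of independently accessible arrays can be sketched as follows. The text above does not state how addresses are distributed among the arrays; the sketch assumes, purely as an example, that consecutive cache lines are interleaved across the arrays and that each array serves one request per cycle. The function names and the 64-byte line size are hypothetical.

```python
from collections import Counter


def array_for(address, num_arrays, line_size=64):
    """Assumed mapping: interleave consecutive cache lines across the arrays."""
    return (address // line_size) % num_arrays


def cycles_to_serve(addresses, num_arrays, line_size=64):
    """Cycles needed to serve simultaneous requests, one request per array per cycle."""
    if not addresses:
        return 0
    # Requests that land on the same array must wait for one another;
    # requests that land on different arrays proceed in parallel.
    load = Counter(array_for(a, num_arrays, line_size) for a in addresses)
    return max(load.values())


# Four simultaneous requests, e.g. from the processor and other requesting devices.
requests = [0x0000, 0x0040, 0x0080, 0x00C0]
single_array = cycles_to_serve(requests, num_arrays=1)   # all four requests serialize
four_arrays = cycles_to_serve(requests, num_arrays=4)    # each request hits its own array
conflict = cycles_to_serve([0x0000, 0x0100], num_arrays=4)  # both map to array 0
```

Under these assumptions, four arrays serve the four requests in a single cycle where one array would need four. The last case shows the limit of the approach: requests that fall on the same array still serialize.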