1. Field of the Invention
The invention generally relates to a method for transferring data between a central processing unit (CPU) and main memory in a computer system. More specifically, the invention describes various implementations for minimizing the latency in accessing main memory by using a latency hiding mechanism.
2. Description of the Prior Art
Microprocessor speed and computing power have continuously increased due to advancements in technology. This increase in computing power depends on transferring data and instructions between a main microprocessor and the main memory at the processor speed. Unfortunately, current memory systems cannot offer the processor its data at the required rate.
The processor has to wait for the slow memory system by using wait states, thereby causing the processor to run at a much slower speed than its rated speed. This problem degrades the overall performance of the system. This trend is worsening because of the growing gap between processor speeds and memory speeds. It may soon reach a point where any performance improvements in the processor cannot produce a significant overall system performance gain. The memory system thus becomes the limiting factor to system performance.
According to Amdahl's Law, the portion of the system that cannot be improved limits the performance improvement of a system. The following example illustrates this reasoning: if 50% of a processor's time is spent accessing memory and the other 50% is spent in internal computation cycles, Amdahl's Law states that for a tenfold increase in processor speed, system performance increases only 1.82 times. Amdahl's Law states that the speedup gained by enhancing a portion of a computer system is given by the formula
$$\mathrm{Speedup} = \frac{1}{\left(1 - \mathrm{Fraction_{enhanced}}\right) + \dfrac{\mathrm{Fraction_{enhanced}}}{\mathrm{Speedup_{enhanced}}}}$$

where

Fraction_enhanced is the proportion of time the enhancement is used, and

Speedup_enhanced is the speedup of the portion enhanced compared to the original performance of that portion.

Thus, in the example, since the processor is occupied with internal computation only 50% of the time, the processor's enhanced speed can only be taken advantage of 50% of the time. Amdahl's Law, using the above numbers, then becomes

$$\mathrm{Speedup} = \frac{1}{\left(1 - 0.5\right) + \dfrac{0.5}{10}} = 1.82$$
This is because the enhancement can only be taken advantage of 50% of the time and the enhanced processor is 10 times the speed of the original processor. Calculating the speedup yields the overall performance enhancement of 1.818 times the original system performance.
If the enhanced processor is 100 times the speed of the original processor, Amdahl's Law becomes
$$\mathrm{Speedup} = \frac{1}{\left(1 - 0.5\right) + \dfrac{0.5}{100}} = 1.98$$
This means that the system performance is limited by the 50% of data accesses to and from the memory. Clearly, there is a trend of declining benefit as the speed of the processor increases vs. the speed of the main memory system.
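The declining benefit can be checked numerically. The following is a minimal sketch of the Amdahl's Law formula given above; the function name `amdahl_speedup` and the extra 1000-fold data point are illustrative assumptions, not part of the original text.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Half of the time is internal computation (enhanced); the other half is
# memory access (not enhanced), as in the example in the text.
for processor_factor in (10, 100, 1000):
    overall = amdahl_speedup(0.5, processor_factor)
    print(f"processor {processor_factor}x faster -> system {overall:.3f}x faster")
```

With half of the time spent on memory accesses, the overall speedup approaches but never reaches 2, however fast the processor becomes.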
The well known cache memory system has been used to solve this problem by moving data most likely to be accessed by the processor to a fast cache memory that can match the processor speed. Various approaches to creating a cache hierarchy consisting of a first level cache (L1 cache) and a second level cache (L2 cache) have been proposed. Ideally, the data most likely to be accessed by the processor should be stored in the fastest cache level. Typically, both Level 1 (L1) and Level 2 (L2) caches are implemented with static random access memory (SRAM) technology due to its speed advantage over dynamic random access memory (DRAM). The most crucial aspect of cache design and the problem which cache design has focused on, is ensuring that the data next required by the processor has a high probability of being in the cache system. Two main principles operate to increase the probability of finding this required data in the cache, or having a cache “hit”: temporal locality and spatial locality. Temporal locality refers to the concept that the data next required by the processor has a high probability of being required again soon for most average processor operations. Spatial locality refers to the concept that the data next required by the processor has a high probability of being located next to the currently accessed data. Cache hierarchy therefore takes advantage of these two concepts by transferring from main memory data which is currently being accessed as well as data physically nearby.
However, cache memory systems cannot fully isolate a fast processor from the slower main memory. When an address and associated data requested by the processor is not found in the cache, a cache “miss” is said to occur. On such a cache miss, the processor has to access the slower main memory to get data. These misses represent the portion of processor time that limits overall system performance improvement.
To address this cache miss problem, Level 2 cache is often included in the overall cache hierarchy. The purpose of Level 2 cache is to expand the amount of data available to the processor for fast access without increasing Level 1 cache, which is typically implemented on the same chip as the processor itself. Since the Level 2 cache is off-chip (i.e. not on the same die as the processor and Level 1 cache), it can be larger and can run at a speed between the speed of the Level 1 cache and the main memory speed. However, in order to properly make use of Level 1 and Level 2 cache and maintain data coherency between the cache memory system and the main memory system, both the cache and the main memory must be constantly updated so that the latest data is available to the processor. If the processor memory access is a read access, this means that the processor needs to read data or code from the memory. If this requested data or code is not to be found in the cache, then the cache contents have to be updated, a process generally requiring that some cache contents have to be replaced with data or code from main memory. To ensure coherency between the cache contents and the contents of main memory, two techniques are used: write-through and write-back.
The write-through technique involves writing data to both the cache and to main memory when the processor memory access is a write access and when the data being written is to be found in the cache. This technique ensures that, whichever copy of the data is accessed, either the cache contents or the main memory, the data accessed is identical.
The write-back technique involves writing data only to the cache in a memory write access. To ensure coherence between the data in the cache and the data in main memory, the cache contents of a particular cache location are written to main memory when these cache contents are about to be overwritten. However, cache contents are not written to main memory if they have not been replaced by a memory write access. To determine if the cache contents of a particular cache location have been replaced by a memory write access, a flag bit is used. If the cache contents have been replaced by a memory write access, the flag bit is set or is considered “dirty”. Thus, if the flag bit of a particular cache location is “dirty”, then the cache contents of that cache location have to be written to main memory prior to being overwritten with new data.
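The difference between the two techniques can be sketched with a toy model. The single-line cache below is a hypothetical illustration only: the class names, the dictionary standing in for main memory, and the choice to load a line into the cache on a write miss are all assumptions made for the example, not details from the text.

```python
class CacheLine:
    def __init__(self):
        self.tag = None      # main memory address currently held by the line
        self.data = None
        self.dirty = False   # flag bit: set when a write-back line is modified


class OneLineCache:
    """Hypothetical one-line cache in front of a dict standing in for main memory."""

    def __init__(self, memory, write_back):
        self.memory = memory
        self.write_back = write_back
        self.line = CacheLine()
        self.memory_writes = 0   # number of writes that reached main memory

    def _write_memory(self, addr, value):
        self.memory[addr] = value
        self.memory_writes += 1

    def _fill(self, addr):
        # A dirty line must be copied to main memory before it is overwritten.
        if self.line.dirty:
            self._write_memory(self.line.tag, self.line.data)
            self.line.dirty = False
        self.line.tag = addr
        self.line.data = self.memory.get(addr, 0)

    def read(self, addr):
        if self.line.tag != addr:          # cache miss: replace the line
            self._fill(addr)
        return self.line.data

    def write(self, addr, value):
        if self.line.tag != addr:          # assumed policy: allocate on write miss
            self._fill(addr)
        self.line.data = value
        if self.write_back:
            self.line.dirty = True         # defer the main memory update
        else:
            self._write_memory(addr, value)  # write-through: update memory now


def run(write_back):
    memory = {0: 0, 1: 0}
    cache = OneLineCache(memory, write_back)
    for value in (1, 2, 3):
        cache.write(0, value)
    cache.read(1)                          # forces address 0 out of the cache
    return memory[0], cache.memory_writes


print("write-through:", run(write_back=False))
print("write-back:   ", run(write_back=True))
```

In this sketch both policies leave the same final value in main memory, but write-back reaches main memory only once, when the dirty line is replaced, whereas write-through does so on every write.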
Another approach for increasing the cache hit rate is by increasing its associativity. Associativity refers to the number of lines in the cache which are searched (i.e. checked for a hit) during a cache access. Generally, the higher the associativity, the higher the cache hit rate. A direct mapped cache system has a 1:1 mapping whereby during a cache access, only one line is checked for a hit. At the other end of the spectrum, a fully associative cache is typically implemented using a content addressable memory (CAM) whereby all cache lines (and therefore all cache locations) are searched and compared simultaneously during a single cache access. Various levels of associativity have been implemented.
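The effect of associativity on hit rate can be illustrated with a small simulation. This is a sketch under stated assumptions: addresses are treated as cache line numbers, the set is chosen by a simple modulo, and least-recently-used replacement is assumed; none of these choices come from the text, and `count_hits` is a hypothetical helper.

```python
from collections import OrderedDict


def count_hits(addresses, num_lines, ways):
    """Count hits in a cache of num_lines lines grouped into sets of `ways` lines."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        lines = sets[addr % num_sets]       # only this set is searched
        if addr in lines:
            hits += 1
            lines.move_to_end(addr)         # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[addr] = True
    return hits


# Two addresses that map to the same line in an 8-line direct mapped cache.
pattern = [0, 8, 0, 8, 0, 8]
for ways in (1, 2, 8):                      # direct mapped, 2-way, fully associative
    print(f"{ways}-way: {count_hits(pattern, num_lines=8, ways=ways)} hits")
```

For this particular pattern the direct mapped cache never hits, because the two addresses keep displacing each other, while the 2-way and fully associative configurations hit on every access after the first two.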
Despite these various approaches to improving cache performance aimed at ultimately improving overall system performance, it should be noted that cache performance can only be improved up to a point by changing its parameters such as size, associativity, and speed. This approach of focusing on improving the cache system or the fast memory of the system rather than trying to improve the slower main memory, eventually reaches a saturation point—any further attempts at improving overall system performance through cache system improvements will generate decreasing levels of system performance improvement. Conceivably, main memory performance could be eliminated as a factor in overall system performance if the cache is made as large as main memory, but this would be prohibitively expensive in terms of silicon chip area. As a result, what is needed is a way of obtaining maximum system performance with a minimum sized cache.
This speed mismatch between processors and main memory has recently been exacerbated by new software applications such as multimedia which depend heavily on main memory performance. Unfortunately, main memory performance is limited by the frequent random data access patterns of such applications. Cache systems are therefore less effective when used with these applications.
To alleviate the speed mismatch between processors and main memory, numerous attempts at improving main memory performance have been carried out. These have yielded some improvements in main memory speed. Early improvements to DRAM involved getting multiple bits out of the DRAM per access cycle (nibble mode, or wider data pinout), internally pipelining various DRAM operations, or segmenting the data so that some operations would be eliminated for some accesses (page mode, fast page mode, extended data out (EDO) mode).
Page mode involves latching a row address in the DRAM and maintaining it active, thereby effectively enabling a page of data to be stored in the sense amplifiers. Unlike in page mode, where column addresses are then strobed in by the Column Address Strobe signal CAS\, in fast page mode the column address buffers are activated as soon as the Row Address Strobe signal RAS\ is activated, and act as transparent latches, allowing the internal column data fetch to occur before the column address strobe. The enabling of the data output buffer is then accomplished when CAS\ is activated. These different page modes are therefore faster than pure random access mode, since staying on the same row eliminates the row address activation time required for accessing new rows.
Subsequent improvements were realized through extended data out (EDO) mode and burst EDO mode. Burst EDO mode allows a page of sequential data to be retrieved from the DRAM without having to provide a new address on every cycle. However, it should be noted that while burst EDO mode is useful for graphics applications which require pages of sequential information, it is less useful for main memory applications, which must still fully support random access.
Although such improvements in DRAM designs offer higher bandwidth access, they suffer from the following problems: processors cannot fully utilize the new DRAM higher bandwidth because some scattered memory accesses do not map in the same active row, thereby obviating gains from using fast page mode; although new DRAM designs may have several banks, they are not in sufficient numbers for a typical processor environment with scattered memory accesses to have high page hit rates; current processors and systems use large caches (both first and second level) that intercept memory accesses to the DRAM thereby reducing the locality of these accesses—this further scatters the accesses and consequently further reduces page hit rates.
The inability of cache systems to improve system performance has motivated further efforts to improve the performance of the main DRAM memory system. One of these efforts yielded the synchronous DRAM (SDRAM). SDRAM uses multiple banks and a synchronous bus to provide a high bandwidth for accesses which use the fast page mode. With multiple SDRAM banks, more than one active row can supply the processor with fast accesses from different parts of memory. However, for fast page mode to be used, these accesses have to be in an active row of a bank. Furthermore, relying solely on accessing multiple banks to increase memory bandwidth results in an overall limitation based on the number of banks that the memory can be divided into.
In general, a limited number of banks, external cache systems which intercept accesses to already activated rows in main memory and poor spatial localities of the accessed data all contribute to limiting the performance gain from the SDRAM.
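The dependence of the page hit rate on the number of banks can be sketched with a simple open-row model. The model is an assumption made for illustration: each bank keeps exactly one active row, rows are assigned to banks by a modulo, and the workload consists of eight interleaved streams that each stay within their own row. `page_hit_rate` is a hypothetical helper, not an SDRAM interface.

```python
def page_hit_rate(row_addresses, num_banks):
    """Fraction of accesses that find their row already active in its bank."""
    open_row = {}                        # bank index -> currently active row
    hits = 0
    for row in row_addresses:
        bank = row % num_banks           # assumed row-to-bank mapping
        if open_row.get(bank) == row:
            hits += 1                    # fast page mode access is possible
        else:
            open_row[bank] = row         # a new row must be activated first
    return hits / len(row_addresses)


# Eight streams, each confined to its own row, accessed in round-robin order.
accesses = [i % 8 for i in range(800)]
for banks in (1, 4, 8):
    print(f"{banks} bank(s): page hit rate {page_hit_rate(accesses, banks):.2f}")
```

In this model, with fewer banks than active streams, rows that share a bank keep displacing each other and no access finds its row active; only when every stream has its own bank does the hit rate become high.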
Another effort yielded the Cache DRAM (CDRAM). This design incorporates an SRAM-based cache inside the DRAM. Large blocks of data can thus be transferred from the cache to the DRAM array or from the DRAM to cache in a single clock cycle. However, this design suffers from problems of low cache hit rate inside the DRAM caused by the external intercepting caches, and poor data localities. It also adds complexity to the external system for controlling and operating the internal cache by requiring a cache tag, a comparator and a controller. In addition, there is a significant cost in terms of die area penalty for integrating SRAM cache with a DRAM in a semiconductor manufacturing process optimized for DRAM.
Newer designs merge the processor and DRAM, thereby eliminating the intercepting cache problem and exposing the full DRAM bandwidth to the processor. This approach increases system complexity, mixes slow and fast technology, limits the space for the processor, and cannot fully utilize the high DRAM bandwidth because of the nature of scattered memory accesses used by the current programming model.
The new Virtual Channel DRAM design from NEC uses 16 fully associative channels, implemented with fast SRAM, to track multiple code and data streams in use by various sources. Essentially, Virtual Channel DRAM represents an extension of the page mode concept in which the one-bank/one-page restriction is removed. As a result, a number of channels (or pages) can be opened within a bank independently of other channels. A CPU can, for example, access up to 16 1k channels randomly allocated within a Virtual Channel DRAM bank. As a result, memory traffic between multiple devices can be sustained without causing repeated page allocation conflicts. However, the Virtual Channel Memory requires that the CPU track the main memory location corresponding to each channel, thereby complicating the CPU's controlling function. In addition, the CPU requires a predictive scheme for effective prefetching of data to the channels. Virtual Channel DRAM uses fast page mode to transfer data to channels. Finally, like the Cache DRAM, VC DRAM is expensive due to the additional die area consumed by the associative buffers. In addition, the amount of cache provided may not be appropriate for some applications because the cache/DRAM ratio is usually fixed. For example, when main memory is upgraded, the additional cache may not be necessary, so the system cost is unnecessarily high.
Recently, software-based solutions have also been proposed, such as using a software compiler to re-map physical memory addresses in order to maximize DRAM bandwidth. While this is useful for specific applications that have predictable behaviour, it requires changing software, thereby causing compatibility problems. These efforts use a high level approach whereby the source code of an application is revised to tailor the software to the hardware. Not only is this approach expensive and time consuming, it is also not applicable to all software applications.
From the above, what is therefore needed is a solution based on a simplified memory control mechanism, using a simple, cost effective standard DRAM for main memory, requiring the minimum of hardware, and not requiring extensive software rewrites or a complex addressing scheme. Such a solution should ideally take advantage of both temporal and spatial localities. Not only should recently accessed data be readily accessible but data adjacent in location to such recently accessed data should also be readily accessible.