1. Technical Field
The invention relates generally to the field of memory architectures, and more specifically to low-power cache designs.
2. Background Art
In microprocessor design, the key factor limiting performance is typically the cycle time of data accesses to main memory. State-of-the-art dynamic random access memory (DRAM) typically has access times on the order of 60-80 nanoseconds, with data rates on the order of 40 nanoseconds in page mode or other pipelined access modes. Even at this speed, it takes far longer to access data than it does to process instructions through the CPU.
In order to address this problem, cache memory has been used to store data for processor operations. Typically, data is downloaded from the DRAM main memory to the cache. The cache is typically made up of an array of static RAM cells (SRAMs can be accessed at rates far faster than those of DRAMs; current state of the art SRAMs can produce data rates on the order of 5 nanoseconds). There are a number of known branch prediction or initial cache loading algorithms that determine how the cache is to be initially loaded with data.
The cache is then accessed by determining whether or not it is storing the data that is needed by the processor for a particular operation. This is done by comparing address data (referred to as the "tag") that indicates the location in main memory from which data is to be obtained, with the tag corresponding to the data as stored in the cache. If the incoming tag matches the stored tag for data in the cache, a cache "hit" signal is generated to indicate that the desired data is stored in the cache. Note that the tag protocol can be set up to be "direct mapped" (i.e., each tag corresponds to one line of data stored in the cache) or "set associative" (i.e., a single "index address," or address for a single set, corresponds to a given number of lines of data stored in the cache, each having an associated tag). Thus, for example, in a "4-way set associative" cache, a single index address corresponds to four lines of data in the cache, each having its own tag.
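The tag comparison described above can be illustrated by the following minimal sketch. It is offered for illustration only and is not taken from any of the cited designs. The line size, the number of sets, and the names `Line`, `split_address`, `make_cache`, and `lookup` are all assumptions chosen for the example, and validity handling is omitted here. Setting `WAYS` to 1 would model the direct mapped case.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Illustrative parameters only; none of these values come from the text.
LINE_BYTES = 16   # assumed bytes per cache line
NUM_SETS = 64     # assumed number of index addresses (sets)
WAYS = 4          # 4-way set associative, as in the example in the text


@dataclass
class Line:
    tag: Optional[int] = None      # stored tag; None means nothing loaded yet
    data: Optional[bytes] = None   # cached line contents


def split_address(addr: int) -> Tuple[int, int, int]:
    """Split a main memory address into (tag, index, offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset


def make_cache() -> List[List[Line]]:
    """Build an empty cache: NUM_SETS sets, each holding WAYS lines."""
    return [[Line() for _ in range(WAYS)] for _ in range(NUM_SETS)]


def lookup(cache: List[List[Line]], addr: int) -> Optional[int]:
    """Return the way that hits for addr, or None on a cache miss."""
    tag, index, _ = split_address(addr)
    # One index address selects a set; every line in the set has its own tag.
    for way, line in enumerate(cache[index]):
        if line.tag is not None and line.tag == tag:
            return way   # incoming tag matches stored tag: cache "hit"
    return None          # no stored tag matched: cache "miss"
```

In this sketch two addresses that share an index but differ in tag select the same set, and only the one whose tag is stored produces a hit.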
Typically, caches also use data indicating the validity of the data stored in the cache. In general, all caches follow some sort of cache coherency protocol, to ensure that the data stored in the cache is the same as data stored in main memory. Thus, for example, after the data as stored in the cache is updated, it must be written into main memory. In a "copy back cache" coherency protocol, whenever data is written to cache, it is written from the cache to main memory at some later time when there is no other traffic on the bus, using any one of a number of known algorithms. In a "write through" cache coherency protocol, when new data is written to the cache, it is written to main memory in the next cycle, suspending other conflicting bus operations. This priority write-back operation ensures that for a given common address the main memory and cache store the same data. In the "copy back" mode, if a cache entry contains data at the accessed address that is not the same as the data in main memory for that address, either the cache or the main memory must be updated, depending on which is storing valid data. In practice, "bus snooping" and other techniques can be used to determine the source of data in the main memory. If data was written to main memory from an external source, chances are that it is the data in the cache that is invalid. Rather than updating the cache entry with the correct data from main memory, a data bit (commonly referred to as a "data valid" or "validity" bit) is switched (e.g. from a 1 to a 0 state) to indicate that the cache does not store valid data at that address, and at some convenient time valid data will be written from main memory to the cache.
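The difference between the "write through" and "copy back" protocols, and the use of the validity bit on an external write, can be sketched as follows. This is a simplified software model under stated assumptions, not a description of any particular hardware. The class `TinyCache`, its method names, and the "dirty" flag used to remember which entries still await copy back are hypothetical. Main memory is modeled as a dictionary, and `flush` stands in for the later time at which the bus is idle.

```python
class TinyCache:
    """Toy model; each entry is a list [value, valid, dirty] keyed by address."""

    def __init__(self, memory, write_through):
        self.memory = memory                # dict modeling main memory
        self.write_through = write_through  # True: write through, False: copy back
        self.lines = {}

    def write(self, addr, value):
        if self.write_through:
            # Write through: main memory is updated along with the cache.
            self.memory[addr] = value
            self.lines[addr] = [value, True, False]
        else:
            # Copy back: only the cache is updated now; the entry is marked
            # dirty (an assumption of this sketch) for a later write to memory.
            self.lines[addr] = [value, True, True]

    def flush(self):
        """Model the later, idle-bus copy back of pending entries."""
        for addr, line in self.lines.items():
            value, valid, dirty = line
            if valid and dirty:
                self.memory[addr] = value
                line[2] = False

    def snoop_external_write(self, addr, value):
        """An external source wrote main memory: clear the validity bit
        instead of updating the cache entry."""
        self.memory[addr] = value
        if addr in self.lines:
            self.lines[addr][1] = False
            self.lines[addr][2] = False

    def read(self, addr):
        line = self.lines.get(addr)
        if line is not None and line[1]:
            return line[0]                  # entry present and valid
        # Missing or invalid entry: fetch valid data from main memory.
        value = self.memory[addr]
        self.lines[addr] = [value, True, False]
        return value
```

For example, after `write(0x10, 2)` on a copy back instance, main memory still holds the old value until `flush()` is called, whereas a write through instance updates main memory immediately.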
The state of the validity bits is used to determine whether or not data must be fetched from main memory. That is, even if the tags match, if the data is not valid in the cache it must be fetched from main memory. The state of the validity bits is also used to determine where the data from main memory should be stored. When data is stored in the cache from main memory, the status of the validity bits at the accessed set will be checked, with the data from main memory being written to whichever set is storing invalid data. Another method used to determine where fetched data is to be stored is the so-called "least recently used" protocol, in which cache data is replaced based on some sort of determination of what data in the cache has been unused for the longest time (and hence is least likely to be needed for future cache accesses). However, as pointed out in U.S. Pat. No. 4,811,209, "Cache Memory With Multiple Valid Bits For Each Data Indication the Validity Within Different Contents," (issued 3/89 to Rubinstien and assigned to Hewlett Packard), conventional LRU algorithms tend to be difficult to implement in hardware and can be subject to long execution cycles, so it is more common to simply set up a priority system wherein if all the lines of data (or "cache lines") in a given set are valid, the "first" cache line will always be replaced, followed by the second, etc.
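The two replacement approaches discussed above can be contrasted in a short sketch. Both functions first prefer a line whose validity bit is cleared. The priority version then reads "the first line, followed by the second, etc." as a rotating pointer, which is one possible interpretation and an assumption of this example. The LRU version assumes a per-line record of the last access time. The function and parameter names are hypothetical.

```python
def choose_victim_priority(valid, next_way):
    """Pick the line to replace under a simple priority scheme.

    valid    -- list of validity bits (True = valid) for one set
    next_way -- rotating pointer used only when every line is valid
    Returns (victim_way, updated_next_way).
    """
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way, next_way          # an invalid line is replaced first
    # All lines valid: replace in fixed order, first, then second, and so on.
    return next_way, (next_way + 1) % len(valid)


def choose_victim_lru(valid, last_used):
    """Pick the line to replace under a least recently used scheme.

    last_used -- per-line time of the most recent access (assumed bookkeeping)
    """
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way                    # an invalid line is replaced first
    # All lines valid: replace the line that has gone unused the longest.
    return min(range(len(valid)), key=lambda way: last_used[way])
```

The priority scheme needs only a small pointer per set, while the LRU scheme must record and compare access history for every line, which is consistent with the hardware cost noted above.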
Typically, caches are implemented by a first chip that stores the index data, and separate registers (e.g. on the memory management unit or the bus master) that store the corresponding tag data, validity bits, and LRU information. That is, even though the tag and index data are accessible at the same address, they are typically located in different arrays. See for example U.S. Pat. No. 5,014,240, "Semiconductor Memory Device," (issued 5/91 to Suzuki and assigned to Fujitsu Ltd.), particularly FIGS. 2-4. However, there have been cache designs proposed in which the cache, tag, and validity information are all stored at the same physically addressable locations. Such designs are advantageous in that they reduce the total silicon area devoted to cache storage, reduce unnecessary replication of support circuits such as address buffers, address decoders, etc. needed to support data stored in physically different arrays, and remove design overhead from other microprocessor components such as bus controllers or memory managers. See e.g. FIG. 1 of the above-cited U.S. Pat. No. 5,014,240; U.S. Pat. No. 4,714,990, "Data Storage Apparatus" (issued 12/87 to Desyllas et al. and assigned to International Computers Ltd.); and U.S. Pat. No. 4,914,582, "Cache Tag Lookaside," (issued 4/90 to Bryg et al. and assigned to Hewlett-Packard Co.).
However, the designs set forth in the above patents, in which both the data and the associated "access" information (i.e. information that is used to determine whether or not the index data is to be accessed, such as the tag address, the validity data, the LRU data, etc.) are stored at the same physically addressable location, do not optimize the cache architecture for high performance. Typically, when the index address indicates that a cache line is to be accessed, the line is read (that is, the line driver circuits are enabled to access the cells, the sense amps are set to detect the data, etc.). Then, if the access information indicates that data is not to be accessed during that given cycle, the data input/output drivers are inhibited. Thus, for example, if the incoming tag address is not the same as one of the tag address(es) associated with the accessed cache index, then a cache "miss" is generated and the cache access is terminated. The problem with this approach is the extra power that has been consumed in preparing the data for access. In workstation and other high-end computer applications this extra power overhead can be tolerated. However, in low-power applications such as low-power personal computer applications, laptops, personal digital assistants, and the like, this power overhead becomes prohibitive.
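The read-then-inhibit sequence described above can be modeled by counting how often the data cells are read without the data being used. The following is a behavioral sketch only. It assumes a direct mapped cache holding one value per line, and the class `ReadThenInhibitCache` and its counters are hypothetical. It counts array activations and does not estimate actual power.

```python
class ReadThenInhibitCache:
    """Behavioral model: the line is always read, then inhibited on a miss."""

    def __init__(self, num_lines):
        self.tags = [None] * num_lines   # stored tag per line (None = empty)
        self.data = [None] * num_lines   # stored data per line
        self.data_array_reads = 0        # times the data cells were read
        self.wasted_reads = 0            # reads whose result was discarded

    def fill(self, addr, value):
        """Load a line, as if fetched from main memory."""
        index = addr % len(self.tags)
        self.tags[index] = addr // len(self.tags)
        self.data[index] = value

    def access(self, addr):
        index = addr % len(self.tags)
        tag = addr // len(self.tags)
        # Step 1: the line is read regardless of the access information
        # (line drivers enabled, sense amps set).
        self.data_array_reads += 1
        sensed = self.data[index]
        # Step 2: the incoming tag is compared with the stored tag.
        if self.tags[index] == tag:
            return sensed                # hit: the sensed data is driven out
        # Step 3: miss, so the output drivers are inhibited; the read in
        # step 1 consumed power to no purpose.
        self.wasted_reads += 1
        return None
```

In this model every miss adds one to `wasted_reads`, so the overhead grows directly with the miss count of the access pattern.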
Another shortcoming in the prior integrated cache designs is that they do not provide a true least recently used mode for updating the cache with data from main memory. Ideally the LRU hardware implementation should be integrated with the cache design to optimize the hit rate of the cache.
Typically, integrated caches are laid out as separate functional units (or macros) on the microprocessor chip. Because the CPU and other functional blocks on the microprocessor receive signals from external signal sources, they must support interconnection pads that are bonded to external pins when the chip is packaged. As caches increase in size, they take up more space on the microprocessor. As such, they tend to constrain the metallization patterns from the CPU, etc. to the interconnection pads, because the cache data could be disturbed by noise and/or capacitive coupling generated from overlying metallization patterns transmitting high frequency signals from external signal sources to the CPU, etc.
Accordingly, a need has developed in the art for an integrated cache design in which the benefits of storing the data and associated access information in the same physically addressable array can be realized without the attendant power overhead.