As the operating frequencies of microprocessors, integrated circuit (IC) memory, and other integrated circuitry continue to increase in conjunction with continually increasing integration scale and decreasing device feature sizes, the power consumption of ICs, and means for reducing that consumption, are moving to the forefront of IC design. Of course, power consumption and reduction are issues with mobile IC-based devices, such as laptop computers, cell phones, PDAs, etc., that utilize batteries, but they are also issues of concern to devices that draw their power directly from the utility power grid.
Most of the power-reduction techniques implemented in IC-based devices to date are generally directed to reducing active power consumption by systematically reducing the power provided to these devices during times when full power is not needed. For example, many IC-based devices typically have one or more reduced-power, or standby, modes, such as sleep, nap, doze, and hibernate modes, among others. However, in today's deep sub-micron technologies, standby power consumption itself is becoming a larger problem due to gate-tunneling and sub-threshold currents.
Various techniques have been implemented to reduce power consumption at the IC circuitry component level. For example, in the context of cache memory, the timing of the memory access is manipulated so as to reduce power consumption. The benefit of reduced power consumption, however, is realized at a slight cost to the speed of the cache memory. To illustrate, FIG. 1A shows a simple conventional two-way set-associative cache memory system 100 that includes a cache memory 104 partitioned into two banks, or ways 108A-B, each having 256 corresponding respective cache lines 112A-B that each contain thirty-six-bit words 116A-D. Generally, each cache line 112A-B contains a block of words transferred between a main memory (not shown) and cache memory 104 to take advantage of spatial locality. Cache memory 104 will store (as a function of cache storage rules not discussed herein) data or addresses for a subset of the total main memory. Cache memory system 100 also includes a tag directory 120 that will store the addresses for the data in cache memory 104. The contents of cache memory 104 are accessed as a function of an incoming address, e.g., address 124, received from outside memory system 100, e.g., from a microprocessor or microcontroller (not shown).
In this example, incoming address 124 is 32 bits long and is divided into the following parts: the two least-significant bits 124A select one of the four bytes in a particular word 116A-D; the next two bits 124B select one of the four words 116A-D within a particular cache line 112A-B; the fourth through the eleventh bits 124C (“cache line address bits”) select a particular cache line 112A-B within cache memory 104; and the upper twenty bits 124D form a “tag” that is used in the cache retrieval process as described below. The lower twelve bits, i.e., bits 124A-C, of incoming address 124 are directly mapped from main memory into cache memory 104. The remaining 20 bits, i.e., tag bits 124D, of incoming address 124 are used to determine if a specific address has been stored in cache memory 104. The particulars of set-associative cache systems are well known and, therefore, are not described herein. However, in general, set-associative cache systems, such as system 100 illustrated, allow multiple addresses having the same physical address (i.e., addresses of the lower twelve bits 124A-C) to be stored. In the two-way example of FIG. 1A, two identical addresses can be stored, one in way 108A and one in way 108B.
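The address partition described above can be sketched in a few lines of Python. This is only an illustration of the bit fields, assuming bits are numbered from zero at the least-significant end; the function name `split_address` and the sample address are hypothetical and do not appear in the figures.

```python
def split_address(addr: int) -> dict:
    """Split a 32-bit incoming address into its four fields."""
    if not 0 <= addr < 2**32:
        raise ValueError("address must fit in 32 bits")
    return {
        "byte": addr & 0x3,             # bits 0-1   (bits 124A): byte within a word
        "word": (addr >> 2) & 0x3,      # bits 2-3   (bits 124B): word within a cache line
        "line": (addr >> 4) & 0xFF,     # bits 4-11  (bits 124C): one of 256 cache lines
        "tag": (addr >> 12) & 0xFFFFF,  # bits 12-31 (bits 124D): twenty-bit tag
    }


# Hypothetical sample address, chosen only to exercise every field.
fields = split_address(0x12345678)
print(fields)  # {'byte': 0, 'word': 2, 'line': 103, 'tag': 74565}
```

The eight line bits give 2^8 = 256 possible values, which matches the 256 cache lines per way.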
Generally, an access to cache memory 104 is initiated when a clock cycle captures incoming address 124 for use with tag directory 120 and the cache memory. Tag directory 120 receives the eight cache-line-address bits 124C of incoming address 124 and then outputs, from among the plurality of tags 128 stored in the tag directory, the two twenty-bit tags TAG-A, TAG-B corresponding to the cache-line address expressed by the cache-line-address bits. Of course, tags TAG-A, TAG-B are from corresponding tag sets 130A-B that correspond respectively to ways 108A-B of cache memory 104. Tags TAG-A, TAG-B feed from tag directory 120 into a comparator 132 that compares each of tags TAG-A, TAG-B to tag bits 124D of incoming address 124 to determine whether there is a match between the incoming tag bits and either of tags TAG-A, TAG-B. Essentially, comparator 132 determines if the data being sought via incoming address 124 is stored in cache memory 104.
A match of tag bits 124D to one of tags TAG-A, TAG-B means that the data sought by incoming address 124 is stored in cache memory 104 and there is a “cache hit.” Correspondingly, comparator 132 identifies via ASELECT and BSELECT signals which one of ways 108A-B contains the data. That is, if tag bits 124D match tag TAG-A, ASELECT signal goes high while BSELECT signal remains low. Alternatively, if tag bits 124D match tag TAG-B, BSELECT signal goes high while ASELECT signal remains low. On the other hand, if tag bits 124D do not match either of tags TAG-A, TAG-B, then the data is not stored in cache memory 104 and there is a “cache miss.”
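The tag lookup and comparison just described might be modeled as follows. This is a minimal behavioral sketch, not a description of the actual circuitry: the class `TagDirectory`, the function `compare`, and the use of `None` to mark an empty tag slot are all assumptions made for illustration (the text does not say how empty entries are represented).

```python
NUM_LINES = 256  # cache lines per way, as in the example of FIG. 1A


class TagDirectory:
    """Holds one tag per cache line for each of the two ways."""

    def __init__(self):
        # Assumption: None marks a slot that holds no tag yet.
        self.tags_a = [None] * NUM_LINES
        self.tags_b = [None] * NUM_LINES

    def lookup(self, line: int):
        """Return (TAG-A, TAG-B) for the given cache-line address."""
        return self.tags_a[line], self.tags_b[line]


def compare(tag_bits: int, tag_a, tag_b):
    """Return (ASELECT, BSELECT); both False models a cache miss."""
    a_select = tag_a is not None and tag_a == tag_bits
    # Assumption: a tag is held in at most one way, so A takes priority.
    b_select = (not a_select) and tag_b is not None and tag_b == tag_bits
    return a_select, b_select


directory = TagDirectory()
directory.tags_b[0x67] = 0x12345  # hypothetical contents

tag_a, tag_b = directory.lookup(0x67)
print(compare(0x12345, tag_a, tag_b))  # (False, True): hit in way B
print(compare(0x54321, tag_a, tag_b))  # (False, False): miss
```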
In parallel with tag directory 120 receiving cache-line-address bits 124C, cache memory 104 receives the cache-line-address bits, as well as word-select bits 124B (and, optionally, byte-select bits 124A) of incoming address 124, and subsequently outputs to a 2:1 multiplexer 136 the two 36-bit words (or optionally two bytes) DATA-A, DATA-B, i.e., one word (or byte) DATA-A from way 108A and one word (or byte) DATA-B from way 108B, corresponding to the cache lines 112A-B identified by cache-line-address bits 124C. If there is a cache hit, 2:1 multiplexer 136 will output either data DATA-A or data DATA-B as DATA-OUT, depending on which of ASELECT and BSELECT signals is high. Because tag directory 120 contains fewer bits than cache memory 104, its physical size is much smaller than the cache memory and, hence, it can be accessed faster than the cache memory.
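As a rough behavioral illustration of the data path, the sketch below reads one word from each way and then selects between them. The names `read_both_ways` and `mux_2to1`, the list-of-lists storage, and the sample data value are hypothetical; returning `None` on a miss is likewise an assumption, since the text does not state what the multiplexer drives when neither select signal is high.

```python
NUM_LINES = 256
WORDS_PER_LINE = 4


def read_both_ways(way_a, way_b, line: int, word: int):
    """Read DATA-A and DATA-B from the same line and word position of each way."""
    return way_a[line][word], way_b[line][word]


def mux_2to1(data_a, data_b, a_select: bool, b_select: bool):
    """Pass DATA-A or DATA-B through as DATA-OUT according to the select signals."""
    if a_select and b_select:
        raise ValueError("at most one select signal may be high")
    if a_select:
        return data_a
    if b_select:
        return data_b
    return None  # assumption: nothing is selected on a cache miss


# Hypothetical cache contents: every word is zero except one 36-bit value.
way_a = [[0] * WORDS_PER_LINE for _ in range(NUM_LINES)]
way_b = [[0] * WORDS_PER_LINE for _ in range(NUM_LINES)]
way_b[0x67][2] = 0xABCDEF012

data_a, data_b = read_both_ways(way_a, way_b, line=0x67, word=2)
data_out = mux_2to1(data_a, data_b, a_select=False, b_select=True)
print(hex(data_out))  # 0xabcdef012
```

Note that both ways are read regardless of which one is ultimately selected.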
Referring to FIG. 1B, and also to FIG. 1A, FIG. 1B shows a timing diagram 140 illustrating the timing of various signals within cache memory system 100 of FIG. 1A for parallel access of tag directory 120 and cache memory 104. Such timing allows the smaller tag directory 120 to fetch tags TAG-A, TAG-B, and comparator 132 to compare tag bits 124D of incoming address 124 to tags TAG-A, TAG-B so as to activate either ASELECT or BSELECT signal, prior to cache memory 104 providing data DATA-A, DATA-B to multiplexer 136. In particular, this is illustrated by tag TAG-A/TAG-B signals 144 (activated in response to edge 148A of a clock signal 148 and address tag signals 152 of address bits 124D of incoming address A1) and an ASELECT/BSELECT signal 156 corresponding to one of ASELECT and BSELECT signals going high, both activating prior to data DATA-A/DATA-B signals 160 activating. After a delay caused by multiplexer 136, data-out signals 164 corresponding to either data DATA-A or data DATA-B are output by the multiplexer.
In this manner, the tag lookup and matching functions performed by tag directory 120 and comparator 132 can be accomplished with a minimum latency penalty to cache memory 104. The penalty for this architecture, however, is the power consumed by activating and accessing both of ways 108A-B of cache memory 104 to retrieve the desired data, i.e., either data DATA-A or data DATA-B. In order to save active power, some conventional architectures have waited for the access to tag directory 120 to complete before accessing only the desired bank, in this case way 108A or way 108B. This was done because, as mentioned above, power saving measures were focused on reducing active power consumption, which was the biggest problem in older technologies. In today's deep sub-micron technologies, however, standby power consumption caused by gate-tunneling and sub-threshold currents is becoming a bigger problem.
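The active-power trade-off between the two access orders can be illustrated by simply counting how many ways are activated. The sketch below is an assumption-laden tally, not a power model: the function `data_way_reads` and the hit and miss counts are hypothetical, and it assumes that a tag-first access reads no way at all on a miss. It also ignores the added latency of waiting for the tag comparison.

```python
def data_way_reads(hits: int, misses: int, serial: bool) -> int:
    """Count way activations in a two-way cache for a given access order."""
    if hits < 0 or misses < 0:
        raise ValueError("counts must be non-negative")
    if serial:
        # Tag directory is consulted first; only the matching way is read.
        # Assumption: a miss activates no way.
        return hits
    # Parallel access: both ways are read on every access, hit or miss.
    return 2 * (hits + misses)


# Hypothetical workload of 100 accesses with 90 hits.
print(data_way_reads(hits=90, misses=10, serial=False))  # 200
print(data_way_reads(hits=90, misses=10, serial=True))   # 90
```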