The present invention relates generally to the field of processors and, in particular, to a method of reducing power consumption in a copy-back data cache by inspection of a global modified indicator.
Microprocessors perform computational tasks in a wide variety of applications, including embedded applications such as portable electronic devices. The ever-increasing feature set and enhanced functionality of such devices require ever more computationally powerful processors to provide additional functionality via software. Another trend in portable electronic devices is an ever-shrinking form factor. One impact of this trend is the decreasing size of the batteries used to power the processor and other electronics in the device, making power efficiency a major design goal. Hence, processor improvements that increase execution speed and reduce power consumption are desirable for portable electronic device processors.
Many programs are written as if the computer executing them had a very large (ideally, unlimited) amount of fast memory. Most modern processors simulate that ideal condition by employing a hierarchy of memory types, each having different speed and cost characteristics. The memory types in the hierarchy commonly vary from very fast and very expensive at the top, to progressively slower but more economical storage types in lower levels. Due to the spatial and temporal locality characteristics of most programs, the instructions and data in use at any given time are statistically likely to be needed again in the very near future, and may be advantageously retained in the upper, high-speed hierarchical levels, where they are readily available. As code progresses and/or branches to new areas, the necessary instructions and data may be loaded from the lower memory hierarchy levels into the upper levels. While this movement of instructions and data between memory hierarchy levels incurs some performance degradation and may require complex hardware and software management, the overall result is a net increase in memory performance over using only the slow memory types, with considerable cost savings as compared to using only the fast memory types.
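The net benefit described above can be illustrated with the standard average-access-time relation (hit time plus miss rate times miss penalty). The following Python sketch is illustrative only; the function name and all latency and miss-rate figures are hypothetical assumptions, not values taken from this disclosure.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time of a two-level hierarchy, in cycles.

    hit_time     -- cycles to access the fast upper level
    miss_rate    -- fraction of accesses not satisfied by the upper level
    miss_penalty -- extra cycles to fetch from the slow lower level on a miss
    """
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures: 1-cycle upper level, 100-cycle lower level, 3% misses.
with_hierarchy = average_access_time(hit_time=1, miss_rate=0.03, miss_penalty=100)
slow_memory_only = 100  # every access pays the full lower-level latency
```

Under these assumed figures, the average access costs roughly 4 cycles rather than 100, even though only a small fraction of the data resides in the fast level at any time.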
A representative processor memory hierarchy may comprise an array of General Purpose Registers (GPRs) in the processor core as the top level. These are the fastest memory—in many cases employing both edges of the clock, and hence able to both write and read data in a single cycle. Constructed from gates on the processor die, GPRs are expensive in terms of silicon area, power consumption and the overhead they impose in terms of routing, clock distribution and the like.
Processor registers may be backed by one or more on-chip cache memories, which comprise the primary instruction and data storage structures for active code; for example, hit rates in many instruction caches may reach 97-98%. On-chip caches (also known in the art as Level-1 or L1 caches) are expensive for the same reasons discussed above with respect to GPRs. However, caches may be implemented as DRAM structures, achieving a much higher density and hence lower cost per bit than GPRs. Separate caches may be dedicated to storing instructions and data, and the data caches may be managed according to a variety of strategies, as discussed further herein.
Depending on the implementation, a processor may include one or more off-chip, or L2, caches. L2 caches are often implemented in SRAM for fast access times, and to avoid the performance-degrading refresh requirements of DRAM. Below all the caches is main memory, usually implemented in DRAM for maximum density and hence lowest cost per bit. The main memory may be backed by hard disk storage, which is generally implemented on magnetic media accessed via mechanically actuated sensors, and hence is extremely slow compared to the electronic access of higher levels of the memory hierarchy. The disks may further be backed by tape or CD, comprising magnetic or optical media, respectively. Most portable electronic devices have limited, if any, disk storage and no tape/CD backup, and hence main memory (often limited in size) is the lowest level of the memory hierarchy.
In a computer memory hierarchy, each lower level maintains a full (but possibly stale) copy of the data resident in higher levels. That is, the data stored in higher levels replicates that in the lower levels. Changes to data stored in the upper levels of the memory hierarchy must be propagated down to the lower levels. Changes to the GPRs are expressly propagated to caches by STORE instructions; changes to the caches are automatically propagated to main memory under the direction of a cache controller.
In general, two approaches to propagating modifications of cached data to main memory have developed in the art: write-through and copy-back. In a write-through cache, when a processor writes modified data to its L1 cache, it additionally writes the modified data to main memory (any intervening caches are omitted for the purpose of this discussion). The main memory therefore always contains the most recent version of the data; hence data stored in a cache entry may be discarded at any time, without special processing. As discussed below, this simplifies cache management.
Under a copy-back algorithm, a processor may write modified data to an L1 cache, but is not required to immediately update main memory. The cache entry then contains data that is different from the version in main memory, often referred to as a “dirty” entry. The cache entry is marked to reflect this, such as by setting a “dirty bit.” The modified data is written to main memory at a later time, such as when the cache entry is replaced in processing a cache miss, or under software control. Copy-back cache management may improve performance when a processor performs many data writes, because writing to the cache generally incurs a much shorter latency than writing to main memory. The copy-back algorithm also reduces bus traffic to main memory, which may reduce power consumption. The two cache management algorithms are not mutually exclusive; a single cache may manage some entries under a write-through algorithm, and may manage others using a copy-back algorithm.
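The difference between the two algorithms can be sketched as follows. This is a minimal Python model, not an implementation from this disclosure; the class `TinyCache`, its fields, and the one-word-per-entry simplification are all assumptions made for illustration.

```python
class TinyCache:
    """Toy single-level cache (hypothetical); each entry holds one word."""

    def __init__(self, policy):
        assert policy in ("write-through", "copy-back")
        self.policy = policy
        self.entries = {}        # address -> [value, dirty_bit]
        self.memory = {}         # stands in for main memory
        self.memory_writes = 0   # write transactions sent to main memory

    def store(self, address, value):
        if self.policy == "write-through":
            # Update the cache and main memory together; the entry stays clean.
            self.entries[address] = [value, False]
            self._write_memory(address, value)
        else:
            # Update only the cache and mark the entry dirty.
            self.entries[address] = [value, True]

    def flush(self):
        """Write every dirty entry to main memory and clear its dirty bit."""
        for address, entry in self.entries.items():
            value, dirty = entry
            if dirty:
                self._write_memory(address, value)
                entry[1] = False

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1


# Ten stores to the same address under each policy.
write_through = TinyCache("write-through")
copy_back = TinyCache("copy-back")
for value in range(10):
    write_through.store(0x40, value)
    copy_back.store(0x40, value)
copy_back.flush()  # deferred update, e.g. on replacement or under software control
```

In this run the write-through cache issues ten writes to main memory, while the copy-back cache issues one, and both leave the same final value in main memory.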
Because the cache size is limited compared to main memory, the cache is “shared” by the entire memory, on a temporal basis. That is, data from different areas of main memory may occupy the same cache entry at different times. If a memory access “misses” in the cache, the data are retrieved from main memory and stored in the cache. Once the cache fills with data during use, a cache miss that retrieves data from memory must displace a currently occupied entry in the cache. A cache entry managed under a write-through algorithm may be replaced without any special processing.
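One way different areas of main memory can come to occupy the same cache entry is sketched below. The sketch assumes a direct-mapped organization, which the text above does not specify; the entry count, line size, and function names are hypothetical.

```python
NUM_ENTRIES = 8   # hypothetical number of cache entries
LINE_BYTES = 32   # hypothetical bytes per cache line


def cache_index(address):
    """Entry that a byte address maps to in a direct-mapped cache."""
    return (address // LINE_BYTES) % NUM_ENTRIES


def cache_tag(address):
    """Tag stored with the entry to tell apart addresses sharing an index."""
    return address // (LINE_BYTES * NUM_ENTRIES)


# 0x0000 and 0x0100 lie 256 bytes apart, yet both map to entry 0;
# only the tag distinguishes them, so one must displace the other.
same_entry = cache_index(0x0000) == cache_index(0x0100)
different_tag = cache_tag(0x0000) != cache_tag(0x0100)
```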
A cache entry managed under a copy-back algorithm, however, must be checked to see whether its data are dirty prior to replacement. The cache line must be read, and the dirty bit inspected. If the existing data in the selected cache entry are dirty (that is, different from the version in main memory), they must be written to main memory prior to replacing the cache entry with the new data read from memory. In most implementations, the processor is not aware of the cache management algorithm applied to existing cache entries. Hence, if any entry in the cache is (or may be) managed under a copy-back algorithm, every entry must be read upon replacement to ascertain whether the entry is copy-back, and if so, whether it is dirty (both of which inquiries may collapse to a simple inspection of the dirty bit). Reading every cache entry for dirty bit inspection upon replacement consumes power, and is superfluous when it is known that no cache entry has been modified.
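The replacement procedure described above can be sketched as follows. This is a simplified Python model under stated assumptions (direct-mapped, write-allocate, one word per line, victim read counted on every replacement); the class and counter names are hypothetical and do not come from this disclosure.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False
    dirty: bool = False
    tag: int = 0
    data: int = 0


class CopyBackCache:
    """Toy direct-mapped, write-allocate copy-back cache (hypothetical)."""

    def __init__(self, num_lines=4):
        self.lines = [Line() for _ in range(num_lines)]
        self.memory = {}        # stands in for main memory; missing words read as 0
        self.victim_reads = 0   # line reads performed to inspect the dirty bit
        self.write_backs = 0    # dirty lines written to main memory

    def _lookup(self, address):
        index = address % len(self.lines)
        tag = address // len(self.lines)
        line = self.lines[index]
        if not (line.valid and line.tag == tag):
            self._replace(index, tag)  # cache miss
        return self.lines[index]

    def _replace(self, index, tag):
        victim = self.lines[index]
        self.victim_reads += 1  # the victim line is read on every replacement
        if victim.valid and victim.dirty:
            # Dirty data must reach main memory before the entry is reused.
            victim_address = victim.tag * len(self.lines) + index
            self.memory[victim_address] = victim.data
            self.write_backs += 1
        address = tag * len(self.lines) + index
        self.lines[index] = Line(True, False, tag, self.memory.get(address, 0))

    def load(self, address):
        return self._lookup(address).data

    def store(self, address, value):
        line = self._lookup(address)
        line.data = value
        line.dirty = True


cache = CopyBackCache(num_lines=4)
for address in range(8):        # loads only: no line ever becomes dirty
    cache.load(address)
clean_reads = cache.victim_reads
clean_write_backs = cache.write_backs

cache.store(0, 99)              # makes the line holding address 0 dirty
cache.load(4)                   # maps to the same line, forcing a write-back
```

In this run, the first eight replacements each read the victim line although no line was dirty, so none of those reads led to a write-back; only the final replacement finds a dirty line and writes it to main memory.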