The present invention relates to computers and, more particularly, to a method for managing a set-associative cache. A major objective of the present invention is to reduce the average power consumed during single-cycle read operations in a set-associative cache that employs parallel reads.
Much of modern progress is associated with the increasing prevalence of computers. In a conventional computer architecture, a data processor manipulates data in accordance with program instructions. The data and instructions are read from, written to, and stored in the computer's “main” memory. Typically, main memory is in the form of random-access memory (RAM) modules.
A processor accesses main memory by asserting an address associated with a memory location. For example, a 32-bit address can select any one of up to 2^32 address locations. In this example, each location holds eight bits, i.e., one “byte” of data, arranged in “words” of four bytes each, arranged in “lines” of four words each. In all, there are 2^30 word locations, and 2^28 line locations.
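The location counts in this example follow directly from the address width and the grouping of bytes into words and lines. The following is a minimal arithmetic check, assuming byte-addressed memory as described above; the constant and variable names are illustrative only.

```python
# Illustrative constants taken from the 32-bit example above.
ADDRESS_BITS = 32
BYTES_PER_WORD = 4
WORDS_PER_LINE = 4

# One address per byte location.
byte_locations = 2 ** ADDRESS_BITS
# Four bytes per word, four words per line.
word_locations = byte_locations // BYTES_PER_WORD
line_locations = word_locations // WORDS_PER_LINE

print(word_locations == 2 ** 30, line_locations == 2 ** 28)
```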
Accessing main memory tends to be much faster than accessing disk and tape-based memories; nonetheless, even main memory accesses can leave a processor idling while it waits for a request to be fulfilled. To minimize such latencies, a cache can intercept processor requests to main memory and attempt to fulfill them faster than main memory can.
To fulfill processor requests to main memory, caches must contain copies of data stored in main memory. In part to optimize access times, a cache is typically much less capacious than main memory. Accordingly, it can represent only a small fraction of main-memory contents at any given time. To optimize the performance gain achievable by a cache, this small fraction must be selected strategically.
In the event of a cache “miss”, i.e., when a request cannot be fulfilled by a cache, the cache fetches an entire line of main memory including the memory location requested by the processor. Addresses near a requested address are more likely than average to be requested in the near future. By fetching and storing an entire line, the cache acquires not only the contents of the requested main-memory location, but also the contents of the main-memory locations that are relatively likely to be requested in the near future.
Where the fetched line is stored within the cache depends on the cache type. A fully-associative cache can store the fetched line in any cache storage location. Typically, any location not containing valid data is given priority as a target storage location for a fetched line. If all cache locations have valid data, the location with the data least likely to be requested in the near term can be selected as the target storage location. For example, the fetched line might be stored in the location with the least recently used data.
The fully-associative cache stores not only the data in the line, but also the line-address (the most-significant 28 bits) of the address as a “tag” in association with the line of data. The next time the processor asserts a main-memory address, the cache compares that address with all the tags stored in the cache. If a match is found, the requested data is provided to the processor from the cache.
In a fully-associative cache, every cache-memory location must be checked for a tag match. Such an exhaustive match checking process can be time-consuming, making it hard to achieve the access speed gains desired of a cache. Another problem with a fully-associative cache is that the tags consume a relatively large percentage of cache capacity, which is limited to ensure high-speed accesses.
In a direct-mapped cache, each cache storage location is given an index which, for example, might correspond to the least-significant line-address bits. For example, in the 32-bit address example, a six-bit index might correspond to address bits 23-28, counting from the most-significant bit as bit 1. A restriction is imposed that a line fetched from main memory can only be stored at the cache location with an index that matches bits 23-28 of the requested address. Since those six bits are known, only the first 22 bits are needed as a tag. Thus, less cache capacity is devoted to tags. Also, when the processor asserts an address, only one cache location (the one with an index matching the corresponding bits of the address asserted by the processor) needs to be examined to determine whether or not the request can be fulfilled from the cache.
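The division of the 32-bit address into fields can be sketched as follows. This is only an illustration under the assumptions of the running example (a 22-bit tag, a six-bit index, two bits selecting a word within a line, and two bits selecting a byte within a word); the function name split_address is hypothetical.

```python
def split_address(addr):
    """Split a 32-bit address into (tag, index, word, byte) fields.

    Field widths are assumed from the example: 22 + 6 + 2 + 2 = 32 bits,
    with the tag in the most-significant positions.
    """
    byte = addr & 0x3           # 2 least-significant bits: byte within word
    word = (addr >> 2) & 0x3    # next 2 bits: word within line
    index = (addr >> 4) & 0x3F  # next 6 bits: cache index
    tag = addr >> 10            # remaining 22 bits: tag
    return tag, index, word, byte


# Two addresses in the same line share tag and index; only the word differs.
print(split_address(0x12345670))
print(split_address(0x12345678))
```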
In a direct-mapped cache, a line fetched in response to a cache miss must be stored at the one location having an index matching the index portion of the read address. Previously written data at that location is overwritten. If the overwritten data is subsequently requested, it must be fetched from main memory. Thus, a direct-mapped cache can force the overwriting of data that may be likely to be requested in the near future. The lack of flexibility in choosing the data to be overwritten limits the effectiveness of a direct-mapped cache.
A set-associative cache has memory divided into two or more direct-mapped sets. Each index is associated with one memory location in each set. Thus, in a four-way set-associative cache, there are four cache locations with the same index, and thus, four choices of locations to overwrite when a line is stored in the cache. This allows better replacement strategies than are available for direct-mapped caches. Still, the number of locations that must be checked, e.g., one per set, to determine whether a requested location is represented in the cache is quite limited, and the number of bits that need to be compared is reduced by the length of the index. Thus, set-associative caches combine some of the replacement strategy flexibility of a fully-associative cache with much of the speed advantage of a direct-mapped cache.
The index portion of an asserted address identifies one cache-line location within each cache set. The tag portion of the asserted address can be compared with the tags at the identified cache-line locations to determine whether there is a hit (i.e., tag match) and, if so, in what set the hit occurs. If there is a hit, the least-significant address bits are checked for the requested location within the line; the data at that location is then provided to the processor to fulfill the read request.
A read operation can be hastened by starting the data access before a tag match is determined. While checking the relevant tags for a match, the appropriately indexed data locations within each set are accessed in parallel. By the time a match is determined, data from all four sets are ready for transmission. The match is used, e.g., as the control input to a multiplexer, to select the data actually transmitted. If there is no match, none of the data is transmitted.
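The parallel read described above can be modeled in software as a rough sketch. The following assumes a four-set cache represented by two hypothetical lists of lists, tags[s][index] and data[s][index]; it counts data-array accesses to make visible that all sets are read even though at most one supplies the result. It is a behavioral illustration, not a description of any particular circuit.

```python
NUM_SETS = 4  # assumed four-way set-associative cache


def parallel_read(tags, data, tag, index, word):
    """Return (value or None, number of data-array accesses)."""
    # The indexed data location of every set is read while the tags are compared.
    candidates = [data[s][index] for s in range(NUM_SETS)]
    matches = [tags[s][index] == tag for s in range(NUM_SETS)]
    data_accesses = NUM_SETS
    if True not in matches:
        return None, data_accesses   # miss: none of the data is transmitted
    selected = matches.index(True)   # the match acts as the multiplexer control
    return candidates[selected][word], data_accesses


# Small usage example: one valid line in set 1 at index 5.
tags = [[None] * 64 for _ in range(NUM_SETS)]
data = [[None] * 64 for _ in range(NUM_SETS)]
tags[1][5] = 0x2A
data[1][5] = [7, 8, 9, 10]
print(parallel_read(tags, data, 0x2A, 5, 2))
```

In this model a hit and a miss both cost four data-array accesses, which is the power overhead discussed below.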
The parallel read operation is much faster since the data is accessed at the same time as the match operation is conducted rather than after. For example, a parallel “tag-and-data” read operation might consume only one memory cycle, while a serial “tag-then-data” read operation might require two cycles. Alternatively, if the serial read operation consumes only one cycle, the parallel read operation permits a shorter cycle, allowing for more processor operations per unit of time.
The gains of the parallel tag-and-data reads are not without some cost. The data accesses to the sets that do not provide the requested data consume additional power that can tax power sources and dissipate extra heat. The heat can fatigue, impair, and damage the incorporating integrated circuit and proximal components. Accordingly, larger batteries or power supplies and more substantial heat removal provisions may be required. What is needed is a cache-management method that achieves the speed advantages of parallel reads but with reduced power consumption.
The present invention provides for preselection of a set from which data is to be read. The preselection is based on a tag match with a preceding read. In this case, it is not necessary to access all sets, but only the preselected set. When only one set is selected, a power saving accrues.
The invention provides for comparing a present line address with the line address asserted in an immediately preceding read operation. If the line addresses match, a single-set read can be implemented instead of a parallel read.
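The line-address comparison can be illustrated with a small behavioral sketch. It assumes the 28-bit line address of the earlier example (the address shifted right by four bits) and a four-set cache; the class SetPreselector and its method names are hypothetical and serve only to show the decision between a single-set read and a parallel read.

```python
class SetPreselector:
    """Remembers the line address and hit set of the immediately preceding read."""

    def __init__(self):
        self.last_line_address = None
        self.last_hit_set = None

    def choose_sets(self, addr, num_sets=4):
        """Return the list of sets whose data must be accessed for this read."""
        line_address = addr >> 4  # most-significant 28 bits of a 32-bit address
        if line_address == self.last_line_address and self.last_hit_set is not None:
            return [self.last_hit_set]    # single-set read
        return list(range(num_sets))      # default parallel read

    def record(self, addr, hit_set):
        """Store the outcome of a read; hit_set is None on a miss."""
        self.last_line_address = addr >> 4
        self.last_hit_set = hit_set


p = SetPreselector()
p.record(0x1000, 2)          # previous read hit in set 2
print(p.choose_sets(0x1004)) # same line: only set 2 is accessed
print(p.choose_sets(0x1010)) # different line: all sets are accessed
```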
The invention provides for checking for a tag match at one or more line locations in a set other than the location used to satisfy a current request. A tag match at such a “second” location does not result immediately in included data being accessed; instead a flag (or other indicator) is set indicating the tag match. This indication is used in an immediately succeeding read operation to determine whether the second line location can be preselected for a single-set read operation. If the tag portion of the next requested address matches the tag portion of the previously requested address, and the latter was matched by the tag at the second location, a single-set read can be performed.
The invention has special application to computer systems that have a processor that indicates whether a read address is sequential or non-sequential. By default, e.g., when a read is non-sequential, a parallel read is implemented. If the read is sequential to a previous read that resulted in a cache hit, the type of read can depend on word position within the cache line.
If the word position is not at the beginning of the cache line, then the tag is unchanged. Thus, a hit at the same index and set is assured. Accordingly, a “same-set” read is used. However, if the word position is at the beginning of a line, the index is different and a different tag may be stored at the indexed location. Accordingly, a parallel read can be used.
In a further refinement, if a read that is sequential to a read resulting in a hit corresponds to the end of a cache line, the next index location can be checked. This makes use of the tag-match circuitry that would otherwise be idle in the sequential read. The tag matching can be limited to only the set selected for the current read; alternatively, all sets can be checked. If the next read is sequential, it will correspond to the beginning of a line. However, the tag matching for this read will already have been completed. Accordingly, a single-set read can be performed.
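The choice of read type for sequential and non-sequential reads, including the refinement just described, can be summarized in a short decision sketch. It assumes four words per line as in the earlier example; the function names, the string labels, and the prechecked_hit flag (standing for a tag match already found at the next index location) are illustrative assumptions rather than terms fixed by this description.

```python
WORDS_PER_LINE = 4  # assumed from the four-word line of the example


def choose_read_type(sequential, prev_hit, word, prechecked_hit=False):
    """Pick the read type for the current read.

    sequential:     processor indicates the read follows the previous address
    prev_hit:       the previous read resulted in a cache hit
    word:           word position of the current read within its line
    prechecked_hit: a tag match was already found at this index location
    """
    if not sequential or not prev_hit:
        return "parallel"      # default case
    if word != 0:
        return "same-set"      # same line as before, so a hit is assured
    if prechecked_hit:
        return "single-set"    # tag matching already completed in advance
    return "parallel"          # new line, tag not yet checked


def should_precheck_next_index(read_type, word):
    """Use the otherwise idle tag-match circuitry at the end of a line."""
    return read_type == "same-set" and word == WORDS_PER_LINE - 1


print(choose_read_type(True, True, 3))
print(should_precheck_next_index("same-set", 3))
print(choose_read_type(True, True, 0, prechecked_hit=True))
```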
For many read operations, the present invention accesses only one set instead of all the sets that are accessed in a parallel read operation. Yet, there is no time penalty associated with the single-set reads provided by the invention. Thus, the power savings of single-set reads are achieved without sacrificing the speed advantages of the parallel reads. These and other features and advantages of the invention are apparent from the description below with reference to the following drawings.