The present invention relates to computers and, more particularly, to a method for managing a set-associative cache. A major objective of the present invention is to reduce the average power consumed during read operations in a set-associative cache that employs parallel reads.
Much of modern progress is associated with the increasing prevalence of computers. In a conventional computer architecture, a data processor manipulates data in accordance with program instructions. The data and instructions are read from, written to, and stored in the computer's "main" memory. Typically, main memory is in the form of random-access memory (RAM) modules.
A processor accesses main memory by asserting an address associated with a memory location. For example, a 32-bit address can select any one of up to 2^32 address locations. In this example, each location holds eight bits, i.e., one "byte" of data, arranged in "words" of four bytes each, arranged in "lines" of four words each. In all, there are 2^30 word locations and 2^28 line locations.
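The arithmetic of this example can be checked with a short sketch. The function name `split_address` and the constant names are hypothetical, chosen only for illustration; the field widths follow from the four-byte words and four-word lines described above.

```python
# Field widths from the 32-bit example: 4 bytes per word, 4 words per line.
ADDRESS_BITS = 32
BYTE_BITS = 2   # selects one of 4 bytes within a word
WORD_BITS = 2   # selects one of 4 words within a line


def split_address(addr):
    """Return (line_address, word_in_line, byte_in_word) for a byte address."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    byte_in_word = addr & ((1 << BYTE_BITS) - 1)
    word_in_line = (addr >> BYTE_BITS) & ((1 << WORD_BITS) - 1)
    line_address = addr >> (BYTE_BITS + WORD_BITS)  # most-significant 28 bits
    return line_address, word_in_line, byte_in_word


NUM_BYTES = 1 << ADDRESS_BITS                 # 2^32 byte locations
NUM_WORDS = NUM_BYTES >> BYTE_BITS            # 2^30 word locations
NUM_LINES = NUM_WORDS >> WORD_BITS            # 2^28 line locations
```

The line address returned here is the 28-bit quantity that identifies which line of main memory contains the requested byte.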
Accessing main memory tends to be much faster than accessing disk and tape-based memories; nonetheless, main-memory accesses can leave a processor idling while it waits for a request to be fulfilled. To minimize such latencies, a cache can intercept processor requests to main memory and attempt to fulfill them faster than main memory can.
To fulfill processor requests to main memory, caches must contain copies of data stored in main memory. In part to optimize access times, a cache is typically much less capacious than main memory. Accordingly, it can represent only a small fraction of main-memory contents at any given time. To optimize the performance gain achievable by a cache, this small fraction must be selected strategically.
In the event of a cache "miss", i.e., when a request cannot be fulfilled by a cache, the cache fetches an entire line of main memory including the memory location requested by the processor. Addresses near a requested address are more likely than average to be requested in the near future. By fetching and storing an entire line, the cache acquires not only the contents of the requested main-memory location, but also the contents of the main-memory locations that are relatively likely to be requested in the near future.
Where the fetched line is stored within the cache depends on the cache type. A fully-associative cache can store the fetched line in any cache-storage location. Typically, any location not containing valid data is given priority as a target storage location for a fetched line. If all cache locations have valid data, the location with the data least likely to be requested in the near term can be selected as the target storage location. For example, the fetched line might be stored in the location with the least recently used data.
The fully-associative cache stores not only the data in the line, but also the line address (the most-significant 28 bits of the address) as a "tag" in association with the line of data. The next time the processor asserts a main-memory address, the cache compares that address with all the tags stored in the cache. If a match is found, the requested data is provided to the processor from the cache.
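A minimal software sketch of such a fully-associative cache, with least-recently-used replacement, might look as follows. The class name, the `fetch_line` callback standing in for main memory, and the use of a dictionary are all assumptions made for illustration; in particular, the dictionary lookup does not model the hardware comparison against every stored tag.

```python
from collections import OrderedDict

OFFSET_BITS = 4  # 16 bytes per line, so the tag is the upper 28 address bits


class FullyAssociativeCache:
    """Toy fully-associative cache with least-recently-used replacement."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # tag -> line data, oldest use first

    def read(self, addr, fetch_line):
        """Return (hit, line); fetch_line(tag) stands in for main memory."""
        tag = addr >> OFFSET_BITS
        if tag in self.lines:
            self.lines.move_to_end(tag)       # now the most recently used
            return True, self.lines[tag]
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)    # evict the least recently used
        self.lines[tag] = fetch_line(tag)     # a free location is used first
        return False, self.lines[tag]
```

Any two addresses within the same 16-byte line share a tag, so a miss on one is followed by a hit on the other.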
In a fully-associative cache, every cache-memory location must be checked for a tag match. Such an exhaustive match checking process can be time consuming, making it hard to achieve the access speed gains desired of a cache. Another problem with a fully-associative cache is that the tags consume a relatively large percentage of cache capacity, which is limited to ensure high-speed accesses.
In a direct-mapped cache, each cache storage location is given an index that, for example, might correspond to the least-significant line-address bits. In the 32-bit address example, a six-bit index might correspond to address bits 23-28, counting from the most-significant bit as bit 1. A restriction is imposed that a line fetched from main memory can only be stored at the cache location with an index that matches bits 23-28 of the requested address. Since those six bits are known, only the first 22 bits are needed as a tag. Thus, less cache capacity is devoted to tags. Also, when the processor asserts an address, only one cache location (the one with an index matching the corresponding bits of the address asserted by the processor) needs to be examined to determine whether or not the request can be fulfilled from the cache.
In a direct-mapped cache, a line fetched in response to a cache miss must be stored at the one location having an index matching the index portion of the read address. Previously written data at that location is overwritten. If the overwritten data is subsequently requested, it must be fetched from main memory. Thus, a direct-mapped cache can force the overwriting of data that may be likely to be requested in the near future. The lack of flexibility in choosing the data to be overwritten limits the effectiveness of a direct-mapped cache.
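The tag and index split, and the forced overwriting it can cause, can be illustrated with a small sketch. The names `split` and `DirectMappedCache` are hypothetical, and only the tags are modeled; the 22-bit tag, six-bit index, and four remaining low-order bits follow the 32-bit example.

```python
OFFSET_BITS = 4   # byte-within-line bits (16-byte lines)
INDEX_BITS = 6    # 64 cache locations


def split(addr):
    """Split a 32-bit address into (tag, index); the tag is the top 22 bits."""
    line_address = addr >> OFFSET_BITS
    index = line_address & ((1 << INDEX_BITS) - 1)
    tag = line_address >> INDEX_BITS
    return tag, index


class DirectMappedCache:
    """Tags only; enough to show hits, misses, and forced overwrites."""

    def __init__(self):
        self.tags = [None] * (1 << INDEX_BITS)

    def read(self, addr):
        """Return True on a hit; on a miss, overwrite the indexed location."""
        tag, index = split(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag   # the only location where this line may go
        return False
```

Two addresses that differ only in their tag bits, such as 0x1230 and 0x1630, map to the same index, so alternating reads between them miss every time even though the other 63 locations are unused.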
A set-associative cache is divided into two or more direct-mapped memories. A set identification value ("set ID"), corresponding to an index for the direct-mapped cache, is associated with one memory location in each set. Thus, in a four-way set-associative cache, there are four cache locations with the same set ID, and thus, four choices of locations to overwrite when a line is stored in the cache. This allows more optimal replacement strategies than are available for direct-mapped caches. Still, the number of locations that must be checked, e.g., one per memory, to determine whether a requested location is represented in the cache is quite limited. Also, the number of bits that need to be compared is reduced by the length of the set ID. Thus, set-associative caches combine some of the replacement strategy flexibility of a fully-associative cache with much of the speed advantage of a direct-mapped cache.
The portion of an asserted address corresponding to the set ID identifies one cache-line location within each cache memory. The tag portion of the asserted address can be compared with the tags at the identified cache-memory line locations to determine whether there is a hit (i.e., tag match) and, if so, in which cache memory the hit occurs. If there is a hit, the least-significant address bits are checked for the requested location within the line; the data at that location is then provided to the processor to fulfill the read request.
A read operation can be hastened by starting the data access before a tag match is determined. While checking the relevant tags for a match, the data locations with the proper set ID within each cache memory are accessed in parallel. By the time a match is determined, data from all of the cache memories (four in a four-way cache) are ready for transmission. The match is used, e.g., as the control input to a multiplexer, to select the data actually transmitted. If there is no match, none of the data is transmitted.
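The parallel read can be modeled in software as follows. This is only a sketch under stated assumptions: the function name `parallel_read`, the list-of-lists representation of the cache memories, and the six-bit set ID are illustrative choices, and the sequential statements stand in for hardware operations that occur concurrently.

```python
OFFSET_BITS = 4
SET_ID_BITS = 6   # assumed width, mirroring the six-bit index example


def split(addr):
    """Split an address into (tag, set_id)."""
    line_address = addr >> OFFSET_BITS
    return line_address >> SET_ID_BITS, line_address & ((1 << SET_ID_BITS) - 1)


def parallel_read(way_tags, way_data, addr):
    """Read every cache memory at the set ID, then let the match select one.

    way_tags[w][s] and way_data[w][s] hold the tag and line stored at
    set ID s of cache memory w. Returns the selected line, or None on a miss.
    """
    tag, set_id = split(addr)
    # Data access: one line is read out of every cache memory.
    candidates = [data[set_id] for data in way_data]
    # Tag match: conceptually concurrent with the data access above.
    matches = [tags[set_id] == tag for tags in way_tags]
    # Multiplexer: the match selects which candidate is transmitted.
    for hit, line in zip(matches, candidates):
        if hit:
            return line
    return None  # no match, so none of the data is transmitted
```

Note that `candidates` always holds one line per cache memory, whether or not any of them is ultimately used.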
The parallel-read operation is much faster since the data is accessed at the same time as the match operation is conducted rather than after. For example, a parallel "tag-and-data" read operation might consume only one memory cycle, while a serial "tag-then-data" read operation might require two cycles. Alternatively, if the serial read operation consumes only one cycle, the parallel read operation permits a shorter cycle, allowing for more processor operations per unit of time.
The gains of the parallel tag-and-data reads are not without some cost. The data accesses to the sets that do not provide the requested data consume additional power that can tax power sources and dissipate extra heat. The heat can fatigue, impair, and damage the incorporating integrated circuit and proximal components. Accordingly, larger batteries or power supplies and more substantial heat removal provisions may be required. What is needed is a cache-management method that achieves the speed advantages of parallel reads but with reduced power consumption.
In a context in which parallel reads are performed by default to achieve a performance advantage, the present invention provides for initiating a serial tag-match-then-access read during a wait state. The tag-match is performed during the wait. When the wait is released, the data to be read can be accessed as determined by the tag-match operation. Further tag-then-access reads can be performed in a pipelined fashion.
For at least some processors, the assertion of a wait does not preclude the processor from requesting data. Instead, the wait prevents the processor from recognizing a clock transition that would indicate when the requested read data is valid. The read request cannot be fulfilled while the wait is asserted. However, the tag matching can be performed. By the time the wait is released, the tag match is completed. The tag match data is thus available by the time the data is needed by the processor. Accordingly, only a cache memory having the requested data needs to be accessed. The other cache memories do not need to be accessed. Thus, the power associated with those superfluous accesses can be saved.
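The saving can be made concrete by counting data accesses in a serial read. The sketch below is illustrative only: `CountingMemory` and `serial_read` are hypothetical names, and the read counter is a crude stand-in for the power drawn by each data access.

```python
class CountingMemory:
    """Data array of one cache memory that counts how often it is read."""

    def __init__(self, num_sets):
        self.lines = [None] * num_sets
        self.reads = 0

    def read(self, set_id):
        self.reads += 1
        return self.lines[set_id]


def serial_read(way_tags, memories, tag, set_id):
    """Tag-match-then-access read; returns the line, or None on a miss."""
    # Step 1 (can be performed during the wait): compare tags only.
    hit_way = None
    for way, tags in enumerate(way_tags):
        if tags[set_id] == tag:
            hit_way = way
            break
    if hit_way is None:
        return None   # miss: no data array is accessed at all
    # Step 2 (after the wait is released): access only the matching memory.
    return memories[hit_way].read(set_id)
```

In a four-way cache, a parallel read would access all four data arrays for the same request, whereas this serial read accesses one on a hit and none on a miss.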
For example, consider the case in which a parallel read consumes one system cycle and a serial read consumes two cycles, the first of which is devoted to the tag-match operation and the second to accessing the data as indicated by the tag match. If a wait is asserted for one cycle, a parallel read cannot be implemented until the cycle following the wait. In the case of a serial read, the tag-match can be completed during the wait. In either case (assuming a cache hit), the read is fulfilled in the second cycle.
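The timing in this example can be expressed as a small model. The function names are hypothetical, and the treatment of waits longer than one cycle is an assumption that extends the one-cycle case described above.

```python
def completion_cycle(mode, wait_cycles):
    """Cycle, counted from 1 at the request, in which a cache hit is fulfilled.

    Assumes a one-cycle parallel read and a two-cycle serial read.
    """
    if mode == "parallel":
        # Tag match and data access share one cycle, which must follow the wait.
        return wait_cycles + 1
    if mode == "serial":
        # The tag match occupies cycle 1 and may overlap the wait; the data
        # access takes the first cycle after both the match and the wait.
        return max(wait_cycles, 1) + 1
    raise ValueError(f"unknown mode: {mode}")


def choose_read_mode(wait_asserted):
    """Parallel by default; serial when a wait hides the tag-match cycle."""
    return "serial" if wait_asserted else "parallel"
```

With no wait, `completion_cycle` gives cycle 1 for a parallel read and cycle 2 for a serial read. With a one-cycle wait, both complete in cycle 2, which is why selecting the serial read in that case costs no latency.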
Many caches do not provide for initiating a read operation while they are asserting a wait. However, many computer systems have multiple devices that can cause a wait to be asserted. For example, in a Harvard architecture, there can be separate data and instruction caches. A suitable processor can issue a read to one cache while a wait is asserted due to an incomplete operation involving the other cache. Thus, the power savings afforded by the invention can be especially significant in Harvard and other architectures in which there are multiple devices that can be the cause of a wait being asserted.
The present invention provides for power savings without impairing performance. A parallel read initiated during a wait cannot be completed until the wait is removed. Thus, while a wait is asserted, there is no latency advantage to asserting a parallel read instead of a serial read. Therefore, in such a circumstance, the power savings associated with a serial tag-match-then-access read is achieved without a performance penalty. Further power savings can be achieved by pipelining subsequent read operations. These and other features and advantages are apparent from the description below with reference to the following drawings.