The present. invention relates to computers and, more particularly, to a method for managing a set-associative cache. A major objective of the present invention is to reduce average cache access times.
Much of modern progress is associated with the increasing prevalence. of computers. In a conventional computer architecture, a data processor manipulates data in accordance with program instructions. The data and instructions are read from, written to, and stored in the computer""s xe2x80x9cmainxe2x80x9d memory. Typically, main memory is in the form of random-access memory (RAM) modules.
A processor access main memory by asserting an address associated with a memory location. For example, a 32-bit address can select any one of up to 232 address locations. In this example, each location holds eight bits, i.e., one xe2x80x9cbytexe2x80x9d of data, arranged in xe2x80x9cwordsxe2x80x9d of four bytes each, arranged in xe2x80x9clinesxe2x80x9d of four words each. In other words, there are 230 word locations, and 228 line locations.
Accessing main memory tends to be much faster than accessing disk and tape-based memories; nonetheless, main memory accesses can leave a processor idling while it waits for a request to be fulfilled. To minimize such latencies, cache memories intercept processor requests to main memory and attempt to fulfill them faster than main memory can.
To fulfill processor requests to main memory, caches must contain copies of data stored in main memory. In part to optimize access times, a cache is typically much less capacious than main memory. Accordingly, it can represent only a small fraction of main memory contents at any given time. To optimize the performance gain achievable by a cache, this small fraction must be carefully selected.
In the event of a cache xe2x80x9cmissxe2x80x9d, when a request cannot be fulfilled by a cache, the cache fetches the entire line of main memory including the memory location requested by the processor. This entire line is stored in the cache since a processor is relatively likely to request data from locations that are near a location from which a recently made request was made. Where the line is stored depends on the type of cache.
A fully associative cache can store the fetched line in any cache storage location. The fully associative cache stores not only the data in the line, but also stores the line-address (the most-significant 28 bits) of the address as a xe2x80x9ctagxe2x80x9d in association with the line of data. The next time the processor asserts a main-memory address, the cache compares that address with all the tags stored in the cache. If a match is found, the requested data is provided to the processor from the cache.
There are two problems with a fully associative cache. The first is that the tags consume a relatively large percentage of cache capacity, which is limited to ensure high-speed accesses. The second problem is that every cache memory location must be checked to determine whether there is a tag that matches a requested address. Such an exhaustive match checking process can be time-consuming, making it hard to achieve the access speed gains desired of a cache.
In a direct-mapped cache, each cache storage location is given an index which, for example, might correspond to the least-significant line-address bits. For example, in the 32-bit address example, a six-bit index might correspond to address bits 23-28. A restriction is imposed that a line fetched from main memory can only be stored at the cache location with an index that matches bits 23-28 of the requested address. Since those six bits are known, only the first 22 bits are needed as a tag. Thus, less cache capacity is devoted to tags. Also, when the processor asserts an address, only one. cache location (the one with an index matching the corresponding bits of the address asserted by the processor) needs to be examined to determine whether or not the request can be fulfilled from the cache.
The problem with a direct-mapped cache is that when a line is stored in the cache, it must overwrite any data stored at that location. If the data overwritten is data that would be likely to be called in the near term, this overwriting diminishes the effectiveness of the cache. A direct-mapped cache does not provide the flexibility to choose which data is to be overwritten to make room for new data.
In a set-associative cache, the memory is divided into two or more direct-mapped sets. Each index is associated with one memory location in each set. Thus, in a four-way set associative cache, there are four cache locations with the same index, and thus, four choices of locations to overwrite when a line is stored in the cache. This allows more optimal replacement strategies than are available for direct-mapped caches. Still, the number of locations that must be checked, e.g., one per set, to determine whether a requested location is represented in the cache is quite limited, and the number of bits that need to be compared is reduced by the length of the index. Thus, set-associative caches provide an attractive compromise that combines some of the replacement strategy flexibility of a fully associative cache with much of the speed advantage of a direct-mapped cache.
When a set-associative cache receives an address from a processor, it determines the relevant cache locations by selecting the cache locations with an index that matches the corresponding address bits. The tags stored at the cache locations corresponding to that index are checked for a match. If there is a match, the. least-significant address bits are checked for the word location (or fraction thereof) within the line stored at the match location. The contents at that location are then accessed and transmitted to the processor.
In the case of a read operation, the cache access can be hastened by starting the data access before a match is determined. While checking the relevant tags for a match, the appropriate data locations within each set having the appropriate index are accessed. By the time a match is determined, data from all four sets are ready for transmission. The match is used, e.g., as the control input to a multiplexer, to select the data actually transmitted. If there is no match, none of the data is transmitted.
The read operation is much faster since the data is accessed at the same time as the match operation is conducted rather than after. For example, a parallel xe2x80x9ctag-and-dataxe2x80x9d read operation might consume only one memory cycle, while a serial xe2x80x9ctag-then-dataxe2x80x9d read operation might require two cycles. Alternatively, if the serial read operation consumed only one cycle, the parallel read operation would permit a shorter cycle, allowing for more processor operations per unit of time.
The gains of the parallel tag-and-data reads are not without some cost. The data accesses that do not provide the requested data consume additional power that can tax power sources and dissipate extra heat. The heat can fatigue, impair, and damage the incorporating integrated circuit and proximal components. Accordingly, larger batteries or power supplies and more substantial heat removal provisions may be required.
Nonetheless, such provisions are generally well worth the speed advantages of the parallel tag-and-data read accesses. A comparable approach to hastening write operations is desired. Unfortunately, the parallel tag-and-data approach is not applied to write operations since parallel data access would involve overwriting data that should be preserved. Accordingly, in the context of a system using parallel reads, the write operations have become a more salient limit to performance. What is needed is a cache management method in which write operation times more closely match those achieved using parallel tag-and-data reads.
The present invention provides a cache-management method using pipelined cache writes in conjunction with parallel tag-and-data reads. Thus, the cache can accept a second write request while writing the data from the previous write request. While each write operation takes place over two cycles, a series of write operations consumes much less than two cycles per write on average.
There are three possible types of processor events that can follow a pipelined write operation on the next cycle. The first type is a write operation, in which case the pipelining obtains the expected performance benefit. The second type is a xe2x80x9cno-operationxe2x80x9d, neither a read nor a write. In this case, the pipelining allows the no-op to be executed during the second stage of a preceding write operation.
The third type is a read operation. In the case a parallel read follows a pipelined write, the read would not be begun until the write was completed. Thus, a wait might be inserted in the first stage of the write pipeline during the second cycle of the write. In this case, the speed advantage of pipelining is not realized.
Surprisingly, a pipelined write permits a power savings for an immediately following read operation. Instead of waiting until the second cycle of the write operation is completed to begin a parallel read, a serial read can be started in the second cycle of the write operation. The serial read can be completed in the first cycle after completion of the write operation, i.e., by the time a one-cycle parallel read operation would have been completed. Hence, the power savings associated with a serial read are achieved with none of the time penalty normally associated with serial reads. Thus, the invention provides both for faster average write access rates and for reduced power consumption. These and other features and advantages of the invention are apparent from the description below with reference to the following drawing.