1. Field of the Invention
The present invention generally relates to a method and system for providing an improved store-in cache, and more particularly, to the operation of stores in a cache system and the reliable maintenance of locally modified data in such a cache system.
2. Description of the Conventional Art
Caches are categorized according to many different parameters, each of which has its own implications for performance, power, design complexity, and limitations of use. One of the major parameters used is the Store Policy, which determines how stores to the cache are handled. Such a Store Policy includes two basic approaches, called Store-In and Store-Through.
When storing into a Store-In cache, that is all that one needs to do: store into it. This is exceedingly simple. However, the directory entry for any line that has been stored to must have a status bit (sometimes called a “dirty bit”) to indicate that the contents of the line have been changed. When a store has not been percolated into the rest of the cache hierarchy, but has simply been made to the local cache, the local cache holds the most recent, and hence the only valid, copy of the new data.
This means that if a remote processor attempts to reference this line, it will miss in its local cache, and it must get the only valid copy from the only place where that copy exists, which is the local cache of the processor that last stored into the line. It further means that if a cache selects a line for replacement that has its “dirty bit” set, the modified line cannot simply be overwritten. First, the modified line has to be written back to the next cache level in the hierarchy. This operation is called a “Castout.”
Usually, a Castout is done by moving the modified line into a “Castout Buffer,” then waiting for the bus (to the next level in the cache hierarchy) to become available (because it should be busy bringing in the new line to replace the Castout), and then moving the line out of the Castout Buffer and over the bus to the next cache level. While a Castout sounds like it is a lot of trouble because it is a new operation that needs to be done, in fact the effect of Castouts is to reduce the overall traffic. This is because most lines that get modified get modified repeatedly. The Castout essentially aggregates these multiple modifications into a single transfer—unlike what occurs in the second approach to the Store Policy, which is a Store-Through approach.
In a Store-Through cache, when data is stored into the local cache, it is also “stored through” the cache, which means that it is stored into the next level of cache too. Thus the total store bandwidth coming out of a Store-Through cache is higher, since every store goes through it. It is noted that a Store-In cache has the effect of aggregating multiple stores made to the same location. It is also noted that, with a Store-Through cache, not only does it have the most recent copy of the stored data, but the next layer of cache in the cache hierarchy has it as well. This means that remote misses can be serviced directly from the next layer of cache in the hierarchy (which may be quicker), and it also means that soft-errors occurring in the lower level of cache are not fatal, since valid data exists in the next level above it.
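The difference in store traffic between the two policies can be illustrated with a small model. The following is only a sketch under stated assumptions: the 128-byte line size, the four-line fully associative LRU cache, and the function names are all hypothetical choices made for illustration and are not taken from any particular design.

```python
from collections import OrderedDict

LINE_BYTES = 128  # assumed line size, for illustration only


def store_through_traffic(store_addrs):
    """Store-Through: every store is forwarded to the next level."""
    return len(store_addrs)


def store_in_traffic(store_addrs, capacity_lines=4):
    """Store-In: only Castouts of dirty lines reach the next level.

    Models a tiny fully associative LRU cache (an assumption).
    Returns (castouts, lines_still_dirty_in_cache).
    """
    cache = OrderedDict()  # line address -> dirty bit
    castouts = 0
    for addr in store_addrs:
        line = addr // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)           # refresh LRU position
        elif len(cache) >= capacity_lines:
            _, dirty = cache.popitem(last=False)  # evict least recently used
            if dirty:
                castouts += 1                 # one line transfer per dirty victim
        cache[line] = True                    # the store sets the dirty bit
    return castouts, sum(cache.values())


# 50 byte stores to each of 6 different lines
trace = [line * LINE_BYTES + (i % LINE_BYTES)
         for line in range(6) for i in range(50)]
print(store_through_traffic(trace))  # 300 store transactions
print(store_in_traffic(trace))       # (2, 4): 2 Castouts, 4 lines still dirty
```

In this trace, 300 individual stores reach the next level under Store-Through, whereas the Store-In model sends only two line Castouts, because the repeated stores to each line are aggregated in the local cache.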
Conventionally, server processors used for reliable applications all have Store-Through L1 caches, which means that each store made by the processor is done to both its L1 cache and to the next cache level in the hierarchy. This is precisely to protect against soft errors in modified L1 lines, which works because there is a recoverable copy of the data further up in the cache hierarchy.
Of course, having a Store-Through L1 cache would not be a requirement for reliability if Error Correcting Codes (ECC) were used at the L1 level, but this is very difficult to do for the following reason. Many stores in database applications are single byte stores. Maintaining ECC on a byte granularity requires 3 additional bits per byte, which is quite costly.
The alternative to using byte-ECC is to use doubleword (8 byte) ECC, which requires 8 bits per doubleword—the same overhead as byte parity. However, doubleword ECC would require a longer pipeline for byte store instructions, because the ECC would need to be regenerated for the entire doubleword containing the byte. Doing a byte store would no longer simply be a matter of storing a byte. Instead, it first would require reading out the original doubleword, then doing an ECC check to verify that the data in the doubleword is good, then merging the new byte into the doubleword, then regenerating the ECC for the modified doubleword, and finally, storing the new doubleword back. The performance lost to this longer pipeline can be significant.
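The five steps of such a byte store can be sketched as follows. This is a minimal illustration, not an implementation of any actual pipeline: `check_code` is a hypothetical stand-in that uses a CRC purely to detect corruption (a real design would use an error correcting code), and the class and function names are assumptions.

```python
import zlib


def check_code(doubleword: bytes) -> int:
    """Hypothetical stand-in for the doubleword ECC (detection only)."""
    return zlib.crc32(doubleword)


class ProtectedDoubleword:
    """An 8-byte doubleword stored together with its check code."""

    def __init__(self, data: bytes = bytes(8)):
        self.data = bytes(data)
        self.code = check_code(self.data)


def store_byte(dw: ProtectedDoubleword, offset: int, value: int):
    """Byte store into a doubleword-protected location.

    Returns the list of steps performed, to show why the pipeline
    is longer than a plain byte write.
    """
    steps = []
    old = dw.data                                   # 1. read original doubleword
    steps.append("read")
    if check_code(old) != dw.code:                  # 2. verify the old data
        raise ValueError("doubleword failed its check before the store")
    steps.append("check")
    merged = old[:offset] + bytes([value]) + old[offset + 1:]  # 3. merge byte
    steps.append("merge")
    new_code = check_code(merged)                   # 4. regenerate the code
    steps.append("regenerate")
    dw.data, dw.code = merged, new_code             # 5. store doubleword back
    steps.append("write")
    return steps


dw = ProtectedDoubleword()
print(store_byte(dw, 3, 0xAB))
```

With byte parity, by contrast, the same store would be a single write of one byte and its parity bit, with no dependence on the neighboring seven bytes.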
In some cases, for performance reasons it is more desirable to have the L1 be a Store-In cache. In a Store-In cache, stores do not percolate through the L1 into the rest of the hierarchy, but instead are accumulated in the L1 lines. The only event in which data is written up to the next level in the hierarchy is if a modified line is chosen (by the L1) for replacement, i.e., for a Castout. In this case, the entire line is written out to the next cache level in the hierarchy.
One reason that this is desirable is that the higher levels in the hierarchy are shielded from the raw store bandwidth. Another reason is that certain optimizations can be made in higher levels of the hierarchy if they need only deal with a single store quantum (e.g., just lines as opposed to both lines and doublewords).
In conventional systems and methods, even when a Store-In cache is preferable, such is not an option if reliable operation is a requirement. The present invention overcomes the above problems.
Some conventional Store-In and Store-Through cache implementations are described below.
FIG. 1 exemplarily shows a processor 100 with an existing-art Store-In cache 101. For the cache to be able to fetch and store data to the next level, the system includes a Bus Interface Unit (BIU) 102. The system also includes a Castout Buffer (COB) 103 for managing Castouts.
In the exemplary arrangement illustrated in FIG. 1, the processor 100 need not be concerned with the machinations of the BIU 102 or the COB 103. Instead, the processor 100 interacts only with the cache 101 itself. When the processor 100 fetches from the cache 101, the processor 100 receives doublewords (with byte parity), but when the processor 100 stores, it can store data (again, with byte parity) on an individual byte granularity.
When there is a cache miss, the cache 101 sends the miss transaction to the Bus Interface Unit (BIU) 102. The “transaction” includes the miss address and the desired state of the miss data (meaning shared or exclusive). The BIU 102 forwards this information to the next cache level in the hierarchy as a “miss request.” In the meantime, if the line that the cache selects for replacement by the incoming miss line has been modified locally, the modified line needs to be sent to the next cache level to update its copy of the line. To prepare for this, the cache 101 moves the modified line into the Castout Buffer (COB) 103, which notifies the BIU 102 that it has a Castout.
Typically, by the time that the modified line is moved from the cache 101 to the Castout Buffer 103, the BIU 102 will be in the process of handling the incoming line from the miss request, and putting it into the cache 101. Once the incoming line has been completely transferred, the BIU 102 will send the modified line from the Castout Buffer 103 up to the next cache level in the hierarchy (not shown).
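The ordering of events in this miss sequence can be sketched as a small model. It is only an illustration under stated assumptions: the class name, the two-line capacity, the first-in first-out victim selection, and the use of a dictionary to stand in for the next cache level are all hypothetical.

```python
class StoreInCacheModel:
    """Toy Store-In cache with a one-entry Castout Buffer."""

    def __init__(self, next_level, capacity=2):
        self.next_level = next_level  # dict: line address -> line data
        self.capacity = capacity
        self.lines = {}               # line address -> [bytearray, dirty bit]
        self.castout_buffer = None    # pending (line address, data) or None
        self.log = []                 # ordered record of events

    def _miss(self, line_addr):
        self.log.append(("miss_request", line_addr))
        if len(self.lines) >= self.capacity:
            victim = next(iter(self.lines))      # oldest line (FIFO, assumed)
            data, dirty = self.lines.pop(victim)
            if dirty:
                # A modified victim is parked until the bus is free.
                self.castout_buffer = (victim, data)
                self.log.append(("to_castout_buffer", victim))
        # The incoming line is transferred first ...
        self.lines[line_addr] = [bytearray(self.next_level[line_addr]), False]
        self.log.append(("line_fill", line_addr))
        # ... and only then is the Castout sent to the next level.
        if self.castout_buffer is not None:
            victim, data = self.castout_buffer
            self.next_level[victim] = bytes(data)
            self.castout_buffer = None
            self.log.append(("castout_sent", victim))

    def store(self, line_addr, offset, value):
        if line_addr not in self.lines:
            self._miss(line_addr)
        entry = self.lines[line_addr]
        entry[0][offset] = value   # the store itself only touches the cache
        entry[1] = True            # and sets the dirty bit


next_level = {addr: bytes(128) for addr in range(3)}
cache = StoreInCacheModel(next_level)
cache.store(0, 5, 0xFF)
cache.store(1, 0, 0x01)
cache.store(2, 0, 0x02)   # replaces modified line 0, forcing a Castout
print(cache.log[-4:])
```

Until the third store causes line 0 to be replaced, the next level still holds the stale copy of that line; the modified data only reaches it when the Castout is sent after the line fill.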
Note that the processor 100 interacts with the cache on either a doubleword granularity for fetches, or on a byte granularity for stores.
For purposes of this disclosure, “byte granularity” generally means that the stores can be as small as a single byte, but they can also be multiple bytes, up to a doubleword.
On the other hand, the Bus Interface Unit 102, hence the next cache in the hierarchy (not shown), only works with cache lines, which are typically 128 bytes. That is to say that all transactions to the next cache level in the hierarchy (not shown) are either line fetches or line stores. This means that the next level in the cache hierarchy can be optimized to handle only lines.
The rate of transactions (which are all line transactions) to the next cache level in the hierarchy is the basic L1 miss rate (the rate of line fetch requests) plus the Castout rate (the rate of line store requests). Since only a fraction of the misses will cause Castouts, the Castout rate will be a fraction of the miss rate.
FIG. 2 exemplarily shows a similar processor 200 with an existing-art Store-Through cache 201. As before, the processor 200 interacts with the cache 201 by fetching doublewords and storing bytes, all with parity. But since all stores are to be stored-through, every store done by the processor 200 begins with the cache 201 transferring to the processor 200, during the normal store-pretest, the doubleword into which the store will be made. When the processor 200 stores a byte back to the cache 201, it also merges the byte into the prefetched doubleword, and it sends the modified doubleword to a Pending-Store Buffer (PSB) 203.
The PSB 203 deals only in doublewords. Within the PSB 203, a doubleword Error Correcting Code (ECC) is generated for the doubleword sent by the processor 200, and the (now protected) doubleword is buffered until the instruction that did the store operation has been completed.
Typically the ECC is a Single Error Correcting, Double Error Detecting (SECDED) code, which does just what it says: if a single bit is flipped, the ECC will be able to determine which bit it was, and it will correct it; if two bits are flipped, the ECC will be able to detect that the data is bad, but it will not be able to correct the data.
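The SECDED behavior can be demonstrated with an extended Hamming code, which is one common way to build such a code. The sketch below is illustrative only: the text does not say which SECDED code is used, so the bit layout, function names, and status strings are assumptions. For a 64-bit doubleword this construction uses 7 Hamming check bits plus 1 overall parity bit, i.e., 8 check bits in a 72-bit codeword.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(data, data_bits=64):
    """Return a list of bits; index 0 is the overall parity bit,
    power-of-two indices hold Hamming check bits, the rest hold data."""
    r = hamming_check_bits(data_bits)
    n = data_bits + r
    word = [0] * (n + 1)
    j = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):                 # not a power of two: data position
            word[pos] = (data >> j) & 1
            j += 1
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos
    for i in range(r):                      # choose check bits so syndrome is 0
        word[1 << i] = (syndrome >> i) & 1
    word[0] = sum(word) & 1                 # overall parity for double detection
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    n = len(word) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos
    odd_parity = sum(word) & 1
    if not odd_parity and syndrome:
        return None, "uncorrectable"        # two bits flipped: detect only
    status = "ok"
    if odd_parity:
        if syndrome > n:
            return None, "uncorrectable"
        word[syndrome] ^= 1                 # syndrome 0 means the parity bit itself
        status = "corrected"
    data, j = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= word[pos] << j
            j += 1
    return data, status


codeword = encode(0x0123456789ABCDEF)
single = list(codeword); single[37] ^= 1
double = list(codeword); double[5] ^= 1; double[60] ^= 1
print(len(codeword), decode(single)[1], decode(double)[1])
```

Flipping any one bit of the codeword yields the original data with status `corrected`, while flipping two bits yields `uncorrectable`, matching the two cases described above.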
When the store instruction is completed, the processor 200 notifies the PSB 203 that the stored data should be sent to the next cache level in the hierarchy (not shown). The PSB 203 sends a doubleword store request to the BIU 202, which will send the modified doubleword up to the next cache level in the hierarchy (not shown).
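The buffering behavior of the PSB can be sketched as a small queue model. This is a simplified illustration under stated assumptions: the class and method names, the instruction identifiers, in-order completion, and the `placeholder_ecc` function (which merely stands in for real ECC generation) are all hypothetical.

```python
from collections import deque


def merge_byte(doubleword: bytes, offset: int, value: int) -> bytes:
    """Merge one stored byte into the prefetched 8-byte doubleword."""
    return doubleword[:offset] + bytes([value]) + doubleword[offset + 1:]


def placeholder_ecc(doubleword: bytes) -> int:
    """Hypothetical stand-in for the doubleword ECC generator."""
    return sum(doubleword) & 0xFF


class PendingStoreBufferModel:
    """Holds protected doublewords until their store instruction completes."""

    def __init__(self, send_to_biu):
        self.send_to_biu = send_to_biu  # callback standing in for the BIU
        self.pending = deque()          # (instr_id, address, doubleword, ecc)

    def buffer_store(self, instr_id, address, doubleword):
        # The check code is generated on entry, so the data is protected
        # for as long as it waits in the buffer.
        ecc = placeholder_ecc(doubleword)
        self.pending.append((instr_id, address, doubleword, ecc))

    def complete(self, instr_id):
        # Assumes in-order completion: release the oldest entries that
        # belong to the instruction that has just completed.
        while self.pending and self.pending[0][0] == instr_id:
            _, address, doubleword, ecc = self.pending.popleft()
            self.send_to_biu(address, doubleword, ecc)


sent = []
psb = PendingStoreBufferModel(lambda a, d, e: sent.append((a, d, e)))
modified = merge_byte(bytes(8), 2, 0x7F)
psb.buffer_store(instr_id=1, address=0x100, doubleword=modified)
print(len(sent))   # nothing is sent before completion
psb.complete(1)
print(len(sent))   # one doubleword store request after completion
```

The point of the model is the ordering: the doubleword store request is issued to the BIU only after the storing instruction completes, never at the moment the byte is written into the local cache.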
Meanwhile, as was the case with the Store-In cache of FIG. 1, if a miss occurs in the Store-Through cache 201 of FIG. 2, the miss address and desired state (shared or exclusive) are sent to the BIU 202, which issues the miss request to the next cache level in the hierarchy. Since all stores have already been sent up into the hierarchy, there is no need to cast out any data; ergo, a Castout Buffer is not needed. The BIU 202 merely manages line misses and doubleword stores.
Note that in this case, there are two granularities of data that are used in the next cache level in the hierarchy. For misses, there are line-oriented fetch requests sent to the next level. These requests occur at the L1 cache 201 miss rate. And for every store issued by the processor 200, there is a doubleword store request sent to the next level in the hierarchy. Thus, the next cache level cannot be optimized for a single data granularity, since it must deal both with lines and with doublewords. Further, the next cache level is subjected to the full store-bandwidth of the processor 200.