One of the major challenges that traditional mainframe vendors face, as personal computers and workstations become more and more powerful, is differentiating their midrange systems from the rapidly advancing smaller machines. One significant area in which mainframe machines can be distinguished from the smaller machines is fault tolerance.
The problem of processor cache storage errors has been a nuisance throughout the history of the use of cache memories in mainframe systems. These errors, like main memory errors, can be caused by either alpha particle impact or transient (or hard) storage element failures. In the exemplary system in which the invention finds application, main memory single bit failures are hidden from the system by specialized hardware in the memory controller that corrects the bit in error before the word associated with the flaw is sent to the requesting unit. However, processor cache failures are not corrected during the cache read activity because correction hardware has not been designed into the processor, for a number of reasons such as the limited integrated circuit area available on the VLSI chips.
The advantages of processor cache memories greatly outweigh the complications that arise when they fail. Cache memories offer high-speed access to data and instructions that the processor would otherwise have to fetch from main memory on every reference. A cache memory access typically takes between 10% and 25% of the time required to access main memory; hence, cache memories have gained a permanent position in system data storage hierarchy design.
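The benefit described above can be illustrated with a short calculation. The following is a minimal sketch, assuming a simple two-level model in which a hit costs the cache access time and a miss costs the main memory access time. The function name and the hit ratio of 0.95 are hypothetical values chosen for illustration; only the 10% and 25% figures come from the text.

```python
def effective_access_time(hit_ratio, cache_fraction, memory_time=1.0):
    """Average access time for a simple two-level storage hierarchy.

    hit_ratio      -- fraction of references satisfied by the cache (assumed)
    cache_fraction -- cache access time as a fraction of memory access time
    memory_time    -- main memory access time, in arbitrary units
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    cache_time = cache_fraction * memory_time
    # Hits are served at cache speed; misses pay the main memory time.
    return hit_ratio * cache_time + (1.0 - hit_ratio) * memory_time


# The two ends of the 10% to 25% range, with an assumed 95% hit ratio.
for fraction in (0.10, 0.25):
    average = effective_access_time(hit_ratio=0.95, cache_fraction=fraction)
    print(f"cache at {fraction:.0%} of memory time: average {average:.4f}")
```

Under these assumptions the average access time falls to roughly 15% to 29% of the main memory access time, which is why the cache is retained despite the error handling complications.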
Any computer design effort that incorporates a cache memory into its central processor unit architecture must address the following progressively more difficult challenges:
1. It is imperative that the processor detect a cache error condition; otherwise data corruption would result. The least expensive solution to this problem is to do nothing more than hang or crash the system when this type of error occurs, but this approach is, as a practical matter, completely unacceptable as a mainframe response.
2. A sophisticated machine should support the deconfiguration of a failing cache storage element. By merely deconfiguring an isolated failing element, the processor can continue to execute without substantial performance loss. Cache memories are divided into blocks that encompass many cache storage elements. In the exemplary machine, the block size is 16 words (64 bytes). Cache memories can also be divided into coarser subdivisions such as levels; in this context, a level means a full column of blocks. The exemplary machine has been designed with logic that permits its cache blocks and cache levels to be individually deconfigured.
3. A truly sophisticated machine should ensure that if a processor cache error occurs, either the most recent copy of the block in error can be retrieved from main memory or the block in error can somehow be corrected. In the exemplary machine, an error correction code for correcting cache block single bit errors, invoked only during writes to main memory, has been implemented. But the design of this machine did not address explicit, "unnatural", correction of a specific cache block or restarting of the affected instruction. (The term "unnatural" is used in this context to indicate that it is required that the block be exchanged for correction even though the natural replacement algorithms would not dictate such an event at the time of the error.)
4. Store-into cache machines, such as the exemplary apparatus, are capable of operating extremely efficiently, but the very characteristic that makes them fast, the delaying of writes to main memory, becomes a liability when other processors are added to the system to further improve throughput. The multiple processor configuration leads to the ultimate challenge in cache error processing: that of handling cache operand block errors where the block in error resides in one processor's cache and is required by one or more other processors, and an "updated" copy of the block does not exist in the system's main memory. This problem, to which the present invention is addressed, is commonly referred to as the siphon error predicament. (Siphon is a term of art used to define the transfer of a cache block from one processor of a multi-processor system to either another processor or to an input/output unit.) The similar problem encountered in single processor systems is addressed by a related invention covered by U.S. patent application Ser. No. 07/708,420, filed on even date herewith, entitled FAULT TOLERANT COMPUTER SYSTEM, by David S. Edwards et al.
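The siphon error predicament described in the fourth challenge can be modeled in a few lines. The following is a simplified sketch and not the hardware of the exemplary machine: it assumes one word per block (the exemplary machine uses 16 words), a single parity bit as the detection mechanism, and hypothetical names (StoreIntoCache, SiphonError, siphon) introduced only for illustration.

```python
class SiphonError(Exception):
    """Raised when a modified block is in error and memory has no current copy."""


def parity(word):
    """Even-parity bit of an integer word (assumed detection scheme)."""
    return bin(word).count("1") & 1


class StoreIntoCache:
    """Hypothetical one-word-per-block store-into cache for one processor."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory: address -> word
        self.blocks = {}      # address -> [word, parity bit, modified flag]

    def write(self, address, word):
        # Store-into: only the cache is updated; main memory becomes stale.
        self.blocks[address] = [word, parity(word), True]

    def siphon(self, address):
        """Supply a block to another processor or I/O unit, checking it first."""
        word, stored_parity, modified = self.blocks[address]
        if parity(word) != stored_parity:
            if modified:
                # The only current copy in the system is the damaged one.
                raise SiphonError(f"block {address:#x} is in error and memory is stale")
            # Unmodified block: main memory still holds a valid copy.
            word = self.memory[address]
            self.blocks[address] = [word, parity(word), False]
        return word


memory = {0x40: 7}
cpu0 = StoreIntoCache(memory)
cpu0.write(0x40, 9)          # the modified copy exists only in cpu0's cache
cpu0.blocks[0x40][0] ^= 1    # simulate a single bit failure in the stored word

try:
    cpu0.siphon(0x40)        # a second processor requests the block
    outcome = "siphon succeeded"
except SiphonError as error:
    outcome = f"siphon failed: {error}"
print(outcome)
print("main memory still holds", memory[0x40])
```

In this model an unmodified block in error is recovered simply by refetching it from main memory, whereas a modified block in error leaves the requesting unit with no valid source for the data. That second case is the predicament.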
Some store-into cache prior art systems have handled the problem of processor cache errors by adding error correction hardware into each cache to correct an error as the data containing the error is read from cache. This is an effective, but expensive, solution to the problem.
A second prior art approach to solving the cache data correction and retry predicament incorporated a technique which masked the problem by implementing a store-through cache. (In store-through designs, when a cache block is updated, it is written to both the cache and immediately to main memory.) With this approach, whenever a fetch from cache is in error, the processor forces a cache bypass and issues a read-to-memory for the block which it will use both in instruction execution and to update (restore) the cache. The advantage of this solution is that the fetch from memory is identical to the cache miss condition such that the affected instruction is not impacted; therefore, all such errors can be recovered. This solution took advantage of the store-through design, which by definition provides the benefit of having main memory always up to date.
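The store-through recovery path described above can be sketched as follows. This is an illustrative model under stated assumptions, not the prior art implementation: blocks are a single word, errors are detected with one parity bit, and the class and method names are hypothetical.

```python
def parity(word):
    """Even-parity bit of an integer word (assumed detection scheme)."""
    return bin(word).count("1") & 1


class StoreThroughCache:
    """Hypothetical one-word-per-block store-through cache."""

    def __init__(self, memory):
        self.memory = memory  # main memory: address -> word
        self.blocks = {}      # address -> (word, parity bit)
        self.bypasses = 0     # number of reads forced to main memory

    def write(self, address, word):
        # Store-through: the cache and main memory are updated together.
        self.blocks[address] = (word, parity(word))
        self.memory[address] = word

    def read(self, address):
        entry = self.blocks.get(address)
        if entry is not None:
            word, stored_parity = entry
            if parity(word) == stored_parity:
                return word  # normal cache hit
        # A miss and a cache error take the same path: bypass the cache,
        # read main memory, and restore the cache from that copy.
        self.bypasses += 1
        word = self.memory[address]
        self.blocks[address] = (word, parity(word))
        return word


memory = {}
cache = StoreThroughCache(memory)
cache.write(0x10, 12)
cache.blocks[0x10] = (13, parity(12))  # simulate a single bit failure
value = cache.read(0x10)               # recovered from main memory
print(value, cache.bypasses)
```

Because main memory is always up to date in this design, the error path is indistinguishable from an ordinary miss, and the instruction that issued the read is unaffected.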
Store-into cache designs (commonly known as copy-back caches) are favored over store-through designs for performance oriented systems because they result in fewer processor-to-memory writes and hence less main memory traffic, which reduces bottlenecking at the system bus when a bus design is implemented. The store-into characteristic that leads to enhanced performance necessarily results in the cache often containing the only valid copy of a particular block of data in the system. That is, when a cache block has been modified, it is not immediately written back to main memory. Instead, it is held by the cache until requested by a second active unit (CPU or I/O Unit) or until the block must be replaced, at which time it is written back to main memory to make room in cache for a new block.
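The difference in main memory write traffic can be illustrated by counting writes for the same sequence of stores under each policy. The sketch below assumes a small fully associative cache with least-recently-used replacement; the replacement policy, the function name, and the sample store sequence are illustrative assumptions, and requests from a second active unit are not modeled.

```python
from collections import OrderedDict


def count_memory_writes(store_addresses, capacity, store_into):
    """Count main memory writes caused by a sequence of stores to block addresses."""
    cache = OrderedDict()  # block address -> modified flag, oldest first
    writes = 0
    for address in store_addresses:
        if address in cache:
            cache.move_to_end(address)  # mark as most recently used
        elif len(cache) >= capacity:
            _, modified = cache.popitem(last=False)  # replace the oldest block
            if store_into and modified:
                writes += 1  # modified block is written back on replacement
        if store_into:
            cache[address] = True   # held modified, in the cache only
        else:
            cache[address] = False  # cache copy always matches memory
            writes += 1             # every store also goes to main memory
    return writes


stores = [0, 0, 0, 1, 1, 2]
through = count_memory_writes(stores, capacity=2, store_into=False)
into = count_memory_writes(stores, capacity=2, store_into=True)
print("store-through writes:", through)
print("store-into writes:", into)
```

For this sequence the store-through policy issues one memory write per store, while the store-into policy writes only the one block that is replaced. The two blocks still resident at the end exist in modified form only in the cache, which is the exposure discussed above.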
It will be apparent to those skilled in the art that it would be highly desirable to achieve, in an alternative approach, the advantages of these prior art solutions to the processor cache error predicament without resorting to the expense and complexity associated with the prior art solutions.