This invention relates generally to computer memory and more particularly, to heterogeneous recovery in a redundant memory system.
Redundant array of independent memory (RAIM) systems have been developed to improve performance and/or to increase the availability of storage systems. RAIM distributes data across several independent memory channels (e.g., made up of memory modules each containing one or more memory devices). There are many different RAIM schemes that have been developed each having different characteristics, and different pros and cons associated with them. Performance, availability, and utilization/efficiency (the percentage of the disks that actually hold customer data) are perhaps the most important. The tradeoffs associated with various schemes have to be carefully considered because improvements in one attribute can often result in reductions in another.
With the movement in high speed memory systems towards the use of differential drivers, the number of logical bus wires has been effectively cut in half. This makes the use of error correction code (ECC) protection across multiple channels of a memory more expensive as the use of ECC causes an either further reduction in the number of bits of data that are transferred in each packet or frame across the channel. An alternative is the use of CRC on channel busses to detect errors. However, since CRC is detectable but not correctable at the bus-level, soft or hard errors detected on the busses require a retry of the failing operations at the bus level. Typically, this means retrying fetches and retrying stores to memory.
For stores, the buffers containing the store data merely have to hold the data until it is certain that the data has been stored. The store commands and data can be resent to the memory interface.
For fetches, the line of data can merely be refetched from memory. However, consideration has to be given to the various recovery scenarios. For instance, if a double line of data (e.g., 256 bytes) is required from memory but ECC is only across a quarter of a line (e.g., 64 bytes), consideration must be given to the error scenarios. If the error occurs on the first 64 bytes, the data can be refetched and the entire 256 byte line can be delayed by the recovery time. However, if there is no error until the third quarter line is fetched, a decision has to be made about how to handle the first half of the line. For latency reasons, it may be advantageous to send the quarter lines as they are fetched. However, this means that any error on a quarter line will cause a gap while waiting for that quarter line. If the hardware does not have separate address/protocol tags for each quarter line, then there will be gaps on the fetch data, and the system may not be designed to handle gaps on the fetch data. One approach to avoid the gaps is delay the entire line until all the ECC is clean. A drawback to this approach is that it would cause undue latency on the line that would have to be incurred on all lines, not just those with errors.
Some recovery steps (e.g., data recalibration and clock recalibration) may take so long that they could hang system operations. For instance, the time it takes to spare a clock lane, recalibrate, and lock a phase locked loop (PLL) could be on the order of 10 ms. However, some systems can only tolerate hang periods of less than 1 ms.
Accordingly, and while existing techniques for dealing with recovery in a memory system may be suitable for their intended purpose, there remains a need in the art for error recovery schemes in a memory system that overcome these drawbacks of introducing fetch gaps while also avoiding additional latency caused by speculation in the recovery of errors and also avoiding system hangs introduced during long recovery sequences.