This invention relates generally to computer memory, and more particularly to providing a high fault tolerant RAID memory system.
Computer systems often require a considerable amount of high speed RAM (random access memory) to hold information such as operating system software, programs and other data while a computer is powered on and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of 8 bits, although it may be composed of additional bits (e.g. 9, 10, etc.) when the byte also includes information for use in the identification and/or correction of errors. This binary information is normally loaded into RAM from nonvolatile storage (NVS) such as hard disk drives (HDDs) during power on and initial program load (IPL) of the computer system. The data is also paged-in from and paged-out to NVS during normal computer operation. In general, all the programs and information utilized by a computer system cannot fit in the smaller, more costly dynamic RAM (DRAM), and even if it did fit, the data would be lost when the computer system is powered off. At present, it is common for NVS systems to be built using a large number of HDDs.
Computer RAM is often designed with pluggable subsystems, often in the form of modules, so that incremental amounts of RAM can be added to each computer, dictated by the specific memory requirements for each system and application. The acronym, “DIMM” refers to dual in-line memory modules, which are perhaps the most prevalent memory module currently in use. A DIMM is a thin rectangular card comprising one or more memory devices, and may also include one or more of registers, buffers, hub devices, and/or non-volatile storage (e.g., erasable programmable read only memory or “EPROM”) as well as various passive devices (e.g. resistors and capacitors), all mounted to the card. In some instances, devices external to the physical memory module are still used to convey the information from the physical memory module to a memory controller, and failures of these devices can be naturally deemed failures of the memory module itself, as far as an error control code in the memory controller is concerned. Correspondingly, in the claims, a reference to the failure of a memory module applies not only to failures that are completely contained within a physical memory module, but more generally to failures that could affect a component that is required to convey information between the physical memory module and the memory controller.
DIMMs are often designed with dynamic memory chips or DRAMs that need to be regularly refreshed to prevent the data stored within them from being lost. Originally, DRAM chips were asynchronous devices, however contemporary chips, synchronous DRAM (SDRAM) (e.g. single data rate or “SDR”, double data rate or “DDR”, DDR2, DDR3, etc) have synchronous interfaces to improve performance. DDR devices are available that use pre-fetching along with other speed enhancements to improve memory bandwidth and to reduce latency. DDR3, for example, has a standard burst length of 8, where the term burst length refers to the number of DRAM transfers in which information is conveyed from or to the DRAM during a read or write.
Memory device densities have continued to grow as computer systems have become more powerful. Currently it is not uncommon to have the RAM content of a single computer be composed of hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause the entire computer system to fail. When memory errors occur, which may be “hard” (repeating) or “soft” (one-time or intermittent) failures, these failures may occur as single cell, multi-bit, full chip or full DIMM failures and all or part of the system RAM may be unusable until it is repaired. Repair turn-around-times can be hours or even days, which can have a substantial impact to a business dependent on the computer systems.
The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage in contemporary computers continues to grow.
Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity where the number of 1's or 0's in a data word are “exclusive or-ed” (XOR-ed) together to produce a parity bit. For example, a data word with an even number of 1's will have a parity bit of 0 and a data word with an odd number of 1's will have a parity bit of 1, with this parity bit data appended to the stored memory data. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking to see that it matches the stored (originally generated) parity.
Richard Hamming recognized that the parity technique could be extended to not only detect errors, but correct errors by appending an error correction code (ECC) field, to each code word. The ECC field is a combination of different bits in the word XOR-ed together so that errors (small changes to the data word) can be easily detected, pinpointed and corrected. The number of errors that can be detected and corrected are directly related to the length of the ECC field appended to the data word. The technique includes ensuring a minimum separation distance between valid code word combinations. The greater the number of errors desired to be detected and corrected, the longer the code word, thus creating a greater distance between valid code words. The smallest distance between valid code words is known as the minimum Hamming distance of the code.
These error detection and error correction techniques are commonly used to restore data to its original/correct form in noisy communication transmission media or for storage media where there is a finite probability of data errors due to the physical characteristics of the device. The memory devices generally store data as voltage levels representing a 1 or a 0 in RAM and are subject to both device failure and state changes due to high energy cosmic rays and alpha particles. Similarly, HDDs that store 1's and 0's as magnetic fields on a magnetic surface are also subject to imperfections in the magnetic media and other mechanisms that can cause changes in the data pattern from what was originally stored.
In the 1980's, RAM memory device sizes first reached the point where they became sensitive to alpha particle hits and cosmic rays causing memory bits to flip. These particles do not damage the device but can create memory errors. These are known as soft errors, and most often affect just a single bit. Once identified, the bit failure can be corrected by simply rewriting the memory location. The frequency of soft errors has grown to the point that it has a noticeable impact on overall system reliability.
Memory ECCs, like those proposed by Hamming, use a combination of parity codes in various bit positions of the data word to allow detection and correction of errors. Every time data words are written into memory, a new ECC word needs to be generated and stored with the data, thereby allowing detection and correction of the data in cases where the data read out of memory includes an ECC code that does not match a newly calculated ECC code generated from the data being read.
The first ECCs were applied to RAM in computer systems in an effort to increase fault-tolerance beyond that allowed by previous means. Binary ECC codes were deployed that allowed for double-bit error detection (DED) and single-bit error correction (SEC). This SEC/DED ECC also allows for transparent recovery of single bit hard errors in RAM.
Scrubbing routines were also developed to help reduce memory errors by locating soft errors through a complement/re-complement process so that the soft errors could be detected and corrected.
Some storage manufacturers have used advanced ECC techniques, such as Reed-Solomon codes, to correct for full memory chip failures. Some memory system designs also have standard reserve memory chips (e.g. “spare” chips) that can be automatically introduced in a memory system to replace a faulty chip. These advancements have greatly improved RAM reliability, but as memory size continues to grow and customers' reliability expectations increase, further enhancements are needed. There is the need for systems to survive a complete DIMM failure and for the DIMM to be replaced concurrent with system operation. In addition, other failure modes must be considered which affect single points of failure between the connection between one or more DIMMs and the memory controller/embedded processor. For example, some of the connections between the memory controller and the memory device(s) may include one or more intermediate buffer(s) that may be external to the memory controller and reside on or separate from the DIMM, however upon its failure, may have the effect of appearing as a portion of a single DIMM failure, a full DIMM failure, or a broader memory system failure.
Although there is a clear need to improve computer RAM reliability (also referred to as “fault tolerance”) by using even more advanced error correction techniques, attempts to do this have been hampered by impacts to available customer memory, performance, space, heat, etc. Using redundancy by including extra copies (e.g. “mirroring”) of data or more sophisticated error coding techniques drives up costs, adds complexity to the design, and may impact another key business measure: time-to-market. For example, the simple approach of memory mirroring has been offered as a feature by several storage manufacturing companies. The use of memory mirroring permits systems to survive more catastrophic memory failures, but acceptance has been very low because it generally requires a doubling of the memory size on top of the base SEC/DEC ECC already present in the design, which generally leaves customers with less than 50% of the installed RAM available for system use.
ECC techniques have been used to improve availability of storage systems by correcting HDD failures so that customers do not experience data loss or data integrity issues due to failure of an HDD, while further protecting them from more subtle failure modes.
Some suppliers of storage systems have used redundant array of independent disks (RAID) techniques successfully to improve availability of HDDs to computer RAM. In many respects it is easier to recover from a HDD failure using RAID techniques because it is much easier to isolate the failure in HDDs than it is in RAM. HDDs often have embedded checkers such as ECCs to detect bad sectors. In addition, cyclic redundancy checks (CRCs) and longitudinal redundancy checks (LRCs) may be embedded in HDD electronics or disk adapters, or they may be checkers used by higher levels of code and applications to detect HDD errors. CRCs and LRCs are written coincident with data to help detect data errors. CRCs and LRCs are hashing functions used to produce a small substantially unique bit pattern generated from the data. When the data is read from the HDD, the check sum is regenerated and compared to that stored on the platter. The signatures must match exactly to ensure the data retrieved from the magnetic pattern encoded on the disk is as was originally written to the disk.
RAID systems have been developed to improve performance and/or to increase the availability of disk storage systems. RAID distributes data across several independent HDDs. There are many different RAID schemes that have been developed each having different characteristics, and different pros and cons associated with them. Performance, availability, and utilization/efficiency (the percentage of the disks that actually hold customer data) are perhaps the most important. The tradeoffs associated with various schemes have to be carefully considered because improvements in one attribute can often result in reductions in another.
RAID-0 is striping of data across multiple HDDs to improve performance. RAID-1 is mirroring of data, keeping 2 exact copies of the data on 2 different HDDs to improve availability and prevent data loss. Some RAID schemes can be used together to gain combined benefits. For example, RAID-10 is both data striping and mirroring across several HDDs in an array to improve both performance and availability.
RAID-3, RAID-4 and RAID-5 are very similar in that they use a single XOR check sum to correct for a single data element error. RAID-3 is byte-level striping with dedicated parity HDD. RAID-4 uses block level striping with a dedicated parity HDD. RAID-5 is block level striping like RAID-4, but with distributed parity. There is no longer a dedicated parity HDD. Parity is distributed substantially uniformly across all the HDDs, thus eliminating the dedicated parity HDD as a performance bottleneck. The key attribute of RAID-3, RAID-4 and RAID-5 is that they can correct a single data element fault when the location of the fault can be pinpointed through some independent means.
There is not a single universally accepted industry-wide definition for RAID-6. In general, RAID-6 refers to block or byte-level striping with dual checksums. An important attribute of RAID-6 is that it allow for correction of up to 2 data element faults when the faults can be pinpointed through some independent means. It also has the ability to pinpoint and correct a single failure when the location of the failure is not known.
FIG. 1 depicts a contemporary system composed of an integrated processor chip 100, which contains one or more processor elements and an integrated memory controller 110. In the configuration depicted in FIG. 1, multiple independent cascade interconnected memory interface busses 106 are logically aggregated together to operate in unison to support a single independent access request at a higher bandwidth with data and error detection/correction information distributed or “striped” across the parallel busses and associated devices. The memory controller 110 attaches to four narrow/high speed point-to-point memory busses 106, with each bus 106 connecting one of the several unique memory controller interface channels to a cascade interconnect memory subsystem 103 (or memory module, e.g., a DIMM) which includes at least a hub device 104 and one or more memory devices 109. Some systems further enable operations when a subset of the memory busses 106 are populated with memory subsystems 103. In this case, the one or more populated memory busses 108 may operate in unison to support a single access request.
FIG. 2 depicts a memory structure with cascaded memory modules 103 and unidirectional busses 106. One of the functions provided by the hub devices 104 in the memory modules 103 in the cascade structure is a re-drive function to send signals on the unidirectional busses 106 to other memory modules 103 or to the memory controller 110. FIG. 2 includes the memory controller 110 and four memory modules 103, on each of two memory busses 106 (a downstream memory bus with 24 wires and an upstream memory bus with 25 wires), connected to the memory controller 110 in either a direct or cascaded manner. The memory module 103 next to the memory controller 110 is connected to the memory controller 110 in a direct manner. The other memory modules 103 are connected to the memory controller 110 in a cascaded manner. Although not shown in this figure, the memory controller 110 may be integrated in the processor 100 and may connect to more than one memory bus 106 as depicted in FIG. 1.
There is a need in the art to improve failure detection and correction in memory systems. It would be desirable for a memory system to be able to survive a complete DIMM failure and for the DIMM to be replaced concurrent with system operation. It would be desirable to provide this improved level of serviceability along with the ability to substitute a spare memory device for a failing memory device concurrent with system operation. It would also be desirable to survive the failure of any device that communicates information from the memory controller to a DIEM and vice versa.