Exemplary embodiments relate to memory systems, and more particularly to methods, systems, and computer program products for recovering from memory failures without the high overhead associated with schemes such as memory mirroring.
Mirroring, such as in a redundant array of independent disks (RAID), has long been used in computer designs with hard disk drives (HDDs) to improve overall computer system availability, and memory mirroring applies the same principle to system memory.
Mirroring improves availability by storing two copies of the data, each a mirrored image of the other, so that in the event of a failure the data can be recovered by using the good mirrored copy. Accordingly, it is important to be able to detect and pinpoint data errors in order to know that a mirrored copy should be used. Mirroring is very powerful in that it enables a system to recover from even fairly catastrophic memory failures. Recovery from a full DIMM failure, or even from failures of greater sections of the computer system memory, can be achieved so long as the computer system can detect and pinpoint the failure and the still-functional part of the memory can be accessed to retrieve the data from the mirrored copy. If these conditions hold true, the computer system can recover from the failure and continue normal operation.
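The read path described above, in which a copy of the data is trusted only after its stored check value verifies and the mirror is consulted otherwise, can be sketched as follows. The class and function names are illustrative assumptions, and a toy additive checksum stands in for the ECC or parity schemes discussed later:

```python
def checksum(data: bytes) -> int:
    """Toy additive checksum; real designs use ECC or parity bits."""
    return sum(data) & 0xFF

class MirroredMemory:
    """Keeps two copies of every line; each copy carries its own checksum
    so a failed copy can be pinpointed and the good one used instead."""

    def __init__(self) -> None:
        self.primary = {}  # address -> (data, stored checksum)
        self.mirror = {}

    def write(self, addr: int, data: bytes) -> None:
        record = (data, checksum(data))
        self.primary[addr] = record
        self.mirror[addr] = record

    def read(self, addr: int) -> bytes:
        for copy in (self.primary, self.mirror):
            data, stored = copy[addr]
            if checksum(data) == stored:  # recalculated matches stored?
                return data
        # An aligned secondary fault has corrupted both copies.
        raise RuntimeError("both mirrored copies are bad: unrecoverable")
```

Note that the check value is what makes the mirror useful: without it, the system would know only that the two copies disagree, not which one to trust.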
When computer systems are designed to allow memory mirroring, they are sometimes also designed with concurrent repair capability to avoid the down time associated with a scheduled repair. Without concurrent repair, a system with memory mirroring can survive many types of memory failures, but the system has to be powered down at some point in time to replace the defective memory and restore the system to full capability. If, before the repair, a secondary memory fault is encountered that aligns with the first memory failure, the combination of both failures could take out both copies of the data and cause an unscheduled computer system outage. Systems designed with concurrent repair capability allow a failed section of memory to be replaced during run time, that is, during normal system operation. Once the failed portion of memory is replaced, a mirrored copy of the data is rewritten to the new memory, restoring the data copy and thus allowing the system to regain full recovery capability.
Nevertheless, as with most engineering problems, improving one system attribute, such as system availability, requires losing or trading off capability in another area. Mirroring is no exception. The substantial availability gains that are realized with memory mirroring reduce the usable memory area by more than 50%. This is easy to see in that the mirrored copy of the data requires that half of the available system memory space be used to hold the copy. In addition to the overhead to store the data copy, some mechanism is required to detect errors, know which copy has the error, and pinpoint the error. Many different detection mechanisms have been devised, such as detection bits, error correction codes (ECC), or simple parity. These checker bits are associated with smaller sections of memory, such as words or cache lines. The checksums are calculated across these smaller sections of memory and stored with the data. When the data is accessed, the checksums are recalculated and compared to the stored checksums. If the stored and recalculated checksums match, the data is assumed to be good; if they do not match, the data is assumed to be bad. Normally, these schemes do not provide 100% detection of all bit pattern failures, but the detection accuracy is usually high. In this way, most memory failures can be pinpointed and the mirrored copy of the data can be used to recover from the failure. Simply knowing that one copy of the data does not match the other is insufficient; it must also be known which mirrored copy contains the error. Thus, the usable memory area for mirroring is less than 50% of the physical memory capacity.
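As a concrete sketch of the simplest of these mechanisms, a single even-parity bit per memory word can be computed on the write path, stored with the data, and rechecked on access. The function names are illustrative assumptions. The example also shows why such schemes fall short of 100% detection: an even number of bit flips in the same word cancels out and escapes a single parity bit.

```python
def parity_bit(word: int) -> int:
    """Even parity: the checker bit makes the total count of 1-bits even."""
    return bin(word).count("1") & 1

def store(word: int) -> tuple:
    """Write path: calculate the checker bit and keep it with the data."""
    return word, parity_bit(word)

def verify(word: int, stored_parity: int) -> bool:
    """Read path: recalculate the parity and compare to the stored bit."""
    return parity_bit(word) == stored_parity
```

A single-bit fault changes the recalculated parity and is caught, while a double-bit fault restores even parity and goes undetected, which is why stronger codes such as ECC are used where higher accuracy is needed.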
Computer system memory is still fairly expensive, with a far higher cost per megabyte than hard disk drives (HDDs), so memory mirroring, when offered as a customer-selectable feature, has not been widely adopted. With a relatively high cost and total computer memory size continuing to grow (a single large computer system can now have over a terabyte of memory), it is not surprising that few if any customers elect to use memory mirroring as a feature.
Some companies have more recently begun to offer simple Reed-Solomon error correction schemes that can handle greater numbers of adjacent bit failures, but most of these cannot recover from a full dual in-line memory module (DIMM) failure. A DIMM is a thin rectangular card with several memory chips mounted on it. DIMMs are often designed with dynamic memory chips that need to be regularly refreshed to prevent the data they hold from being lost. Unfortunately, as the overall performance of computer systems continues to improve by pushing the limits of memory technology relative to bit density, access time, cost, and temperature, the likelihood of experiencing more catastrophic memory failures continues to increase proportionately.
In addition to simple Reed-Solomon error correction schemes, there are also RAID memory offerings that have been designed to handle a full DIMM failure. However, while not as significant as with mirroring, these schemes too can require a fairly large overhead. The impact to usable memory space can easily be 30% or more, and flexibility is often lost in that it can be difficult to have a common design that can be easily extended to accommodate changes in the underlying memory technologies. As memory chips continue to evolve from DDR to DDR2 to DDR3, as x4 or x8 chips are used, and as cache line size varies, a completely new RAID memory design may be required.
Another very important computer system attribute that can easily be overlooked is that not all memory failures are equal. Some memory failures may not matter at all if the portion of memory where the failure is experienced is not being used to store critical data. For example, the memory might contain old data, or that section of memory may simply not yet have been used. The data stored in memory must be read for an error to be detected, and scrubbing routines exist today that do exactly that: they read unused sections of memory to attempt to detect and deal with memory faults before critical data is stored in those locations. Reading this unimportant data allows an error to be detected and dealt with before the location holds critical information.
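A scrubbing pass of the kind described can be sketched as a loop that reads every line, whether in use or not, and recompares its stored check value, so that latent faults surface before critical data lands on a bad location. The data layout and names are illustrative assumptions, with a toy additive checksum standing in for ECC:

```python
def checksum(data: bytes) -> int:
    """Toy additive checksum; a real system would use ECC or parity."""
    return sum(data) & 0xFF

def scrub(memory: dict) -> list:
    """Walk all of memory, rereading each line and recomputing its
    checksum; return the addresses whose stored checksum no longer
    matches. `memory` maps address -> (data, stored checksum)."""
    bad_addresses = []
    for addr, (data, stored) in memory.items():
        if checksum(data) != stored:
            bad_addresses.append(addr)  # flag the fault for handling
    return bad_addresses
```

In a real system the flagged locations would be repaired, remapped, or retired before the operating system allocates them to an application.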
Other memory failures might impact just a single application program and thus may have only a minor impact on overall computer system operation. Large servers and mainframes, for example, may have hundreds of users, with only a small number using the particular application in the specific section of memory where the memory fault is encountered. These types of memory faults do not impact the full set of users; in some cases, they may impact only a single user.
Still other memory failures might cause errors in a key application, such as a database application, which could impact many or perhaps even all users. Other failures might take down an operating system and thus impact all the users associated with that operating system image. Yet other failures, in a large logically partitioned system for example, can take out multiple operating system images and might bring down the entire system, affecting all applications and users.
Understanding the scope of the failure is important because recovering from the more minor errors might simply require the application to be rebooted, which can be done without affecting the remaining running applications or users. The vast majority of users will have no indication that a failure has even occurred during the recovery process. On the other hand, if the entire system has to be rebooted, everyone is affected, and if a database has to be restored, recovery can be a long, time-consuming process.
It would be beneficial to have methods, systems, and computer program products to recover from memory failures without the high overhead.