1. Technical Field
The present invention is generally directed to an improved computing system. More specifically, the present invention is directed to a mechanism for coordinating dynamic memory deallocation with redundant bit line steering to correct errors in memory.
2. Description of Related Art
Protecting a system from memory errors becomes increasingly important as the total amount of memory in the system increases. Different techniques have been used to increase the overall reliability of a system in the face of memory errors. Generally, these techniques fall into one of three main categories: tolerating the memory fault (i.e., the error), fixing the memory fault, and avoiding the memory fault.
Several techniques can be used to tolerate memory faults in a system. One such technique is the use of an error correcting code (ECC) memory. An ECC memory is a memory system that tests for and corrects errors automatically, very often without the operating system or the user being aware of it. When data are written into memory, ECC circuitry generates check bits from the binary sequences in the bytes and stores them in an additional seven bits of memory for 32-bit data paths or eight bits for 64-bit data paths (other ECCs may use 12 or 16 bits, for example). When data are retrieved from memory, the check bits are recomputed to determine whether any of the data bits have been corrupted. Such systems can typically detect and automatically correct errors of one bit per word and can detect, but not correct, errors of more than one bit. A memory word that is protected with ECC is referred to herein as an ECC word.
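The check-bit counts and the correct-one, detect-two behavior described above can be illustrated with a short Python sketch. It assumes an extended Hamming (SEC-DED) code, which is one common ECC and not necessarily the code used in any particular memory system; the function names are hypothetical and chosen only for illustration.

```python
def secded_check_bits(data_bits):
    """Check bits needed by an extended Hamming (SEC-DED) code."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit adds double-error detection


def ecc_encode(data):
    """Encode a list of data bits into an ECC word (list of bits)."""
    r = secded_check_bits(len(data)) - 1
    n = len(data) + r
    word = [0] * (n + 1)  # index 0: overall parity; 1..n: Hamming positions
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            word[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        word[p] = sum(word[pos] for pos in range(1, n + 1) if pos & p) % 2
    word[0] = sum(word) % 2
    return word


def ecc_decode(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    n = len(word) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2
    fixed = list(word)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        fixed[syndrome] ^= 1  # syndrome 0 means the parity bit itself flipped
        status = "corrected"
    else:
        return "uncorrectable", None
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


print(secded_check_bits(32), secded_check_bits(64))
```

Under this assumed code, a 32-bit data path needs seven check bits and a 64-bit data path needs eight, matching the figures given above.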
Another technique is bit-scattering, sometimes known as Chipkill detection and correction. Bit-scattering is a technique of allocating bits within an ECC word such that any given ECC word contains no more than one bit from a given memory module. This technique ensures that even a catastrophic failure of a memory module, while it may cause multiple ECC words to have a correctable error, cannot by itself result in an unrecoverable memory error.
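As a rough illustration of bit-scattering, the following Python sketch stores bit i of every ECC word in module i and then models a catastrophic module failure. The data layout, the failure model (every bit in the failed module reads as 0), and all names are assumptions made for this example only.

```python
def store_scattered(words):
    """Return modules[m][w]: bit m of ECC word w is held by module m."""
    width = len(words[0])
    return [[word[m] for word in words] for m in range(width)]


def load_scattered(modules, w):
    """Reassemble ECC word w by taking one bit from each module."""
    return [module[w] for module in modules]


def fail_module(modules, m):
    """Model a catastrophic failure: every bit in module m reads as 0."""
    modules[m] = [0] * len(modules[m])


words = [
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 1],
    [0, 1, 1, 1, 1, 0, 0, 1],
]
modules = store_scattered(words)
fail_module(modules, 2)

# Count the corrupted bits seen in each ECC word after the module failure.
errors_per_word = [
    sum(a != b for a, b in zip(load_scattered(modules, w), words[w]))
    for w in range(len(words))
]
print(errors_per_word)
```

Because each module supplies only one bit to any ECC word, no word in this sketch can see more than one bad bit from a single module failure, which keeps every word within the reach of single-bit correction.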
These techniques, while they correct the data actually used by the system, do not eliminate the faults at the memory module level. That is, with these techniques, a system that experienced a catastrophic memory module failure would constantly have a correctable error in each ECC word to which the failed module contributes. A fault in any other module contributing to any of these ECC words would then result in an uncorrectable error.
Another technique for tolerating memory faults is memory mirroring. Memory mirroring requires a system to have twice as much memory as will logically be seen by the operating system. Each memory write is actually sent to two different ECC words in separate memory hardware. An “uncorrectable error” in an ECC word is not uncorrectable in such a system because the word with the error can be refetched from the redundant ECC word. This technique gives very high tolerance to errors, but it is an expensive approach, especially for systems having a large amount of memory.
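A minimal Python sketch of the mirroring behavior described above follows. The class, the sentinel standing in for an ECC word that failed correction, and the dictionary-based storage are all hypothetical simplifications rather than a description of any real memory controller.

```python
# Sentinel standing in for an ECC word whose decode reported an
# uncorrectable error (an assumption of this sketch).
UNCORRECTABLE = object()


class MirroredMemory:
    """Every write goes to two copies held in separate (modeled) hardware."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, addr, value):
        self.primary[addr] = value
        self.mirror[addr] = value

    def read(self, addr):
        value = self.primary[addr]
        if value is UNCORRECTABLE:
            # Refetch the word from the redundant copy.
            value = self.mirror[addr]
            if value is UNCORRECTABLE:
                raise MemoryError("both copies are uncorrectable")
        return value


mem = MirroredMemory()
mem.write(0x10, 42)
mem.primary[0x10] = UNCORRECTABLE  # inject a fault into one copy
print(mem.read(0x10))
```

The sketch also makes the cost visible: two full copies are kept for every address that the operating system sees.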
If a memory error is a random soft event, i.e. a fixable event, such as one caused by an alpha particle or a cosmic ray, it is possible to fix the memory fault so that it is not encountered again. This can be done when a correctable memory error is encountered in the computing system. It can also be done proactively, before the memory with the fault is accessed by the operating system or system firmware.
The most common technique for fixing a memory fault of this type is to perform memory scrubbing. Memory scrubbing is a technique for proactively correcting soft event memory faults. Memory scrubbing involves reading memory in a system, looking for an error, and writing back good “ECC corrected” data when an error is found.
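The read, check, and write-back cycle described above can be sketched in Python as follows. The sketch assumes a Hamming(7,4) single-error-correcting code purely to keep the example small; real memories use wider ECC words, and the function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits as a Hamming(7,4) word; index 0 is unused."""
    w = [0] * 8
    w[3], w[5], w[6], w[7] = data
    for p in (1, 2, 4):
        w[p] = sum(w[i] for i in range(1, 8) if i & p) % 2
    return w


def syndrome(word):
    """Return 0 for a clean word, else the position of the flipped bit."""
    s = 0
    for i in range(1, 8):
        if word[i]:
            s ^= i
    return s


def scrub(memory):
    """One scrub pass: read every word and write back corrected data."""
    fixed = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            corrected = list(word)
            corrected[s] ^= 1        # ECC-corrected data
            memory[addr] = corrected  # write the good data back
            fixed += 1
    return fixed


memory = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
memory[1][6] ^= 1  # inject a soft error
print(scrub(memory), scrub(memory))
```

A second pass finds nothing to fix, which reflects the point made above: once a soft fault has been scrubbed, it is not encountered again.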
Memory scrubbing can be accomplished by hardware in the background of the operating system during system operation. With such a technique, all of the memory in the system can be scrubbed regardless of how the memory is used by any software layer. Ideally, scrubbing can be performed without a performance impact. Some hardware scrubbing mechanisms may also be superior to software techniques in that they can tolerate encountering uncorrectable errors when reading memory during a scrub cycle and can potentially fix one of the bits in the uncorrectable error before system software ever accesses the faulty ECC word.
However, if a system has an uncorrectable memory error, it is vital that the system have a mechanism for avoiding the memory fault. If the fault remains in the system memory, it is worthwhile to avoid the fault even when it is correctable, because a later correctable error in the same ECC word could align with it and result in an uncorrectable error. Techniques for avoiding a memory fault include redundancy and deallocation.
Redundancy is perhaps the best mechanism for avoiding a memory fault and involves substituting good memory for the faulty memory. This requires that there be some amount of redundant memory available. From a protection point of view, the best case is to have full memory redundancy. In systems with full memory redundancy, each memory write can be mirrored to a redundant module allowing complete memory protection even for uncorrectable errors. Having full memory redundancy, however, is the most expensive technique for providing memory protection and is often not practical in large system environments where memory becomes too expensive to completely duplicate for protection purposes.
Other schemes for redundancy allow for some extra memory to be included in the system and used when needed. One such technique is redundant bit steering, or redundant bit line steering. Redundant bit steering presumes having at least one spare memory bit in a memory module. In this scheme, a memory module with a bad system memory bit could have the bit excluded from an ECC word and replaced with a system memory bit from a spare memory module. Having an entire spare memory module ensures that a catastrophic failure of a memory module could be entirely repaired by replacing each system bit with that from the spare memory module.
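The steering idea described above can be sketched in Python as a routing table that maps each logical bit of an ECC word to a physical bit line, with one spare line held in reserve. The class, the single-spare limit, and the stuck-at-0 fault model are assumptions of this sketch, not details taken from any particular hardware.

```python
class SteeredMemory:
    """ECC words of `width` bits stored on width + 1 bit lines (one spare)."""

    def __init__(self, width, depth):
        self.width = width
        self.lines = [[0] * depth for _ in range(width + 1)]
        self.route = list(range(width))  # logical bit -> physical bit line
        self.stuck = set()               # physical lines that always read 0
        self.spare_used = False

    def write(self, addr, bits):
        for i, bit in enumerate(bits):
            self.lines[self.route[i]][addr] = bit

    def read(self, addr):
        return [
            0 if self.route[i] in self.stuck else self.lines[self.route[i]][addr]
            for i in range(self.width)
        ]

    def steer(self, bad_bit):
        """Replace the line behind logical bit `bad_bit` with the spare."""
        if self.spare_used:
            raise RuntimeError("no spare bit line left")
        # A real implementation would also migrate ECC-corrected data
        # onto the spare line; this sketch leaves that to the caller.
        self.route[bad_bit] = self.width
        self.spare_used = True


mem = SteeredMemory(width=8, depth=4)
mem.stuck.add(3)              # bit line 3 fails
mem.write(0, [1] * 8)
before = mem.read(0)          # bit 3 reads back wrong
mem.steer(3)
mem.write(0, [1] * 8)
after = mem.read(0)           # all bits read back correctly
print(before, after)
```

Note that the word width seen by the rest of the system is unchanged after steering; only the routing behind one bit differs.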
Absent actual redundancy, another mechanism for avoiding a memory fault is to not let the system make use of the memory with the fault, i.e. deallocating the memory that has the fault. This is known as deallocation of memory and is typically done only in hardware when a system is restarted. Alternatively, deallocation of memory may be performed in software dynamically during system operation with the cooperation of the operating system.
Dynamic deallocation may be performed by making all of the memory available to the operating system while communicating to the operating system which portion of the memory should be avoided. This is typically done in terms of memory “pages,” where a memory page is a fixed-size collection of memory words at successive memory addresses. Thus, the deallocation of memory pages is referred to as memory page deallocation, or as dynamic memory page deallocation if it can be done during system operation, when a memory fault is detected, without having to restart the computing system or operating system.
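A minimal Python sketch of page-granular deallocation follows. The 4096-byte page size, the class, and its method names are assumptions for illustration, and the sketch only handles pages that are currently free; an operating system would also have to migrate the contents of a page that is in use.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageAllocator:
    """Tracks free pages and pages withdrawn from use because of faults."""

    def __init__(self, total_bytes):
        self.free = set(range(total_bytes // PAGE_SIZE))
        self.deallocated = set()

    def report_fault(self, phys_addr):
        """Withdraw the page containing the faulty address from use."""
        page = phys_addr // PAGE_SIZE
        self.deallocated.add(page)
        self.free.discard(page)
        return page

    def allocate(self):
        """Hand out the lowest-numbered free page."""
        page = min(self.free)
        self.free.remove(page)
        return page


alloc = PageAllocator(total_bytes=16 * PAGE_SIZE)
bad_page = alloc.report_fault(0x2ABC)  # fault reported at this address
handed_out = [alloc.allocate() for _ in range(len(alloc.free))]
print(bad_page, handed_out)
```

Only one page out of sixteen is lost in this example, which illustrates why page-level deallocation can remove less memory than deconfiguring hardware.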
Memory page deallocation may provide advantages over simply deallocating memory at the hardware level. Generally, memory page deallocation allows a smaller amount of memory to be deallocated than can be deconfigured at the hardware level. Hardware deallocation of memory can also affect the way the different ECC words are interleaved at an address level, and this may affect the performance of the computing system.
Generally, redundant bit steering is implemented to remove errors that are more severe than single cell, i.e. single bit, faults. Redundant bit steering is superior to dynamic memory page deallocation in that the faulty memory is removed from use without changing the amount of memory available to the computing system. This allows memory with such a fault to be repaired, possibly even without requiring that the memory be removed from the system.
Using dynamic memory page deallocation to remove an entire failed memory module results in a large number of memory pages being deallocated and may be practically prohibitive. Certainly any such memory repair, if attempted, would be considered temporary only, and the failed memory would need to be replaced. On the other hand, removing the occasional single cell fault generally has very little impact on the overall performance of the computing system and may be tolerated permanently in a system.
In the prior art, redundant bit steering and dynamic memory page deallocation are mutually exclusive. That is, a system either uses redundant bit steering or dynamic memory page deallocation. There is no known system that provides a combined approach to avoiding memory faults. However, the availability of a computing system may be improved by combining both approaches. Therefore, it would be beneficial to have a method and apparatus for coordinating dynamic memory page deallocation with redundant bit steering to avoid memory faults.