The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely sophisticated devices, and computer systems may be found in many different settings. Computer systems typically include a combination of hardware, such as semiconductors and circuit boards, and software, also known as computer programs. As advances in semiconductor processing and computer architecture push the performance of the computer hardware higher, more sophisticated and complex computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Today's powerful computer systems often include a large amount of memory. Protecting a system from memory errors becomes increasingly important as the total amount of memory in the system grows. Different techniques have been used to increase the overall reliability of a system in the face of memory errors. Generally, these techniques can be categorized into one of three main areas: tolerating a correctable memory error, fixing a correctable memory error, and avoiding an uncorrectable memory error.
Several techniques can be used to tolerate correctable memory errors in a system. One such technique is the use of an error correcting code (ECC) memory. An ECC memory is a memory system that tests for and corrects errors automatically, very often without the operating system or the user being aware of the error or the correction. When data is written into memory, ECC circuitry generates checksums from the binary sequences in the bytes and stores them in an additional seven bits of memory for 32-bit data paths or eight bits for 64-bit data paths (other ECCs may use 12 or 16 bits, for example). When the data is retrieved from memory, the checksum is recomputed to determine whether any of the data bits have been corrupted. Such systems can typically detect and automatically correct errors of one bit per word and can detect, but not correct, errors of more than one bit. A memory word that is protected with ECC is referred to herein as an ECC word.
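The check-bit counts and the correct-one, detect-two behavior described above can be illustrated with a short sketch. The sketch assumes a Hamming code extended with one overall parity bit (commonly called SECDED); the text above does not name a specific code, so this choice, and the function names `check_bits_needed`, `encode`, and `decode`, are illustrative assumptions only.

```python
def check_bits_needed(data_bits: int) -> int:
    """Check bits for an assumed Hamming SECDED code over `data_bits` data bits."""
    r = 0
    while (1 << r) < data_bits + r + 1:  # Hamming bound for single-error correction
        r += 1
    return r + 1  # one extra overall-parity bit enables double-error detection


def encode(data: list[int]) -> list[int]:
    """Build an ECC word: index 0 holds overall parity, 1..n are Hamming positions."""
    r = check_bits_needed(len(data)) - 1
    n = len(data) + r
    word = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so this is a data position
            word[pos] = next(bits)
    for i in range(r):
        p = 1 << i  # check bit p covers every position whose index has bit p set
        word[p] = sum(word[pos] for pos in range(1, n + 1) if pos & p) % 2
    word[0] = sum(word[1:]) % 2  # make the parity of the whole word even
    return word


def decode(word: list[int]) -> tuple[str, list[int]]:
    """Recompute the check bits on read and classify the word."""
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        return "ok", list(word)
    if odd_parity and syndrome < len(word):
        fixed = list(word)
        fixed[syndrome] ^= 1  # syndrome 0 means the overall-parity bit itself flipped
        return "corrected", fixed
    return "uncorrectable", list(word)
```

Under this assumed code, `check_bits_needed(32)` and `check_bits_needed(64)` give the seven and eight check bits mentioned above. A word with one flipped bit decodes as "corrected", while a word with two flipped bits is reported as "uncorrectable".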
Another technique for tolerating memory errors is bit-scattering, sometimes known as Chipkill detection and correction. Bit-scattering is a technique of allocating bits within an ECC word, such that any given ECC word contains no more than one bit from a given memory module. This technique ensures that even a catastrophic failure of a memory module, while it may cause multiple ECC words to have a correctable error, cannot by itself result in an uncorrectable memory error.
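The effect of bit-scattering can be sketched by comparing two hypothetical bit layouts and counting how many bits of a single ECC word are lost when one module fails. The layout functions, the 72-bit word width, and the four-bits-per-module packed layout used below are assumptions chosen for illustration and do not describe any particular memory system.

```python
from collections import Counter


def scattered_layout(num_words: int, bits_per_word: int) -> dict:
    """Bit b of every ECC word lives on module b (one bit per module per word)."""
    return {(w, b): b for w in range(num_words) for b in range(bits_per_word)}


def packed_layout(num_words: int, bits_per_word: int, bits_per_module: int) -> dict:
    """Each module supplies `bits_per_module` adjacent bits of every ECC word."""
    return {
        (w, b): b // bits_per_module
        for w in range(num_words)
        for b in range(bits_per_word)
    }


def worst_word_damage(layout: dict, failed_module: int) -> int:
    """Largest number of bits any single ECC word loses if `failed_module` dies."""
    damaged = Counter(
        word for (word, _bit), module in layout.items() if module == failed_module
    )
    return max(damaged.values(), default=0)
```

With four 72-bit words, `worst_word_damage(scattered_layout(4, 72), 10)` is 1, which a single-bit-correcting ECC can tolerate, whereas `worst_word_damage(packed_layout(4, 72, 4), 10)` is 4, which it cannot.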
The aforementioned techniques, while they correct the data actually used by the system, do not eliminate the errors at the memory module level. That is, with these techniques, a system that experienced a catastrophic memory module failure would constantly have a correctable error in each ECC word to which it contributes. Any error in any other module in any of these ECC words would then result in an uncorrectable error.
Another technique for tolerating memory errors is memory mirroring. Memory mirroring is a technique that requires twice as much memory in a system as will logically be seen by the operating system. Each memory write is actually sent to two different ECC words in separate memory hardware. An uncorrectable error in an ECC word is not fatal in such a system because the data can be re-fetched from the redundant ECC word. This technique gives very high tolerance to errors, but is an expensive approach, especially for systems with a large amount of memory.
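A minimal sketch of the mirroring behavior follows. The class `MirroredMemory`, the exception `UncorrectableError`, and the `inject_uncorrectable` helper are hypothetical names introduced here; the ECC check itself is not modeled, and an uncorrectable word is simply flagged.

```python
class UncorrectableError(Exception):
    """Raised when both copies of a word are flagged uncorrectable."""


class MirroredMemory:
    def __init__(self, visible_words: int):
        # Twice the visible capacity is physically present.
        self.primary = [0] * visible_words
        self.mirror = [0] * visible_words
        self.bad = set()  # (copy_name, address) pairs flagged as uncorrectable

    def physical_words(self) -> int:
        return len(self.primary) + len(self.mirror)

    def write(self, addr: int, value: int) -> None:
        # Every write is sent to both copies, held in separate hardware.
        self.primary[addr] = value
        self.mirror[addr] = value

    def inject_uncorrectable(self, copy_name: str, addr: int) -> None:
        """Test helper: corrupt one copy and flag it as uncorrectable."""
        getattr(self, copy_name)[addr] = -1
        self.bad.add((copy_name, addr))

    def read(self, addr: int) -> int:
        if ("primary", addr) not in self.bad:
            return self.primary[addr]
        if ("mirror", addr) not in self.bad:
            return self.mirror[addr]  # re-fetch from the redundant copy
        raise UncorrectableError(f"both copies of word {addr} are bad")
```

In this sketch, a read of an address whose primary copy is flagged still returns the written value from the mirror, at the cost of `physical_words()` being double the capacity visible to software.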
If a memory error is a random soft event, i.e., a fixable event, such as one caused by an alpha particle or a cosmic ray, it is possible to fix the memory error so that it is not encountered again. This can be done when a correctable memory error is encountered. It can also be done proactively, before the memory with the error is accessed by the operating system or system firmware. The most common technique for proactively fixing random soft memory errors is memory scrubbing. Memory scrubbing involves reading memory in a system, looking for an error, and writing back good “ECC corrected” data when an error is found.
Memory scrubbing can be accomplished by hardware in the background of the operating system during system operation. In such a technique, all of the memory in the system can be scrubbed regardless of how the memory is used by any software layer. Ideally, scrubbing can be performed without a performance impact. Some hardware scrubbing mechanisms may also be superior to software techniques in that they can tolerate encountering uncorrectable errors when reading memory during a scrub cycle and potentially fix one of the bits in the uncorrectable error before system software ever accesses the faulty ECC word.
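The read, check, and write-back cycle of a scrub pass can be sketched as follows. The model is deliberately abstract and is an assumption made for illustration: each `EccWord` records which of its bit positions are currently in error, and the ECC is assumed able to correct exactly one bad bit per word. The names `EccWord` and `scrub` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EccWord:
    good: int  # the value originally written
    flipped: set = field(default_factory=set)  # bit positions currently in error


def scrub(memory: list) -> tuple:
    """Run one scrub pass; return (corrected_addresses, uncorrectable_addresses)."""
    corrected, uncorrectable = [], []
    for addr, word in enumerate(memory):
        if not word.flipped:
            continue  # word read back clean
        if len(word.flipped) == 1:
            word.flipped.clear()  # write the ECC-corrected data back
            corrected.append(addr)
        else:
            uncorrectable.append(addr)  # beyond single-bit correction
    return corrected, uncorrectable
```

In this sketch, a word that takes one soft error, is scrubbed, and later takes a second soft error never holds more than one bad bit at a time. Without the intervening scrub pass, the two errors would accumulate in the same word and become uncorrectable.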
While the aforementioned techniques deal with correctable memory errors, some errors are uncorrectable, so the system needs a mechanism for avoiding errors that cannot be corrected. In addition, if the error remains in the system memory, it is worthwhile to avoid the error, even though the error may be correctable, to prevent a future alignment of the correctable error with another correctable error in the same ECC word, which would result in an uncorrectable error. Some techniques for avoiding a memory error include redundancy and deallocation.
Redundancy is perhaps the best mechanism for avoiding a memory error and involves substituting good memory for the faulty memory. This requires that there be some amount of redundant memory available. From a protection point of view, the best case is full memory redundancy. In systems with full memory redundancy, each memory write can be mirrored to a redundant module allowing complete memory protection even for uncorrectable errors. Full memory redundancy, however, is the most expensive technique for providing memory protection and is often not practical in large system environments where memory becomes too expensive to completely duplicate for protection purposes.
Other schemes for redundancy allow for some extra memory to be included in the system and used when needed. One such technique is redundant bit steering, or redundant bit line steering. Redundant bit steering presumes that a memory module has at least one spare memory bit. In this scheme, a memory module with a bad system memory bit could have the bit excluded from an ECC word and replaced with a system memory bit from a spare memory module. Having an entire spare memory module ensures that a catastrophic failure of a memory module could be entirely repaired by replacing each system bit with that from the spare memory module.
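One way to picture steering is as a routing table from the logical bits of an ECC word to the physical modules that hold them, with one spare module held in reserve. The sketch below is a simplified model under that assumption. The class `SteeredWord`, the stuck-at-zero failure model, and the requirement that the caller rewrite the data after steering are all illustrative choices rather than a description of actual steering hardware.

```python
class SteeredWord:
    """One ECC word whose bits each live on a different module, plus one spare."""

    def __init__(self, width: int):
        self.cells = [0] * (width + 1)   # the last cell belongs to the spare module
        self.route = list(range(width))  # logical bit -> physical module
        self.spare = width               # index of the unused spare, or None
        self.failed = set()              # modules modeled as stuck at 0

    def write(self, bits: list) -> None:
        for logical, bit in enumerate(bits):
            self.cells[self.route[logical]] = bit

    def read(self) -> list:
        return [
            0 if module in self.failed else self.cells[module]
            for module in self.route
        ]

    def fail_module(self, module: int) -> None:
        self.failed.add(module)

    def steer(self, logical_bit: int) -> None:
        """Exclude the bad bit and route the logical bit to the spare module."""
        if self.spare is None:
            raise RuntimeError("no spare module left")
        self.route[logical_bit] = self.spare
        self.spare = None
```

After `fail_module(3)` the word reads back with bit 3 stuck at 0. After `steer(3)` and a rewrite of the data, reads return the full word again because bit 3 now comes from the spare module.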
Absent actual redundancy, another mechanism for avoiding a memory error is to not allow the system to make use of the memory with the error by deallocating the memory that has the error. This mechanism is known as deallocation of memory and is typically done only in hardware when a system is restarted. Alternatively, deallocation of memory may be performed in software dynamically during system operation with the cooperation of the operating system.
Dynamic deallocation may be performed by making all of the memory available to the operating system while communicating to the operating system which portion of the memory should be avoided. This is typically done in terms of memory “pages,” where a memory page is a fixed-size collection of memory words at successive memory addresses. Thus, the deallocation of memory pages is referred to as memory page deallocation, or as dynamic memory page deallocation if it can be done during system operation when a memory error is detected, without needing to restart the computing system or operating system.
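The bookkeeping behind dynamic memory page deallocation can be sketched as a page allocator that is told which address failed and thereafter avoids the containing page. The 4096-byte page size, the class `PageAllocator`, and its method names are assumptions for illustration. A real operating system would also have to migrate the contents of a page that is in use, which this sketch omits.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageAllocator:
    def __init__(self, num_pages: int):
        self.free = set(range(num_pages))
        self.in_use = set()
        self.retired = set()  # pages that must be avoided

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("no usable pages left")
        page = min(self.free)
        self.free.remove(page)
        self.in_use.add(page)
        return page

    def release(self, page: int) -> None:
        self.in_use.remove(page)
        if page not in self.retired:
            self.free.add(page)  # retired pages never return to the free pool

    def report_error(self, address: int) -> int:
        """Mark the page containing `address` as one to avoid; return its number."""
        page = address // PAGE_SIZE
        self.retired.add(page)
        self.free.discard(page)
        return page
```

Only the single page containing the failing address is withdrawn, and no restart is required: subsequent calls to `allocate()` simply skip the retired page.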
Memory page deallocation may provide advantages over simply deallocating memory at the hardware level. Generally, memory page deallocation allows a smaller amount of memory to be deallocated than can be deconfigured at the hardware level. Hardware deallocation of memory can also affect the way the different ECC words are interleaved at an address level, and this may affect the performance of the computing system.
Some computers implement the concept of logical partitioning, which poses challenges for page deallocation. In logical partitioning, a single physical computer is permitted to operate essentially like multiple and independent virtual computers, referred to as logical partitions, with the various resources in the physical computer (e.g., processors, memory, and input/output devices) allocated among the various logical partitions. Each logical partition executes a separate operating system, and from the perspective of users and of the software applications executing on the logical partition, operates as a fully independent computer. Each of the multiple operating systems runs in a separate partition, which operates under the control of a partition manager or hypervisor.
Page deallocation requires the cooperation of the operating system of the logical partition, and therefore, the operating system must be executing in order to deallocate the page. But, in a logically-partitioned computer, a partition may have allocated pages even though the operating system for that partition is not necessarily executing. In addition, even if the operating system is executing, the operating system might not be able to deallocate the page because the page is in pinned or bolted memory. Further, an uncorrectable memory error can persist and may prevent the partition and its operating system from initializing, so that the operating system, whose cooperation is required to deallocate the page, never boots to the point where it can do so. Finally, if the uncorrectable error is in the boot, or initialization, path, the partition cannot IPL (initial program load) or initialize until the entire computer system is rebooted, which causes inconvenience and delay for all users of the computer system, not just users of the partition that encountered the memory error.
Thus, a different technique is needed for deallocating memory in logically-partitioned computers that have encountered uncorrectable errors.