1. Field of the Invention
The present invention relates generally to systems and methods for memory failure recovery, and more particularly to memory failure recovery using lock-step processes.
2. Discussion of Background Art
Memory Failure Recovery (MFR) is an area within the general field of fault-tolerant computer systems. Fault-tolerant computer systems or components incorporate backup hardware and/or software designed to be brought on-line quickly upon failure of primary hardware and software elements, with minimal loss of service. Well-known fault-tolerant systems and manufacturers include Compaq's Non-Stop product line, Marathon Technologies, and Stratus Computer.
Fault-tolerant techniques include periodically “check-pointing” critical data, duplexing selected hardware components such as microprocessors, mirroring disks, and “lock-stepping” multiple processors together. When a failure occurs, the fault-tolerant system ideally repairs itself, often without even interrupting internal processes or computer users.
MFR techniques also include fault-tolerant systems for recovering from memory hardware errors. Three kinds of memory hardware errors exist: design errors, hard errors, and soft errors. Design and hard errors are almost always fatal unless protected against, and repair techniques for them are typically limited to either refining the hardware's design or replacing the hardware component that failed. However, as design techniques and hardware reliability have improved, design and hard errors have become a dwindling portion of memory hardware errors.
Instead, in matured and refined hardware systems, soft errors account for a growing share, and often the largest share, of all three types of memory hardware errors. Soft errors occur on well-designed and reliable hardware that has been affected by one or more unpredictable events in the operating environment. For example, background radiation and cosmic rays can randomly and unpredictably interfere with memory hardware operation and/or corrupt data stored therein.
Soft errors are a particularly serious problem in low-profit-margin Commodity Off The Shelf (COTS) systems. Such systems typically have minimal, if any, hardware redundancy and/or error detection and correction systems, even though they are becoming ubiquitous tools within the office and home.
Mass-marketed systems have two simple forms of support for memory soft errors. For several years, memory systems have been available for commodity systems that use parity to detect the presence of errors in memory, or Error Correction Codes (ECC) to detect errors and correct single-bit errors. On an error, systems either bring themselves to an abrupt halt or raise a severe error signal in the processor. Low-cost processors, such as Intel IA-32 and IA-64 processors, now contain this signaling support, which is called a Machine Check Abort (MCA) exception. When an error is detected, this exception typically leads to a system halt performed by the operating system. Error correction codes are effective for detecting errors and correcting the simplest errors; however, the aforementioned support does not provide for recovery from errors that occur and cannot be corrected in hardware.
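The difference between detecting an error with parity and correcting a single-bit error with an error correction code can be illustrated with a minimal sketch. The code below is an assumption chosen for illustration only: it uses a textbook Hamming(7,4) code, which is not stated to be the code used in any particular commodity memory system, and the function names are hypothetical.

```python
def parity_bit(bits):
    """Even-parity bit: 1 when the number of set bits is odd."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is the 1-based position of the bit that was flipped
    back, or 0 when the syndrome reports no error.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = [1, 0, 1, 1]

    # Parity: a single flipped bit changes the parity, so the error is
    # detected, but nothing says which bit to repair.
    stored_parity = parity_bit(word)
    corrupted = [1, 0, 0, 1]
    print("parity mismatch:", parity_bit(corrupted) != stored_parity)

    # Hamming(7,4): the syndrome locates a single flipped bit.
    codeword = hamming74_encode(word)
    codeword[4] ^= 1  # simulate a soft error in position 5
    print(hamming74_decode(codeword))
```

In this sketch a single flipped bit is located and repaired, whereas two flipped bits in the same codeword would be silently miscorrected. Errors beyond the correction capability of the code are the case that, in the systems described above, ends in a halt or a severe error signal.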
FIG. 1 is a data-flow diagram of memory failure within such an IA-64 COTS computer system 100. A typical IA-64 computer system 100 includes a kernel process 102 in communication with a large number of other computer processes, such as process 104, over an input-output (I/O) channel 106. In response to a soft memory error 108 which corrupts process 104, the kernel 102 generates an MCA signal 110 which typically requires that process 104 be terminated. If process 104 served an application or some other top-level program or utility, such programs or utilities will then terminate, perhaps resulting in a substantial loss of important data which had not yet been saved. Even worse, process 104 could have been a key operating system process whose termination causes a system crash, requiring that the whole computer be rebooted. Such a drastic action not only results in a loss of important data and perhaps termination of network communications, but also results in a significant loss of time to the users of computer 100, who must not only reboot the computer, but also bring up the application programs again and perhaps re-enter data.
Lock-step processors, mentioned above, are one approach toward implementing fault-tolerant computing systems which can perhaps recover from some design and hard errors. Lock-step processors are found within Compaq's Himalaya Non-Stop series of computers and IBM's S/390 computer series. Lock-step processor systems include two hardware processors that are strictly synchronized cycle-by-cycle and execute exactly the same instruction each cycle. Lock-step systems also include a substantial amount of internal circuitry inside each of the processors for checking that the two lock-stepped processors are indeed operating consistently. Lock-step processors, however, are still vulnerable to memory hard and soft errors, since the two processors share memory resources. Thus, if the shared memory fails, the lock-step processors will not be able to recover and the computer must be rebooted. Furthermore, lock-step processor systems are very expensive, since duplication of very expensive and necessarily complex circuitry is required.
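The lock-step arrangement and its exposure to shared-memory errors can be sketched in software. The following is a toy model under stated assumptions, not a description of any vendor's circuitry: the `Replica` class, the two-instruction "load"/"add" instruction set, and the `fault_cycle` parameter are all hypothetical names introduced for illustration.

```python
class Replica:
    """A toy processor with a single accumulator register."""

    def __init__(self):
        self.acc = 0

    def step(self, instruction, memory):
        op, addr = instruction
        if op == "load":
            self.acc = memory[addr]
        elif op == "add":
            self.acc += memory[addr]
        else:
            raise ValueError(f"unknown operation: {op}")


def run_lockstep(program, memory, fault_cycle=None):
    """Run two replicas on the same instruction each cycle and compare them.

    fault_cycle (hypothetical) injects a transient fault into one replica.
    Returns ("mismatch", cycle) or ("ok", result).
    """
    a, b = Replica(), Replica()
    for cycle, instruction in enumerate(program):
        a.step(instruction, memory)
        b.step(instruction, memory)
        if cycle == fault_cycle:
            b.acc ^= 1  # simulated fault inside replica b only
        if a.acc != b.acc:  # the cycle-by-cycle consistency check
            return ("mismatch", cycle)
    return ("ok", a.acc)


if __name__ == "__main__":
    program = [("load", 0), ("add", 1), ("add", 2)]

    print(run_lockstep(program, [1, 2, 3]))
    print(run_lockstep(program, [1, 2, 3], fault_cycle=1))

    # A corrupted word in the shared memory is read identically by both
    # replicas, so they agree with each other on a wrong result.
    print(run_lockstep(program, [1, 2, 7]))
```

In this model the comparison catches a fault confined to one replica, but a corrupted word in the shared memory produces the same wrong value in both replicas, so the comparison alone offers no protection against it.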
Another approach toward fault-tolerant computing employs fail-over clusters. A fail-over cluster consists of at least two interconnected nodes/computers. The two nodes rely on intercommunication of shared data for recovery support. During normal operation, the two nodes share a predetermined portion of all processing tasks. Upon failure of one of the nodes, however, the other node assumes responsibility for all processing tasks. Such clusters also suffer from the same cost and complexity limitations, due to the node duplication required. Furthermore, upon a failure condition in such clusters, all processing tasks are switched over to the other node/computer, which may not always be desirable because of the resulting high load on that node.
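The load shift that occurs on fail-over can be shown with a minimal sketch. The round-robin split below is an assumption made for illustration (the text above says only that the nodes share a predetermined portion of the tasks), and the function name, node names, and task names are hypothetical.

```python
def assign_tasks(tasks, node_is_up):
    """Distribute tasks across the nodes that are currently up.

    node_is_up maps a node name to True (healthy) or False (failed).
    Tasks are split round-robin among healthy nodes, an assumed policy.
    """
    healthy = [name for name, up in node_is_up.items() if up]
    if not healthy:
        raise RuntimeError("no healthy node available")
    assignment = {name: [] for name in healthy}
    for index, task in enumerate(tasks):
        assignment[healthy[index % len(healthy)]].append(task)
    return assignment


if __name__ == "__main__":
    tasks = ["t1", "t2", "t3", "t4"]

    # Normal operation: both nodes carry part of the work.
    print(assign_tasks(tasks, {"node_a": True, "node_b": True}))

    # After node_a fails, node_b carries every task.
    print(assign_tasks(tasks, {"node_a": False, "node_b": True}))
```

With two nodes, the surviving node's task count doubles after a failure in this sketch, which is the high-load condition noted above.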
As a final example, Cornell University has developed a fault-tolerant computing technique based on “Hyper-Visors.” A Hyper-Visor is a software virtual machine that is instantiated between a computer's processor and the computer's operating system, and gives the illusion of multiple processors on one processor. In a typical fault-tolerant Hyper-Visor implementation, the processor hosting a copy of the Hyper-Visor is part of a complete system, but the Hyper-Visor gives the illusion of multiple processors sharing the rest of the system. The Hyper-Visor implements two or more virtual processors, each of which is able to run its own operating system, application program, and utility processes. During normal operation, only the first virtual processor interacts with system software and resources. Upon a failure on the first virtual processor, however, the backup virtual processor takes over and processing continues. As in the fail-over cluster technique, all application jobs are switched over to the other Hyper-Visor processor. However, since the virtual processors share resources, such as memory and disks, errors in these resources may affect both virtual machines. Lastly, virtual machines must present a fault isolation boundary to be effective for fail-over support. Unfortunately, this requires hardware support for the virtual machine monitor, and critical system errors such as memory errors may not be isolatable.
In response to the concerns discussed above, what is needed is a system and method for memory failure recovery that overcomes the problems of the prior art.