1. Field of the Invention
The present invention relates generally to systems and methods for recoverable programming, and more particularly to a recoverable programming system and method for memory system failures in multi-processor computer systems.
2. Discussion of Background Art
Demand for increased performance and high availability of commodity computers is increasing with the ubiquitous use of computers and the Internet services which serve them. While commodity systems are tackling the performance issues, availability has received less attention. It is a common belief that software (SW) errors and administration time are, and will continue to be, the most probable cause of the loss of availability. While such failures are clearly commonplace, especially in desktop environments, it is believed that certain other hardware (HW) errors are also becoming more probable.
Processors, caches, and memories are becoming larger, faster and denser, while being increasingly used in ubiquitous and adverse environments such as at high altitudes, in space, and in industrial applications. Articles, such as Ziegler, J. F., et al., “IBM Experiments in Soft Fails in Computer Electronics (1978-1994)”, IBM Journal of R&D, vol 40, no 1, pp 3-18, January 1996, and Ziegler, J. F., “Terrestrial Cosmic Rays”, IBM Journal of R&D, vol 40, no 1, pp 19-40, January 1996, have shown that these changes will lead to increased transient errors in CMOS memory due to the effects of cosmic rays, approximately 6000 FIT (1 FIT equals 1 failure in 10^9 hours) for one 4 Mbit DRAM.
Tandem (see, Compaq Corporation, “Data Integrity for Compaq NonStop Himalaya Servers”, White Paper, 1999) indicates that such errors also apply to processor cores or on-chip caches at modern die sizes/voltage levels. They claim that processors, caches, and main memory are all susceptible to high transient error rates. A typical processor's silicon can have a soft-error rate of 4000 FIT, of which approximately 50% will affect processor logic and 50% the large on-chip cache. Due to increasing speeds, denser technology, and lower voltages, such errors are likely to become more probable than other single hardware component failures. With the increasing evolution to larger tightly interconnected commodity machines (such as Sun's Enterprise 10000 machines), the probability of soft-errors and error containment problems increases further. Soft-error probability increases not only due to increased system scale, but also due to an increased number of components on the memory access path. Since the machines are tightly coupled, memory path soft-errors introduce error containment problems which, without some form of software error containment, can lead to complete loss of availability.
Techniques such as Error Correction Codes (ECC) and ChipKill (see, Dell, T. J., “A White Paper on the benefits of Chipkill Correct ECC for PC Server Main Memory”, IBM Microelectronics Division, November 1997) have been used in main memories and interconnects to correct some of these errors (90% for ECC). Unfortunately, such techniques only help reduce visible error rates for semiconductor elements that can be covered by such codes (large storage elements). With raw error rates increasing with technological progress and more complicated interconnected memory subsystems, ECC is unable to address all the soft-error problems. For example, a 1 Gb memory system based on 64 Mbit DRAMs still has a combined visible error rate of 3435 FIT when using Single Error Correct Double Error Detect (SECDED) ECC. This is equivalent to around 900 errors in 10000 machines in 3 years. Unfortunately, current commodity hardware and software provide little to no support for recovery from errors not covered by ECC, whether detected or not. Such problems have been considered by mainframe technology for years, but in the field of commodity hardware, it is currently not cost effective to provide full redundancy/support in order to mask errors. Therefore, the burden falls to commodity hardware and the software using it to attempt to handle these errors for the highest availability.
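The conversion from a FIT rate to an expected error count over a fleet of machines can be checked with a short calculation. The following is only an illustrative sketch: the function name expected_errors and the constant HOURS_PER_YEAR are assumptions introduced here, and a year is taken as 8760 hours.

```python
# Illustrative arithmetic only; names are hypothetical, not from the cited work.
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def expected_errors(fit, machines, years):
    """Expected error count for a fleet, given a per-machine FIT rate.

    1 FIT is 1 failure per 10^9 device-hours.
    """
    device_hours = machines * years * HOURS_PER_YEAR
    return fit * device_hours / 1e9


# 3435 FIT per machine, 10000 machines, 3 years of operation.
errors = expected_errors(fit=3435, machines=10_000, years=3)
print(round(errors))  # about 900, matching the figure quoted above
```

Under these assumptions the result is roughly 903 errors, which agrees with the figure of around 900 stated above.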
Most contemporary commodity computer systems, while providing good performance, pay little attention to availability issues resulting from such errors. For example, the IA-32 architecture supports only ECC on main memory rather than across the system, requiring a system reboot on errors not covered by this ECC. Consequently, commodity software such as the OS, middleware and applications has been unable to deal with the problem. Future commodity processor architectures may provide support to detect and notify the system of such probable errors. For instance, upcoming IA-64 processors, while not recoverable in the general case, do offer some support with certain limitations.
Availability in computer systems is determined by hardware and software reliability. Hardware reliability has traditionally existed only in proprietary servers, with specialized redundantly configured hardware and critical software components, possibly with support for processor pairs (see, Bartlett, J., “A Nonstop Kernel”, Proceedings of the Eighth Symposium on Operating Systems Principles, Asilomar, CA, pp 22-29, December 1981), e.g. IBM S/390 Parallel Sysplex (see, Nick, J. M., et al., “S/390 Cluster Technology: Parallel Sysplex”, IBM Systems Journal, vol 36, no 2, pp 172-201, 1997), and Tandem NonStop Himalaya (see, Compaq, Product description for Tandem Nonstop Kernel 3.0. Download February 2000, http://www.tandem.com).
Sysplex supports hot swap execution, redundant shared disk with fault-aware system software for error detection and fail-over restart. Tandem supports redundant fail-over lock-stepped processors with a NonStop kernel and middleware, which provide improved integrity through the software stack. These systems provide full automatic support to mask the effects of data and resource loss. They rely on reliable memory and fail-over rather than direct memory error recovery. Another approach is fault containment and recovery with “node” granularity. In these systems, each node has its own kernel. When one node fails, the others can recover and continue to provide services. Systems of this type include the early cluster systems (see, Pfister, G., “In Search of Clusters”, Prentice Hall, 1998), and NUMA architectures, such as Hive (see, Chapin, J., et al., “Hive: Fault Containment for Shared Memory Multiprocessors,” Proc. of the 15th SOSP, December 1995, pp 12-25, and Teodosiu, D., et al., “Hardware Fault Containment in Scalable Shared Memory Multiprocessors,” Proc. of the 24th ISCA, pp 73-84, June 1997). Hardware faults are difficult to catch and repeat.
Software reliability has been more difficult to achieve in commodity software even with extensive testing and quality assurance. Commodity software fault recovery has not evolved very far. Most operating systems support some form of memory protection between units of execution to detect and prevent wild reads and writes. But most commodity operating systems, such as Windows 2000 and Linux, have not tackled the problems of memory errors themselves or taken up software reliability research in general. They typically rely on failover solutions, such as Wolfpack by Microsoft. A lot of work has been undertaken in the fault-tolerant community regarding the problems of reliability and its recovery in software (see, Brown, N. S. and Pradhan, D. K. “Processor and Memory-Based Checkpoint And Rollback Recovery”, IEEE Computer, pp 22-31, February 1993; Gray, J., and Reuter, A., “Transaction Processing: Concepts and Techniques,” Morgan Kaufmann, 1993; and Kermarrec, A. M., et al., “A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability”, Proc. of the 25th FTCS, pp 289-298, June 1995).
These include techniques such as checkpointing and backward error recovery. A lot of this work has been conducted in the context of distributed systems rather than in single systems. There are also techniques for efficient recoverable software components, e.g. the Rio file cache (see, Chen, P. M., et al., “The Rio File Cache: Surviving Operating System Crashes”, Proc. of the 7th ASPLOS, pp 74-83, October 1996), and Recoverable Virtual Memory (RVM) (see, Satyanarayanan, et al., “Lightweight Recoverable Virtual Memory”, Proc. SOSP, pp 146-160, December 1993).
Rio takes an interesting software-based approach to fault containment aimed at a fault-tolerant file cache, but with general uses. By instrumenting access to shared data structures with memory protection operations, wild access to the shared data structures becomes improbable.
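The idea of bracketing every legitimate access with protection operations can be illustrated with a toy model. This is a user-level simulation under stated assumptions, not Rio's implementation: the class ProtectedBuffer and its write_window method are hypothetical names, and a Python flag stands in for the page-level memory protection that a real system would use.

```python
from contextlib import contextmanager


class ProtectedBuffer:
    """Toy stand-in for a write-protected shared data structure.

    A boolean flag simulates memory protection; no real page
    protection is involved.
    """

    def __init__(self, size):
        self._data = bytearray(size)
        self._writable = False  # protected by default

    @contextmanager
    def write_window(self):
        """Open the region for writing, then re-protect it on exit."""
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False

    def write(self, offset, payload):
        if not self._writable:
            raise PermissionError("write outside an open window")
        if offset < 0 or offset + len(payload) > len(self._data):
            raise ValueError("write out of bounds")
        self._data[offset:offset + len(payload)] = payload

    def read(self, offset, length):
        return bytes(self._data[offset:offset + length])


cache = ProtectedBuffer(16)

# Instrumented access: the window is opened only around the intended write.
with cache.write_window():
    cache.write(0, b"ok")

# A stray write from uninstrumented code is rejected.
try:
    cache.write(2, b"!!")
    wild_write_blocked = False
except PermissionError:
    wild_write_blocked = True
```

In this model the data is writable only for the duration of an instrumented access, so an errant write elsewhere in the program is rejected rather than silently corrupting the structure.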
Other methods for handling memory errors include a try-except block solution. In general, however, the try-except mechanism by itself is not sufficient to handle memory failures. The state that must be saved to recover from a memory failure (on the IA-64 architecture, for example) is more extensive than what a try-except block can capture. Saving that state is therefore an expensive operation in terms of system overhead.
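A try-except block combined with an explicit checkpoint can be sketched as follows. This is a minimal illustration under stated assumptions: SimulatedMemoryFault is a hypothetical exception standing in for a hardware error notification, and the sketch saves only program-visible state held in a dictionary, not processor or machine state.

```python
import copy


class SimulatedMemoryFault(Exception):
    """Hypothetical stand-in for a hardware memory-error notification."""


def run_with_checkpoint(state, operation):
    """Run operation(state); roll state back if a simulated fault occurs.

    Returns True on success and False if the state was rolled back.
    """
    checkpoint = copy.deepcopy(state)  # cost grows with the size of the state
    try:
        operation(state)
        return True
    except SimulatedMemoryFault:
        # Backward error recovery: restore the saved copy in place.
        state.clear()
        state.update(checkpoint)
        return False


def faulty_update(state):
    state["balance"] -= 40  # partial update before the fault
    raise SimulatedMemoryFault("fault after partial update")


account = {"balance": 100, "log": []}
succeeded = run_with_checkpoint(account, faulty_update)
```

Even in this toy form, every protected operation pays for a full copy of the state up front, which hints at why saving a more extensive state is costly.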
Since current responses to memory failures are costly to invoke and execute, do not guarantee recovery in all cases on next generation processors, such as IA-64, and cannot recover at all on current generations of commodity processors, such as the IA-32 family, what is needed is a system and method for recoverable programming that overcomes the problems of the prior art.