The present invention relates generally to computing systems, and more particularly to a multiple processing system that incorporates a number of processing and input/output units interconnected in a clustered environment and employing a distributed operating system in a manner that provides posthumous recovery strategies for failed units of the system.
Distributed computing systems, which for some time have held out the hope of more highly available computing environments, are now realizing that hope. By distributing computing resources among a group of processor environments, a system can continue to make those computing resources available, even in the face of a failure of one of the computing environments, by calling into operation a copy of a computing resource that may be lost with that failure. However, in such systems the addition of processing components can reduce availability if the underlying components are not engineered for fault detection, isolation, and recovery. Traditionally, such distributed or clustered environments are visible in application space. That is, application programs and processes must be developed with the awareness that they are executing inside a cluster of independent computing elements, or "nodes," in order to scale performance.
A recent extension to clustered environments is the application of transparent network computing ("TNC"), which operates to enable a collection of independent computing environments, both uniprocessor and multiprocessor, and any associated processing elements and processes, to be organized in a single system image ("SSI"). TNC (an SSI clustering environment) has historical ties to the Transparent Computing Facility (TCF) in the IBM AIX operating system for PS/2 and IBM 370 systems. (A more complete discussion of TCF can be found in "Distributed UNIX Transparency: Goals, Benefits, and the TCF Example," by Walker, and J. Popek; Proceedings of the Winter 1989 Uniforum Conference.) TCF has been replaced by TNC, and early versions of TNC appeared in the OSF/1 AD kernel for the Intel Paragon. The latest versions of TNC have been ported to UNIX SVR4.2 ES/MP and Unixware 2.1 versions of the UNIX software and are incorporated in a loosely-coupled UNIX software package called "Non Stop Clusters for SCO Unixware" ("NSC"), a single system image clustering technology from Tandem Computers Incorporated, Cupertino, Calif. The TNC technology, as available, has been enhanced to remove the risk associated with single points of failure within the distributed operating system. This effort, as contained in the NSC software, represents a collection of software subsystems that provide scalability, availability, and manageability for a single system image cluster.
TNC technology enables a collection of processors to operate so that all machine resources are presented as a single system. The boundaries between platforms, the interconnections, and the complexity of programming and managing a collection of independent processors are eliminated. Applications are unaware that they are executing any differently than they would in any uniprocessor or symmetrical multiprocessing (SMP) environment. They are able, nevertheless, to benefit from distributed resources as though they existed in a single system.
The single system image provided by TNC enables efficient process migration, since all resources appear the same regardless of the node on which a process happens to be executing. The single system image and process migration are critical to features such as manageability and scalability. System management is simplified during software upgrades, because processes can be migrated around the cluster rather than being terminated and restarted. Once a node has migrated all of its processes and operating system resources to other nodes, it can be safely upgraded. Later, processes can be migrated back to the original node or redistributed by another NSC feature, automatic load balancing. As processes are balanced over the SSI cluster, system throughput increases. The increase in performance from load balancing is not limited to the migration of processes. Other operating system resources, such as shared memory or devices, are also migrated in order to scale performance as more nodes are introduced to the cluster. Since the balancing is performed by the operating system, the increased performance is transparent to the applications. That is to say, no special clustering code is required to improve performance within the single system image cluster.
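By way of illustration only, the upgrade sequence described above (migrate a node's processes elsewhere, upgrade the emptied node, then rebalance) may be sketched as follows. This is a minimal model under stated assumptions and not the NSC implementation: the cluster is represented as a plain mapping of node name to a list of process identifiers, load is approximated by process count, at least two nodes are assumed, and the names `drain`, `rebalance`, `rolling_upgrade`, and the `upgrade` callback are hypothetical.

```python
def drain(cluster, node):
    """Migrate every process off `node` onto the least-loaded other node."""
    others = [n for n in cluster if n != node]  # assumes at least one other node
    while cluster[node]:
        pid = cluster[node].pop()
        target = min(others, key=lambda n: len(cluster[n]))
        cluster[target].append(pid)


def rebalance(cluster):
    """Move processes from the busiest node to the idlest node until
    per-node process counts differ by at most one."""
    while True:
        busiest = max(cluster, key=lambda n: len(cluster[n]))
        idlest = min(cluster, key=lambda n: len(cluster[n]))
        if len(cluster[busiest]) - len(cluster[idlest]) <= 1:
            return
        cluster[idlest].append(cluster[busiest].pop())


def rolling_upgrade(cluster, upgrade):
    """Upgrade each node in turn without terminating any process."""
    for node in list(cluster):
        drain(cluster, node)   # node now hosts no processes
        upgrade(node)          # hypothetical callback performing the upgrade
        rebalance(cluster)     # stand-in for automatic load balancing


# Usage: three nodes with an uneven initial process distribution.
cluster = {"node_a": [101, 102, 103], "node_b": [104, 105], "node_c": [106]}
upgraded = []
rolling_upgrade(cluster, upgraded.append)
print(upgraded)
print(cluster)
```

In this sketch every process identifier survives the upgrade of all three nodes, which mirrors the point that processes are migrated rather than terminated and restarted. Process count is used as the load metric purely for simplicity; an actual load balancer would weigh other factors.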
Process migration, however, has its limits. Typically, it is only the address space and execution threads of a process that are moved from one node to another. Other resources related to the migrated process (e.g., open files, inter-process communication facilities, shared memory descriptors) stay behind. Thus, should a node fail, the resources of that node may be lost, and any processes that have been migrated from that node will be hampered or similarly lost.
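The limitation just described may be illustrated with a small model. It is a sketch under stated assumptions rather than a description of any actual kernel data structure: the `Process` record, the `resource_homes` mapping, and the functions `migrate` and `impact_of_failure` are hypothetical names introduced solely for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    exec_node: str  # node holding the address space and execution threads
    # resource name -> node that still hosts that resource
    resource_homes: dict = field(default_factory=dict)


def migrate(proc, target):
    """Move only the address space and threads; other resources stay behind."""
    proc.exec_node = target


def impact_of_failure(proc, failed_node):
    """Classify the effect of a node failure on a single process."""
    if proc.exec_node == failed_node:
        return "lost"
    stranded = [name for name, home in proc.resource_homes.items()
                if home == failed_node]
    return "hampered" if stranded else "unaffected"


# A process starts on node_a, opens resources there, then migrates to node_b.
p = Process(pid=42, exec_node="node_a",
            resource_homes={"open_file": "node_a", "shm_descriptor": "node_a"})
migrate(p, "node_b")

print(impact_of_failure(p, "node_a"))  # hampered: its resources were on node_a
print(impact_of_failure(p, "node_c"))  # unaffected: nothing it uses was there
```

Although the process itself no longer executes on the original node, a failure of that node still strands the open file and shared memory descriptor on which the process depends.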
In the past, a computing element (i.e., processor unit) of a distributed or clustered system that experienced a catastrophic system failure, or system panic, has been lost to the system. Such failures have resulted in valuable information stored in computer memory being irreclaimable to the remainder of the system. The loss of such information has impeded continuous computing availability. Examination of the memory of the failed unit has been limited, at best, to the use of interactive debugging tools. No attempt has been made to recover the valuable information stored in the memory of the failed unit.
Thus, it can be seen that there exists a need for a method of automatically recovering the contents of computer memory that were previously considered lost. Preferably, such a method should incur none of the run-time overhead associated with traditional checkpointing techniques.