The present invention relates generally to distributed computing systems, and more particularly, to management techniques for services, such as fault tolerance and fault recovery services, that may be utilized by an application process executing in a distributed system.
Increasingly, software applications must be resistant, or at least tolerant, to software faults. Users of telecommunication switching systems, for example, demand that the switching systems be continuously available. In addition, where transmissions involve financial transactions, such as for automated teller machines (ATMs), or other sensitive data, customers also demand the highest degree of data consistency.
While software testing and debugging tools provide an effective basis for detecting many programming errors during the software development stage which may lead to a fault in the user application process, no amount of verification, validation or testing during the software debugging process will detect and eliminate all software faults and give complete confidence in a user application program. Accordingly, residual faults due to untested boundary conditions, unanticipated exceptions and unexpected execution environments have been observed to escape the testing and debugging process and, when triggered during program execution, will manifest themselves and cause the application process to crash or hang, thereby causing service interruption.
It is therefore desirable to provide mechanisms that allow a user application process to recover from a fault with a minimal amount of lost information. To that end, a number of checkpointing and restoration techniques have been proposed to recover more efficiently from hardware and software failures. For a general discussion of checkpointing and rollback recovery techniques, see R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems,” IEEE Trans. Software Eng., Vol. SE-13, No. 1, pp. 23-31 (January 1987). Generally, checkpoint and restoration techniques periodically save the process state during normal execution, and thereafter restore the saved state following a failure. In this manner, the amount of lost work is limited to the progress made by the user application process since the restored checkpoint was taken.
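By way of illustration only, the periodic save and subsequent restore described above may be sketched as follows. This is a minimal single-process sketch and not any particular technique from the cited literature. The function names (`save_checkpoint`, `restore_checkpoint`, `run`), the use of a local file as stable storage, and the `fail_at` parameter used to simulate a fault are all assumptions introduced for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the process state to `path`, replacing any older checkpoint atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def restore_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the step indices 0..total_steps-1, checkpointing periodically.

    `fail_at` is a hypothetical hook that raises at a given step to simulate a fault.
    """
    state = restore_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

If `run` fails at step 7 with a checkpoint interval of 5, a subsequent call resumes from step 5 rather than from step 0, so only the work of steps 5 and 6 is repeated.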
As applications have become more sophisticated and distributed, their design and implementation have become a complex task. In a distributed computing environment, processes from heavily loaded machines can be migrated to more lightly loaded machines in order to utilize the available computing resources more efficiently. In addition, the availability of alternative machines in a distributed computing environment allows a process that has failed to be migrated to an alternative processor and restored there from a checkpointed state.
To facilitate the development of distributed applications, many middleware techniques and platforms have been proposed, such as the increasingly popular Common Object Request Broker Architecture (CORBA). Although CORBA eases the development of distributed applications, CORBA does not currently address the reliability and availability requirements found in many applications, especially in the telecommunications world. In order to improve the reliability and availability of applications, some researchers have implemented Object Request Brokers (ORBs) based on the concept of group communication and virtual synchrony. For a more detailed discussion of ORB-based reliability and availability techniques, see, for example, S. Maffeis, “Piranha: A CORBA Tool For High Availability,” IEEE Computer (April 1997), or S. Maffeis and D. C. Schmidt, “Constructing Reliable Distributed Communication Systems With CORBA,” IEEE Communications Magazine, vol. 14, no. 2 (February 1997).
Another approach to providing fault tolerance to CORBA applications is a service approach that extends the existing set of CORBA services with a fault tolerance service. The service approach defines a set of objects and object interfaces to provide fault tolerance, referred to as the Fault Tolerance Service (FTS). An FTS system is implemented as a collection of interacting CORBA objects that detect CORBA object failures and host failures, and recover CORBA objects from such failures. An application developer may improve the reliability of an application by using the FTS service to implement fault-tolerant CORBA objects.
Although the FTS service effectively detects CORBA object failures and host failures, and recovers CORBA objects from such failures, the FTS service, as well as other CORBA services, suffers from a number of limitations, which, if overcome, could greatly expand the utility and efficiency of such services. In particular, few, if any, existing CORBA services have exploited the advantages of responding to run-time environmental conditions.
Generally, one or more service managers are disclosed that provide a management interface to corresponding middleware services. According to an aspect of the invention, the service manager monitors the corresponding middleware service, as well as the underlying distributed computer environment on which an application process that utilizes the middleware service is executing. The data received from the middleware service permits the service manager to monitor the operation of the middleware service. In addition, the information received from the underlying distributed computer environment allows the middleware service to operate more efficiently, in response to run-time environmental conditions. The addition of separate management utilities to middleware services improves the operation of the service and results in a three-step architecture: the base application process, a middleware service, and the service manager.
In one illustrative implementation, a fault-tolerance service manager provides a management interface to a fault-tolerance service (FTS). The fault-tolerance service permits an application developer to enhance the availability and reliability of an application built on top of the middleware platform. While the registration process for a conventional fault-tolerance service typically relies on static (usually hard-coded) information, the present invention allows the registration and replica management of application objects to be performed based on run-time environmental conditions.
Generally, the fault-tolerance service manager monitors the fault-tolerance service, as well as the underlying distributed computer environment. In this manner, the fault-tolerance service manager can make globally optimal decisions, based on received run-time data, and provide the resulting information (processed data or specific decisions) to the fault-tolerance service. The present invention allows the fault-tolerance service to tolerate failures using a failure-prevention approach, whereby the fault-tolerance service takes corrective action and migrates CORBA objects if the fault-tolerance service manager detects that an object's local host may crash soon. In one embodiment, the likelihood of a host failure is determined based on a health rating of the respective host or other system components.
The fault-tolerance service manager obtains data about the operation of the fault-tolerance service, such as the names, number and type of registered objects, and the location and status of various entities within the fault-tolerance service, such as watchdogs and the super watchdog. In addition, the fault-tolerance service manager collects additional information about the underlying computing platform, such as the status of the operating system resources, the instantaneous load, failure rate or performance of one or more machines, or the status of the communication links in the computing environment, processes the collected environmental information and provides feedback to the fault-tolerance service. The collected environmental data can be used to determine a health rating of components within the computing environment which can be utilized, for example, to select an optimal machine for migration, or to trigger migration or additional replication in the event the health rating indicates that a failure is expected.
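By way of illustration only, the use of collected environmental data to derive a health rating, to trigger migration, and to select a target machine may be sketched as follows. The text above does not specify how a health rating is computed. The weighted formula, the 0.0 to 1.0 scale, the threshold value, and all names (`HostStats`, `health_rating`, `select_migration_target`, `plan_migration`) are therefore assumptions made for the example rather than the disclosed method.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    """Environmental data for one machine (all fields are illustrative)."""
    name: str
    load: float          # instantaneous load, normalized to 0.0-1.0
    failure_rate: float  # observed failure rate, normalized to 0.0-1.0
    link_up: bool        # status of the communication link to the host


def health_rating(stats, load_weight=0.5, failure_weight=0.5):
    """Return a rating in [0.0, 1.0]; higher means healthier (assumed formula)."""
    if not stats.link_up:
        return 0.0
    penalty = load_weight * stats.load + failure_weight * stats.failure_rate
    return max(0.0, 1.0 - penalty)


def select_migration_target(hosts, exclude):
    """Pick the healthiest host other than `exclude`, or None if there is none."""
    candidates = [h for h in hosts if h.name != exclude]
    if not candidates:
        return None
    return max(candidates, key=health_rating)


def plan_migration(hosts, current, threshold=0.3):
    """Return a target host name if the current host looks likely to fail, else None."""
    by_name = {h.name: h for h in hosts}
    current_rating = health_rating(by_name[current])
    if current_rating >= threshold:
        return None  # current host is healthy enough; no corrective action
    target = select_migration_target(hosts, exclude=current)
    if target is None or health_rating(target) <= current_rating:
        return None  # no healthier machine is available
    return target.name
```

In this sketch the manager only produces a recommendation. Acting on it, for example by migrating or further replicating an object, would remain the responsibility of the fault-tolerance service.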
In one preferred implementation, the fault-tolerance service manager does not change or perform any of the functions of the fault-tolerance service, nor does the fault-tolerance service manager assume any responsibility for decisions related to the fault-tolerance mechanisms. Thus, the fault-tolerance service performs its own intended functions even in the absence of the fault-tolerance service manager. Since the fault-tolerance service manager has minimal interference with the fault-tolerance service, the fault-tolerance service manager may use existing management technology, such as the Simple Network Management Protocol (SNMP).