1. Field of the Invention
The present invention relates to computer hardware and software, and more particularly to a method and system for recovering the state of object-oriented software in the face of partial or total failure of the underlying computing platform.
2. Description of the Prior Art
Failure of a computer can often result in the loss of significant amounts of data and intermediate calculations. The cause of failure can be either hardware or software related, but in either instance the consequences can be expensive, particularly when data manipulations are interrupted in mid-stream. In the case of large software applications, a failure might require an extensive effort to regenerate the status of the software and data prior to the failure. Several techniques have been developed to address this problem, and are disclosed in the following issued U.S. Patents:
U.S. Pat. No. 5,594,861 discloses an error handling system in a telecommunications exchange. Certain objects within software applications are defensively programmed to detect and report errors. An error handler object provides process centralized error handling functionality, and is configured to determine and specify a recovery for returning the software application to a well defined state.
U.S. Pat. No. 5,151,987 discloses a system and method for recovering objects in an object oriented computing environment. A recovery from an unplanned failure is executed by storing recovery information in recovery objects. The recovery information is limited to only that information which is necessary to recover from unplanned failures.
U.S. Pat. No. 5,469,562 discloses a system that provides recovery from the effects of incompletely executed transactions in the event of a fault. During execution, certain data is stored in persistent memory. During fault recovery, the system calls the agent specific procedures, as needed, using the recovery and recovery sequence information stored during normal transaction execution.
U.S. Pat. No. 4,814,971 discloses a virtual memory recovery system wherein periodic checkpoints are taken of the state of a computer system. If a system crash occurs, the machine state can be rolled back to the checkpoint state and normal operation restarted. Modifications made after the checkpoint time are discarded when the system state is rolled back to the saved checkpoint state.
As used herein, the term “persistent” refers to a computer memory storage device that can withstand a power reset without loss of its contents. Persistent memory devices have been used to store data for starting or restarting software applications. In simple systems, the contents of persistent memory are static and are not modified as the software executes. The initial state of the software environment is stored in persistent memory. In the event of a power failure to the computer or some other failure, the software restarts its execution from the initial state. One problem with this approach is that all intermediate calculations will have to be recomputed. This can be particularly onerous if large amounts of user data must be reloaded during this process. If any of the user data is no longer available, it may not be possible to reconstruct the pre-failure state.
More sophisticated executable programs might dynamically update the configuration of persistent memory. The updates can take the form of a “snapshot,” or duplicate, of the entire contents of the relevant portion of computer memory. The updates can also be limited to certain key intermediate results. This allows for more efficient software recovery because intermediate calculations can be stored and then recovered from persistent memory. During the recovery process, the software restarts from its last saved state. For example, whenever a large batch of data is processed, a snapshot of the current state of the software can be stored to persistent memory. In the event of a failure, the large batch of data will not have to be reloaded and reprocessed.
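By way of illustration only, the batch-oriented snapshot approach described above might be sketched as follows. The sketch assumes a JSON file stands in for persistent memory; the function names (save_checkpoint, load_checkpoint, process_batches), the state fields, and the fail_at parameter used to simulate a failure are hypothetical and do not come from any of the cited patents.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the snapshot to a temporary file, then rename it into place,
    so that a crash during the write leaves the previous snapshot intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or the initial state if none exists."""
    if not os.path.exists(path):
        return {"next_batch": 0, "running_total": 0}
    with open(path) as f:
        return json.load(f)


def process_batches(batches, path, fail_at=None):
    """Process batches, saving a snapshot after each one.

    fail_at is a hypothetical hook that simulates a failure just before
    the batch with that index is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_batch"], len(batches)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["running_total"] += sum(batches[i])
        state["next_batch"] = i + 1
        save_checkpoint(path, state)
    return state["running_total"]
```

Calling process_batches a second time with the same path resumes from the last saved batch index, so batches completed before the failure are not reprocessed.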
Another class of solutions is to have redundant hardware configurations. In the event of a hardware failure, the redundant processors can take over the functions of the failed hardware. Ideally, this should happen with no human interaction, but in any event, within a time frame consistent with “high availability” objectives. Most of these schemes depend upon a predefined “configuration” record, together with copies of the state of the “lost” programs. Thus, the relationships are static, and often the recovered programs proceed using the last known state of the failed program. Often this happens without the recovered program having a record that a recovery has taken place.
Object-oriented programming environments present unique challenges for software recovery efforts. Software objects are typically encapsulated blocks of code that can be saved in persistent storage. In the event of a failure, reloading the saved objects from persistent storage can often recover the objects themselves. Some examples of this approach can be found in the set of UNIX start scripts and user preferences used by end-user type applications. However, object-oriented software environments typically have rich inter-object relationships. These relationships are established due to the logical dependencies between the objects. A simple recovery strategy can successfully recover the objects, but will not recover the inter-object relationships.
A desirable system for software recovery in object-oriented environments would recover the objects themselves, along with the inter-object relationships. The present invention addresses this need.
In an object-oriented software environment, the present invention is a unified technique that addresses both state recovery and relationship recovery. It operates at the level of constituent components of the executing program, which are generally objects within an object-oriented software environment. The recovery system allows each component to take cognizance of any environmental changes that may have occurred between the failure and the recovery. Thus, the present invention is well suited to enterprise-class distributed systems with extensive object relationships, particularly when the software needs to be robust in the face of failures in various parts of the system.
The present invention restores objects, along with inter-object relationships, by intelligently rebuilding the software state based on fundamental, or “essential”, information stored in persistent storage. Each object does not have to be restored to its exact pre-failure state. It is possible to make intelligent recovery decisions based on the state of the system after it is recovered. Thus, it is possible to make the system robust to certain hardware or software failures, since the system can intelligently compensate for the failure of individual elements.
According to the present invention, objects, and values within an object, are deemed “essential” or “non-essential” based on the logical structure of the software. Values that can be recreated by reference to other values are “non-essential” because they can be recreated in the event of software failure. Values that must be stored in order to recreate the state of the software are “essential.” By extension, any object that contains an essential value is an “essential object.”
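The distinction between essential and non-essential values might be illustrated, purely as a sketch, by tagging the fields of a class. The Account and DisplayCache classes, the "essential" metadata key, and the helper functions below are all hypothetical names chosen for illustration; the disclosure does not prescribe any particular marking mechanism.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Account:
    # Essential: cannot be derived from any other stored value.
    account_id: str = field(metadata={"essential": True})
    deposits: list = field(default_factory=list, metadata={"essential": True})
    # Non-essential: can be recreated by summing the deposits.
    balance: float = field(default=0.0, metadata={"essential": False})


@dataclass
class DisplayCache:
    # Contains no essential values, so it is a non-essential object.
    rendered_text: str = ""


def essential_values(obj):
    """Collect only the values that would have to be kept in persistent storage."""
    return {
        f.name: getattr(obj, f.name)
        for f in fields(obj)
        if f.metadata.get("essential")
    }


def is_essential_object(obj):
    """An object is essential if it contains at least one essential value."""
    return any(f.metadata.get("essential") for f in fields(obj))
```

In this sketch, the balance of an Account is omitted from essential_values because it can be recomputed from the deposits after a failure.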
Essential objects are stored in persistent storage. In addition, each essential object updates its essential values to persistent storage according to a schedule that takes into account the logic required for reconstructing the object. For purposes of this disclosure, the process of updating the essential values in persistent storage is called “pickling” the values.
After a failure, there is a two phase process for recovering the software. “Phase 1” recovery involves restoring, from persistent storage, an instance of each essential software object along with its essential values. “Phase 2” recovery involves executing a “hydrate” method within each essential object wherein it exists. The purpose of a hydrate method is to derive all non-essential values from essential values, and thereby reestablish inter-object relationships. The hydrate method is also configured to recreate non-essential objects. Each hydrate method contains logic for handling contingencies wherein certain hardware or software may be unavailable at the time of recovery. In general, each essential object will have a customized hydration method.
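The two phase process described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the Sensor and Controller classes, the record layout (key, type, essential), the CLASSES lookup table, and the recover function are hypothetical, and a list of dictionaries stands in for the contents of persistent storage.

```python
class Controller:
    def __init__(self, controller_id):
        self.controller_id = controller_id  # essential value
        self.sensors = []                   # non-essential: rebuilt during Phase 2

    def hydrate(self, registry):
        # Nothing to derive here; sensors attach themselves when they hydrate.
        pass


class Sensor:
    def __init__(self, sensor_id, controller_id):
        self.sensor_id = sensor_id          # essential value
        self.controller_id = controller_id  # essential value
        self.controller = None              # non-essential: rebuilt during Phase 2

    def hydrate(self, registry):
        # Re-establish the inter-object relationship from the stored id.
        # Contingency: the controller may be unavailable after the failure,
        # in which case the sensor is left detached rather than failing.
        self.controller = registry.get(self.controller_id)
        if self.controller is not None:
            self.controller.sensors.append(self)


CLASSES = {"Controller": Controller, "Sensor": Sensor}


def recover(stored_records):
    # Phase 1: restore an instance of each essential object with its essential values.
    registry = {}
    for record in stored_records:
        cls = CLASSES[record["type"]]
        registry[record["key"]] = cls(**record["essential"])
    # Phase 2: let each restored object derive its non-essential values
    # and relationships from the recovered system state.
    for obj in registry.values():
        obj.hydrate(registry)
    return registry


stored = [
    {"key": "c1", "type": "Controller", "essential": {"controller_id": "c1"}},
    {"key": "s1", "type": "Sensor",
     "essential": {"sensor_id": "s1", "controller_id": "c1"}},
    {"key": "s2", "type": "Sensor",
     "essential": {"sensor_id": "s2", "controller_id": "c9"}},
]
registry = recover(stored)
```

Running hydration only after every essential object has been restored means each hydrate method can look up any other restored object, regardless of the order in which the records were read. Sensor s2 refers to a controller that was not recovered and is therefore left without one.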
The process of pickling an object can often be generically defined for all objects. Typically, this is accomplished by making a method call to a “pickle” object having a “pickle” method. In general, each essential object has its own logic to determine the timing and frequency of the calls to the pickle object.
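As one possible sketch of this arrangement, a single generic pickle object may serve every essential object, while each essential object decides for itself when to call it. The PickleStore and Counter classes, the load method, and the "every" parameter are hypothetical; an in-memory dictionary stands in for persistent storage, and JSON is assumed as the stored form.

```python
import json


class PickleStore:
    """Hypothetical generic 'pickle' object shared by all essential objects."""

    def __init__(self):
        self.records = {}  # stand-in for persistent storage

    def pickle(self, key, essential_values):
        # Serialize so that later changes to the live object
        # cannot alter the stored copy.
        self.records[key] = json.dumps(essential_values)

    def load(self, key):
        return json.loads(self.records[key])


class Counter:
    """Essential object with its own logic for when to pickle."""

    def __init__(self, key, store, every=3):
        self.key = key
        self.store = store
        self.every = every  # pickle once per this many increments
        self.count = 0

    def increment(self):
        self.count += 1
        if self.count % self.every == 0:
            self.store.pickle(self.key, {"count": self.count})
```

A Counter that pickles on every third increment bounds the work lost in a failure to at most two increments, at the cost of fewer writes than pickling on every change. Another object sharing the same PickleStore could choose a different schedule.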
The present invention is also useful in situations wherein individual objects need to be removed or updated without taking down an entire object-oriented computing environment. The removal of a particular object from a software system will often cause failure of the entire software environment. Instead, the present invention provides an architecture wherein an updated object can be restored in place of an older object, and the updated object will “hydrate” itself in an orderly progression without fatal consequences to the remainder of the software environment.