Initialization refers to the procedure of starting up a system and bringing the system to a point in its operation where it can begin performing its normal functions. In known distributed software applications, an overall software-based process/application may be separated into a number of software components that are distributed among a plurality of processors interconnected via a network. Each component performs some portion of the functionality of the overall software application.
For a system that consists of a single software component operating in isolation, initializing the system is a simple procedure. However, as the number of system components and the interactions among them increase, initialization becomes complex due to interdependencies between system components. An “interdependency” is a relationship between two or more components. When two components are interdependent, their initialization must be properly coordinated. A key challenge for a system consisting of multiple components is to initialize the system as quickly as possible while at the same time satisfying such interdependencies. This is especially important for large-scale distributed systems consisting of hundreds or even thousands of components (such as those encountered in grid computing). A failure to appreciate the complex nature of the interdependencies involved, for example by initializing components one at a time, could result in an initialization procedure that takes too long to complete.
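By way of illustration only, the coordination described above can be sketched as grouping components into successive “waves,” where every component in a wave has all of its prerequisites satisfied by earlier waves and the members of a wave may therefore be initialized concurrently. The sketch assumes that each interdependency can be expressed as “component X must be initialized after component Y”; the function name `initialization_waves` and the dictionary representation are hypothetical and are not taken from any particular system.

```python
def initialization_waves(dependencies):
    """Group components into waves whose members may be initialized in parallel.

    `dependencies` maps a component name to the set of components that must
    be initialized before it (a hypothetical representation).
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    # Include components that appear only as prerequisites of others.
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    waves = []
    done = set()
    while remaining:
        # A component is ready once all of its prerequisites are initialized.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            # No component can make progress: the interdependencies form a cycle.
            raise ValueError("cyclic interdependency among %s" % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


example = {"db": set(), "cache": set(), "api": {"db", "cache"}, "ui": {"api"}}
waves = initialization_waves(example)
```

In this example four components are initialized in three waves rather than four sequential steps; the saving grows with the number of components that share no interdependency.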
Further complicating matters is the fact that a failure may occur during initialization. Restarting an entire initialization procedure from the beginning may not be desirable because of the lengthy initialization times needed by various system components. Conversely, restarting and reinitializing only the components that failed may not lead to a successfully initialized system, because components that did not fail (so-called “fault-free” components) may depend on a failed component and thus be unable to complete their own initialization. For example, if a failed component is restarted, an interdependent fault-free component may need to establish new communication channels with the restarted component.
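As a minimal sketch of the problem just described, and assuming the same hypothetical “must be initialized after” representation of interdependencies, the fault-free components that depend directly on a failed component can be identified as follows. The function name `affected_by_failure` is illustrative only; whether indirectly dependent components must also be considered would depend on the particular system.

```python
def affected_by_failure(dependencies, failed):
    """Return fault-free components that depend directly on a failed component.

    `dependencies` maps a component name to the set of components it depends
    on; `failed` is an iterable of failed component names (both hypothetical).
    """
    failed = set(failed)
    return sorted(
        name
        for name, deps in dependencies.items()
        if name not in failed and failed & set(deps)
    )


deps = {"db": set(), "api": {"db"}, "ui": {"api"}, "log": set()}
needs_reconnect = affected_by_failure(deps, {"db"})
```

Here the component `api` did not fail, yet it would need to establish a new channel to the restarted `db` component before initialization can complete.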
So-called fault tolerance techniques (e.g., rollback recovery) may be used to recover from failures that occur during initialization. In rollback recovery, each software component performs frequent periodic checkpointing of its state and stores the state in a stable storage area. Upon failure of the software component, a backup software component accesses the stored state information and assumes that the most recent copy is correct. However, the state information may have changed between the last checkpoint and the failure. To limit this problem, software components are repeatedly interrupted during normal operation in order to save their associated state information. Frequent periodic checkpointing of software components, however, wastes time and resources, adds extra complexity to a system and imposes performance limitations. While a rollback recovery approach may be feasible in some cases, other techniques appear to be more promising.
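The rollback recovery approach described above can be illustrated with the following minimal sketch. It makes several simplifying assumptions: the stable storage area is modeled as an in-memory dictionary, checkpoints are taken after a fixed number of updates rather than on a timer, and the class and function names (`StableStorage`, `Component`, `recover`) are hypothetical.

```python
import copy


class StableStorage:
    """In-memory stand-in for a stable storage area (an assumption)."""

    def __init__(self):
        self._snapshots = {}

    def save(self, component_id, state):
        # Store a copy so later changes to the live state do not alter it.
        self._snapshots[component_id] = copy.deepcopy(state)

    def load(self, component_id):
        return copy.deepcopy(self._snapshots[component_id])


class Component:
    def __init__(self, component_id, storage, checkpoint_every=3):
        self.component_id = component_id
        self.storage = storage
        self.checkpoint_every = checkpoint_every
        self.state = {"counter": 0}
        self._updates_since_checkpoint = 0

    def update(self):
        self.state["counter"] += 1
        self._updates_since_checkpoint += 1
        # Normal operation is interrupted here to save the state.
        if self._updates_since_checkpoint >= self.checkpoint_every:
            self.storage.save(self.component_id, self.state)
            self._updates_since_checkpoint = 0


def recover(component_id, storage):
    """Start a backup component from the most recent checkpoint."""
    backup = Component(component_id, storage)
    backup.state = storage.load(component_id)
    return backup


storage = StableStorage()
primary = Component("worker-1", storage)
for _ in range(5):
    primary.update()
backup = recover("worker-1", storage)
```

In this run the primary reaches a counter value of 5 but the backup resumes from 3, the value at the last checkpoint. Shrinking that gap requires more frequent checkpoints, which is the overhead noted above.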
Another technique uses characteristics unique to initialization to optimize recovery. All state information introduced into a component during initialization is either derived from hard-coded or other persistent information, such as configuration information stored in a database, or is determined from actions that take place during initialization (e.g., obtaining the handle to a communication channel involving another component). This state information can therefore easily be recreated should the component fail, and preserving it (e.g., via checkpointing) is not required. In some cases, however, the recreated state information may be different from the original, and undo operations (e.g., closing a broken communication channel) are still required.
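A minimal sketch of this second technique follows, under the assumption that a component's initialization state consists of settings copied from persistent configuration plus one communication channel. The `Channel` class and the functions `initialize` and `recover_initialization` are hypothetical and serve only to show that no checkpoint is consulted, that the recreated handle may differ from the original, and that an undo operation is still performed.

```python
class Channel:
    """Toy communication channel; handles are assigned from a counter."""

    _next_handle = 0

    def __init__(self, peer):
        Channel._next_handle += 1
        self.handle = Channel._next_handle
        self.peer = peer
        self.open = True

    def close(self):
        self.open = False


def initialize(config, peer):
    """Build initialization state from persistent config and a new channel."""
    return {"settings": dict(config), "channel": Channel(peer)}


def recover_initialization(old_state, config, peer):
    """Recreate state after a failure instead of restoring a checkpoint."""
    # Undo operation: close the channel left over from the failed attempt.
    if old_state is not None and old_state["channel"].open:
        old_state["channel"].close()
    # Recreate state from the same persistent configuration.
    return initialize(config, peer)


config = {"timeout_s": 30}
first = initialize(config, "component-B")
second = recover_initialization(first, config, "component-B")
```

The settings in `second` match those in `first` because both derive from the same configuration, whereas the channel handle is new, which is one way the recreated state can differ from the original.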
In sum, it is desirable to provide techniques that enable quick recovery from initialization failures.
It is also desirable to provide techniques that take advantage of characteristics unique to initialization in order to enable such quick recovery.