This invention relates, in general, to managing resources of a computing environment, and in particular, to managing the checkpointing/restarting of the resources of the computing environment.
The ability to recover from failures within a computing environment is of paramount importance to the users of that environment. Thus, steps have been taken to facilitate the recovery from failures.
One technique currently provided to facilitate the recovery from failures is to periodically take checkpoints of the resources of the computing environment. In particular, at certain times, each resource saves its current state, so that the state is available if it is needed to recover from a failure. The manner in which the checkpoint is taken by the resource is resource specific.
Thereafter, if a failure occurs requiring one or more resources to be restarted, the restarted resources bring themselves back to the state they were in when the checkpoints were taken. This provides a mechanism to recover from failures.
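As an illustration only, the checkpoint and restart behavior described above can be sketched for a single resource. The class name CounterResource, its methods, and the use of an in-memory deep copy are assumptions made for this sketch; an actual resource would save its state in whatever resource-specific manner is appropriate.

```python
import copy


class CounterResource:
    """Hypothetical resource whose entire state is one dictionary."""

    def __init__(self):
        self.state = {"processed": 0}
        self._saved = None  # most recent checkpoint, if any

    def work(self, n):
        # Normal operation changes the resource's state.
        self.state["processed"] += n

    def take_checkpoint(self):
        # Resource specific: this resource simply deep-copies its state.
        self._saved = copy.deepcopy(self.state)

    def restart(self):
        # Return to the state held when the checkpoint was taken.
        if self._saved is None:
            raise RuntimeError("no checkpoint available")
        self.state = copy.deepcopy(self._saved)


resource = CounterResource()
resource.work(5)
resource.take_checkpoint()
resource.work(3)      # progress made after the checkpoint
resource.restart()    # simulated failure followed by recovery
```

In this sketch, the work performed after the checkpoint is lost on restart, and the resource resumes from the saved state.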
Although some recovery mechanisms exist today, a need still exists for mechanisms that improve the management of resources within the computing environment. In particular, a need exists for a capability that better manages the checkpointing and restarting of resources.
A further need exists for a capability that separates the decision to checkpoint/restart from the initiating and/or performing of the checkpoint/restart. Further, a need exists for a checkpoint/restart capability that is suitable for distributed environments, including heterogeneous environments.
A yet further need exists for a capability that cleans up checkpoint information that is no longer desired.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of managing the checkpointing of resources of a computing environment. The method includes, for instance, determining, by a first entity of the computing environment, that a checkpoint of a resource of the computing environment is to be taken; and initiating the taking of the checkpoint of the resource by a second entity of the computing environment.
In one embodiment, the first entity has no knowledge of implementation details associated with initiating the taking of the checkpoint.
In a further embodiment, the second entity is informed of the determination to checkpoint the resource by the first entity invoking an interface of the second entity indicative of the determination to take a checkpoint.
In a further embodiment, the checkpoint is used to restart the resource. As one example, the first entity makes the determination to restart the resource and forwards this determination to the second entity by invoking an interface of the second entity indicative of the determination to restart.
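By way of a hedged sketch, the separation between the first entity and the second entity might look as follows. The names DeterminingEntity, InitiatingEntity, and ListResource, the method names, and the fixed-interval policy are all assumptions introduced for illustration; the text above requires only that the first entity invoke an interface of the second entity.

```python
class ListResource:
    """Hypothetical resource; its state is a list of items."""

    def __init__(self):
        self.items = []
        self._saved = []

    def take_checkpoint(self):
        self._saved = list(self.items)

    def restart(self):
        self.items = list(self._saved)


class InitiatingEntity:
    """Hypothetical second entity: knows how to start a checkpoint or restart."""

    def __init__(self, resources):
        self._resources = resources  # resource name -> resource object
        self.log = []

    def checkpoint(self, name):
        # Interface indicating a determination to take a checkpoint.
        self.log.append(("checkpoint", name))
        self._resources[name].take_checkpoint()

    def restart(self, name):
        # Interface indicating a determination to restart.
        self.log.append(("restart", name))
        self._resources[name].restart()


class DeterminingEntity:
    """Hypothetical first entity: decides when, never how."""

    def __init__(self, initiator, interval):
        self._initiator = initiator
        self._interval = interval  # assumed policy: checkpoint every N ticks

    def on_tick(self, tick, name):
        if tick % self._interval == 0:
            self._initiator.checkpoint(name)

    def on_failure(self, name):
        self._initiator.restart(name)


res = ListResource()
initiator = InitiatingEntity({"db": res})
decider = DeterminingEntity(initiator, interval=2)
for tick in range(1, 4):
    res.items.append(tick)
    decider.on_tick(tick, "db")
decider.on_failure("db")
```

Here the determining entity holds no reference to the resource itself; it reaches the resource only through the two interfaces of the initiating entity.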
In another embodiment, a plurality of checkpoints of a plurality of resources is initiated by at least one second entity.
In one embodiment, at least one resource of the plurality of resources is executing on a computing node of the computing environment having a first operating system, and at least one other resource of the plurality of resources is executing on another computing node of the computing environment having a second operating system, which is different from the first operating system.
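One possible way to picture the plurality of resources on differing nodes is sketched below. The class NodeInitiator, the function checkpoint_all, and the placeholder operating system labels os_a and os_b are hypothetical; the sketch assumes one initiating entity per computing node, which is only one arrangement consistent with the text.

```python
class NodeInitiator:
    """Hypothetical initiating entity bound to one computing node."""

    def __init__(self, node, operating_system, resources):
        self.node = node
        self.operating_system = operating_system
        self.resources = list(resources)

    def checkpoint(self, resource):
        # Platform-specific work would happen here; the caller never sees it.
        return (self.node, self.operating_system, resource)


def checkpoint_all(initiators):
    """Determining-entity side: request a checkpoint of every resource."""
    taken = []
    for initiator in initiators:
        for resource in initiator.resources:
            taken.append(initiator.checkpoint(resource))
    return taken


initiators = [
    NodeInitiator("node1", "os_a", ["db", "cache"]),
    NodeInitiator("node2", "os_b", ["queue"]),
]
taken = checkpoint_all(initiators)
```

Because the caller uses the same interface on every node, differences between the operating systems remain confined to each initiating entity.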
In another aspect of the present invention, a method of managing the restarting of resources of a computing environment is provided. The method includes, for instance, determining, by a first entity of the computing environment, that a resource of the computing environment is to be restarted; and initiating the restarting of the resource by a second entity of the computing environment.
In another aspect of the present invention, a system of managing the checkpointing of resources of a computing environment is provided. The system includes, for instance, a first entity of the computing environment adapted to determine that a checkpoint of a resource of the computing environment is to be taken; and a second entity of the computing environment adapted to initiate the taking of the checkpoint of the resource.
In a further aspect of the present invention, a system of managing the restarting of resources of a computing environment is provided. The system includes, for example, a first entity of the computing environment being adapted to determine that a resource of the computing environment is to be restarted; and a second entity of the computing environment being adapted to initiate the restarting of the resource.
In yet another aspect of the present invention, an article of manufacture including at least one computer usable medium having computer readable program code means embodied therein for causing the managing of the checkpointing of the resources of the computing environment is provided. The computer readable program code means in the article of manufacture includes, for instance, computer readable program code means for causing a computer to determine, by a first entity of the computing environment, that a checkpoint of a resource of the computing environment is to be taken; and computer readable program code means for causing a computer to initiate the taking of the checkpoint of the resource by a second entity of the computing environment.
In a further aspect of the present invention, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of managing the restarting of resources of a computing environment is provided. The method includes, for example, determining, by a first entity of the computing environment, that a resource of the computing environment is to be restarted; and initiating the restarting of the resource by a second entity of the computing environment.
In accordance with the principles of the present invention, capabilities are provided for managing the checkpointing and restarting of resources of a computing environment, such as a homogeneous or heterogeneous distributed computing environment. Advantageously, an entity of the computing environment, other than the entity initiating or performing the checkpoint/restart, is responsible for determining when a checkpoint or restart is to be performed. The entity making this determination (i.e., the determining entity) need not know how to initiate or perform the checkpoint/restart. The entity responsible for initiating the checkpoint/restart provides interfaces to the determining entity, which the determining entity uses to notify the initiating entity of when to proceed.
In addition to the above, a capability is advantageously provided to clean up old checkpoint information that is no longer desired. This cleanup is initiated by the determining entity.
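The cleanup capability can be sketched in the same hedged manner. The class CheckpointStore, its cleanup interface, and the retention rule of keeping only the newest checkpoints are assumptions for illustration; the text states only that cleanup of undesired checkpoint information is initiated by the determining entity.

```python
class CheckpointStore:
    """Hypothetical initiating-side store of checkpoints per resource."""

    def __init__(self):
        self._checkpoints = {}  # name -> list of (sequence, state), oldest first
        self._next_seq = 0

    def checkpoint(self, name, state):
        # Record a copy of the state under a new sequence number.
        self._next_seq += 1
        entry = (self._next_seq, dict(state))
        self._checkpoints.setdefault(name, []).append(entry)
        return self._next_seq

    def cleanup(self, name, keep):
        # Discard all but the newest `keep` checkpoints (assumed policy).
        if keep < 1:
            raise ValueError("keep must be at least 1")
        entries = self._checkpoints.get(name, [])
        removed = entries[:-keep]
        self._checkpoints[name] = entries[-keep:]
        return [seq for seq, _ in removed]

    def sequences(self, name):
        return [seq for seq, _ in self._checkpoints.get(name, [])]


store = CheckpointStore()
for count in (10, 20, 30):
    store.checkpoint("db", {"processed": count})

# The determining entity decides that older checkpoints are no longer desired.
removed = store.cleanup("db", keep=1)
```

In this sketch the decision of how many checkpoints to retain stays with the caller, while the store alone knows how checkpoint information is held and discarded.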
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.