1. Field of the Invention
Embodiments of the present invention generally relate to data storage systems, and more particularly, to a method and apparatus for performing backup storage of checkpoint data.
2. Description of the Related Art
Modern computer networks generally comprise a plurality of user computers connected to one another and to a computer server via a communications network. To provide redundancy and high availability of information and applications that are executed upon a computer server, multiple computer servers may be arranged in a cluster, i.e., forming a server cluster. Such server clusters are available under the trademark VERITAS CLUSTER SERVER from Veritas Software Corporation of Mountain View, Calif. In a server cluster, the plurality of servers communicate with one another to facilitate failover redundancy such that when software or hardware (i.e., computer resources) becomes inoperative on one server, another server can quickly execute the same software that was running on the inoperative server substantially without interruption. As such, a user of services that are supported by a server cluster would not be substantially impacted by an inoperative server or software.
To facilitate the substantially seamless transition of user service to another server within the server cluster, the production server, i.e., the server that is presently supporting users of the server services, stores checkpoint data in random access memory (RAM). This checkpoint data is essentially the data being used by the software at particular times, together with the server state at those times. The backup software within the production server takes a “snapshot” of the data and the server state, then stores that information as checkpoint data. To create redundancy, the checkpoint data is remotely stored on a backup server, i.e., another server in the server cluster. Upon failure of the software or the production server, the software is booted on the backup server, and the checkpoint data can be used to start the software at approximately the position within the software where the failure occurred.
Upon failover, the backup server becomes the production server from the view of the user without substantial interruption of the software utilization. Thus, upon failover to the backup server, the software resumes execution from the last saved state and uses the stored data associated with that saved state.
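The checkpoint-and-failover sequence described above can be illustrated with a minimal sketch. The sketch is illustrative only and is not taken from any particular product; the names Checkpoint, Server, replicate_checkpoint, and fail_over are hypothetical, and servers are modeled as in-memory objects rather than networked machines.

```python
import copy


class Checkpoint:
    """Snapshot of application data and server state at one point in time."""

    def __init__(self, sequence, data, state):
        self.sequence = sequence
        self.data = data
        self.state = state


class Server:
    """Hypothetical in-memory model of one server in the cluster."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}    # data being used by the software
        self.state = {}   # server state (e.g., position within the software)
        self.stored_checkpoint = None  # checkpoint held on behalf of a peer

    def take_checkpoint(self, sequence):
        # Deep copies ensure later changes do not alter the snapshot.
        return Checkpoint(sequence,
                          copy.deepcopy(self.data),
                          copy.deepcopy(self.state))

    def restore(self, checkpoint):
        self.data = copy.deepcopy(checkpoint.data)
        self.state = copy.deepcopy(checkpoint.state)


def replicate_checkpoint(production, backup, sequence):
    """Snapshot the production server and store the result on the backup."""
    checkpoint = production.take_checkpoint(sequence)
    backup.stored_checkpoint = checkpoint
    return checkpoint


def fail_over(production, backup):
    """Resume on the backup server from the last stored checkpoint."""
    if production.alive:
        raise RuntimeError("production server is still running")
    if not backup.alive or backup.stored_checkpoint is None:
        # With a single backup, a simultaneous failure loses the checkpoint.
        raise RuntimeError("no surviving checkpoint to restore from")
    backup.restore(backup.stored_checkpoint)
    return backup  # the backup now acts as the production server
```

In this sketch, any work performed after the most recent call to replicate_checkpoint is not recovered, which corresponds to restarting the software at approximately, rather than exactly, the position where the failure occurred.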
The storage of the checkpoint data on a single backup server or node within the server cluster can be problematic: if the production server and the backup server fail simultaneously, the checkpoint data is lost and the software cannot be restarted from its last saved state.
Therefore, there is a need in the art for a method and apparatus for improving the fault tolerance of the storage of checkpoint data within a server cluster.