1. Technical Field
The present invention relates in general to a method and system for establishing persistent reserves in a clustered computer environment.
2. Description of the Related Art
Server computer systems are used to provide many functions to computer networks. On the Internet, server computers are used to host web sites that provide users with an array of services, including electronic shopping, consumer information, reference materials, communication with other Internet users, and scores of other uses.
Users of online servers, both Internet-accessible and private (intranet-accessible), demand high availability of the data and programs provided by these servers. Nonvolatile storage devices include mass storage devices such as hard disks, magneto-optical drives, and storage area networks (SANs). Nonvolatile storage devices provide a repository for data and programs used by server computers.
In order to provide high availability, multiple server computers are often clustered to provide redundant, or backup, servers in case a server fails. Each of the multiple servers, or nodes, can access the nonvolatile storage device that is shared among the servers in the cluster. However, having more than one node simultaneously write to a common nonvolatile storage device may introduce data corruption and other failures on the nonvolatile storage device. To prevent corrupting data on the nonvolatile storage device, a persistent reserve is created on the nonvolatile storage device.
The persistent reserve is a means of reserving a nonvolatile storage device for a particular node in the cluster. One method of establishing a persistent reserve is by using the Small Computer System Interface (SCSI). SCSI provides a protocol and a set of commands for establishing a persistent reserve.
In a clustered environment, one node establishes a disk reserve, thereby reserving a nonvolatile storage device. The node with the reserve prevents other nodes from accidentally writing to the device. However, if the first node fails, a backup node is able to break the first node's reservation and reserve the nonvolatile storage device for itself. The backup node determines whether the primary is operational by listening for a signal, sometimes called a “heartbeat,” that is sent by the primary computer. In this manner, service from the server is uninterrupted from the perspective of an end user. While the prior art provides redundancy and some level of reserves, challenges still face the clustered environment in providing fail over support.
When the primary server fails and is subsequently reinitialized, it attempts to resume control of the nonvolatile storage device. The primary server breaks the backup server's reserve held on the nonvolatile storage device and resets the primary reserve. The backup server, meanwhile, has been set to act as the new primary server (since the first primary server failed), causing the backup server to once again break the primary server's reserve and again reset the primary reserve. The primary and backup servers can continue to thrash for control of the nonvolatile storage device, decreasing system throughput and efficiency.
In addition, some computer systems, such as non-uniform memory architecture (NUMA) computer systems, have multiple paths to the nonvolatile storage device. These paths include processors and corresponding memory areas. To improve performance, each of the paths is connected to the nonvolatile storage device across a separate connection. A challenge with the prior art is that establishing a disk reserve allows only one of the two or more paths to operate at a time. To allow the paths to operate simultaneously, the nonvolatile storage device can be opened without reserving the device. However, as discussed previously, this may result in multiple nodes writing to the nonvolatile storage device and corrupting the data.
For further information regarding persistent reserves in a SCSI environment, see the T10 homepage (www.t10.org). T10 is a Technical Committee of the National Committee for Information Technology Standards. Documents specific to persistent reserves using the SCSI-3 protocol can be found in the T10 Document Proposals section (www.t10.org/doc98.htm) of the web site. Persistent Reserve documents in the section include “SPC-2, Persistent Reservation: Additional proposed corrections,” (Doc. Nos. 98-124R0 through R2), “Clarification of Persistant Reservation,” (Doc. No. 98-140R0), “Persistent Reservations,” (Doc. Nos. 98-203R0 through R0), as well as other information generally found throughout the T10 web site.
It has been discovered that creating a reserve based on a key that includes a computer identifier, which identifies the computer holding the reserve, allows the computer to access the nonvolatile storage device using more than one path. In addition, the identifier is used to prevent a reinitialized server from inadvertently breaking a backup server's reserve, thus preventing the reinitialized server and the backup server from thrashing for control of the nonvolatile storage device.
The first server writes a reservation key to the reservation storage area. The first server's reservation key identifies the first server as having the reservation to the nonvolatile storage device. If the first server (or any subsequent server in control of the nonvolatile storage device) has multiple paths to the nonvolatile storage device, each of the paths uses the reservation key, allowing each path to access and write to the nonvolatile storage device. In one embodiment, writing a reservation key to the reservation storage area includes registering the key with the device and establishing a reserve on the device, both of which the server accomplishes in a single step.
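The multi-path behavior described above can be illustrated with a minimal in-memory sketch. It is not a SCSI implementation: the `Device` class, the path names, and the key strings are hypothetical and chosen only for illustration. The sketch assumes that every path of one server registers the same server-identifying key, and that a write is accepted only from a path whose registered key matches the key holding the reserve.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Device:
    """Toy stand-in for a shared disk with a reservation storage area (hypothetical)."""
    registered: dict = field(default_factory=dict)  # path name -> key registered on that path
    reservation_key: Optional[str] = None           # key that currently holds the reserve
    blocks: dict = field(default_factory=dict)      # block number -> data

    def register_and_reserve(self, path: str, key: str) -> bool:
        """Register `key` for `path` and take the reserve in one step.

        Succeeds only if the device is unreserved or already reserved with the same key.
        """
        if self.reservation_key not in (None, key):
            return False  # another server holds the reserve
        self.registered[path] = key
        self.reservation_key = key
        return True

    def write(self, path: str, block: int, data: bytes) -> bool:
        """Accept the write only if the path's key matches the key holding the reserve."""
        if self.reservation_key is None:
            return False
        if self.registered.get(path) != self.reservation_key:
            return False
        self.blocks[block] = data
        return True


disk = Device()
key_a = "server-A"  # key embeds the server identifier (assumed format)

# Both paths of server A register the same key, so both may write.
disk.register_and_reserve("A-path0", key_a)
disk.register_and_reserve("A-path1", key_a)
ok_path0 = disk.write("A-path0", 0, b"from path 0")
ok_path1 = disk.write("A-path1", 1, b"from path 1")

# A different server cannot reserve or write while server A holds the reserve.
b_reserved = disk.register_and_reserve("B-path0", "server-B")
b_wrote = disk.write("B-path0", 2, b"from server B")
```

Because the key identifies the server rather than a single connection, the reserve does not have to be dropped to let a second path operate.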
The first server sends a signal, or “heartbeat”, across a network or other connection to one or more backup servers. So long as the signal is received by the backup servers, the backup servers do not attempt to break the first server's reserve and write to the nonvolatile storage device. However, when the signal is terminated, one of the backup servers breaks the first server's reserve and overwrites the reservation storage area with a key identifying the backup server as the server reserving the nonvolatile storage device. When the first server is reinitialized, it reads the reservation storage area and determines that the backup server now has reserved the nonvolatile storage device. In one embodiment, namely a SCSI environment or a SCSI protocol performed in a Fibre Channel environment, the process of breaking the reserve includes registering a new key, revoking the prior reservation, and establishing a new reserve, all of which are accomplished in a single step.
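The heartbeat-driven takeover can be sketched as follows. This is an illustrative model under stated assumptions: `ReservationArea`, `HeartbeatMonitor`, `backup_tick`, the three-second timeout, and the key strings are all hypothetical names and values, and the single `preempt` call merely stands in for the register, revoke, and reserve sequence that the text describes as one step. A fake clock is injected so the example runs without waiting.

```python
import time


class ReservationArea:
    """Toy reservation storage area holding the key of the reserving server."""

    def __init__(self, key=None):
        self.key = key

    def preempt(self, new_key):
        """Register the new key, revoke the prior reserve, and reserve, as one operation."""
        previous = self.key
        self.key = new_key
        return previous


class HeartbeatMonitor:
    """Tracks when the last heartbeat arrived and reports when it is overdue."""

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self._now = now
        self.last_seen = now()

    def beat(self):
        self.last_seen = self._now()

    def expired(self):
        return self._now() - self.last_seen > self.timeout_s


def backup_tick(area, monitor, my_key):
    """Called periodically on the backup; takes over only after the heartbeat stops."""
    if monitor.expired() and area.key != my_key:
        area.preempt(my_key)
        return True
    return False


# Demonstration with a controllable clock (seconds).
clock = [0.0]
area = ReservationArea(key="primary")
monitor = HeartbeatMonitor(timeout_s=3.0, now=lambda: clock[0])

clock[0] = 2.0
monitor.beat()                                        # heartbeat received at t=2
took_over_early = backup_tick(area, monitor, "backup")

clock[0] = 6.0                                        # no heartbeat since t=2
took_over_late = backup_tick(area, monitor, "backup")
```

While heartbeats keep arriving, `backup_tick` leaves the reserve alone; only a silent interval longer than the timeout triggers the preempt.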
In one embodiment, the first server sends a message to the backup server informing the backup server that the first server is once again operational. In this embodiment, an orderly switch is made replacing the backup server's reservation with the first server's reservation key. This embodiment is useful when the first server has better processing capabilities than the backup server.
In another embodiment, when the first server is reinitialized, it reads the reservation storage area and determines that the backup server has reserved the nonvolatile storage device. In this embodiment, the first server assumes a backup role and listens for a signal, or heartbeat, being sent by the backup server. When the backup server's signal terminates, indicating that the backup server is no longer operational, the first server breaks the reservation and once again reserves the nonvolatile storage device for itself.
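The role decision made by a reinitialized server in this embodiment can be sketched as a small state transition. The names `Role`, `role_on_startup`, and `on_peer_heartbeat_lost`, and the dictionary standing in for the reservation storage area, are hypothetical. The sketch also assumes that a server finding the area empty, or still holding its own key, may take the primary role; the text does not address those two cases.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


def role_on_startup(area: dict, my_key: str) -> Role:
    """Read the reservation area at startup and choose a role without breaking a peer's reserve."""
    holder = area.get("key")
    if holder is None or holder == my_key:
        # Assumption: an unreserved device, or one still holding our key, is ours to reserve.
        area["key"] = my_key
        return Role.PRIMARY
    # Another server holds the reserve: stand by instead of breaking it.
    return Role.BACKUP


def on_peer_heartbeat_lost(area: dict, my_key: str, role: Role) -> Role:
    """When the reserving peer's heartbeat stops, a standby server takes the reserve."""
    if role is Role.BACKUP:
        area["key"] = my_key
        return Role.PRIMARY
    return role


# Server A restarts after server B took over during A's outage.
area = {"key": "server-B"}
role_after_restart = role_on_startup(area, "server-A")
key_after_restart = area["key"]

# Later, server B's heartbeat terminates.
role_after_loss = on_peer_heartbeat_lost(area, "server-A", role_after_restart)
```

Because the restarted server consults the key before acting, it never preempts a live peer, which is what avoids the thrashing described in the related art.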
In yet another embodiment, multiple backup servers provide backup support. When the primary server fails, multiple backup servers may attempt to take the place of the failed primary server. The first backup server that compares and successfully matches the reservation key that was owned by the primary server with the reservation key stored in the reservation storage area breaks the reserve and establishes its own reserve to the nonvolatile storage device. Thereafter, other backup servers, from the same or different clusters, compare the key with the storage area and no longer receive successful matches because the first backup server has already established a new reserve. Because the match is unsuccessful, the other backup servers do not break the reserve that has been established.
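The compare-then-break behavior among several backup servers resembles a compare-and-swap on the reservation key. The following sketch models it with threads; `ReservationArea`, `preempt_if_holder`, and the server names are hypothetical, and the lock is only an assumed stand-in for the storage device processing one reservation command at a time.

```python
import threading


class ReservationArea:
    """Toy reservation storage area with an atomic compare-and-preempt operation."""

    def __init__(self, key):
        self._key = key
        self._lock = threading.Lock()  # models the device serializing reservation commands

    def preempt_if_holder(self, expected_key, new_key):
        """Replace the reserve only if it is still held under `expected_key`."""
        with self._lock:
            if self._key != expected_key:
                return False  # someone else already took over; leave their reserve intact
            self._key = new_key
            return True

    @property
    def key(self):
        with self._lock:
            return self._key


area = ReservationArea("primary")
results = {}


def try_takeover(name):
    # Each backup remembers the failed primary's key and compares against it.
    results[name] = area.preempt_if_holder("primary", name)


threads = [
    threading.Thread(target=try_takeover, args=(name,))
    for name in ("backup-1", "backup-2", "backup-3")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [name for name, ok in results.items() if ok]
```

Whichever backup reaches the device first wins; the others see a key that no longer matches the failed primary's key and back off without breaking the new reserve.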
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.