The present disclosure relates in general to the field of computer networks, and, more particularly, to a system and method for providing backup server service in a multi-computer environment.
A cluster is a parallel or distributed system that comprises a collection of interconnected computer systems or servers that is used as a single, unified computing unit. Members of a cluster are referred to as nodes or systems. The cluster service is the collection of software on each node that manages cluster-related activity. The cluster service sees all resources as identical objects. Resources may include physical hardware devices, such as disk drives and network cards, or logical items, such as logical disk volumes, TCP/IP addresses, entire applications and databases, among other examples. A group is a collection of resources to be managed as a single unit. Generally, a group contains all of the components that are necessary for running a specific application and allowing a user to connect to the service provided by the application. Operations performed on a group typically affect all resources contained within that group. By coupling two or more servers together, clustering increases the system availability, performance, and capacity for network systems and applications.
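The relationship between resources and groups described above can be illustrated with a minimal sketch. The class names, field names, and the sample resources below are hypothetical and are not taken from any particular cluster service; the sketch only shows that an operation on a group is applied to every resource the group contains.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """One managed object, physical or logical (hypothetical model)."""
    name: str
    kind: str            # e.g. "disk", "ip_address", "application"
    online: bool = False


@dataclass
class Group:
    """A collection of resources managed as a single unit."""
    name: str
    resources: List[Resource] = field(default_factory=list)

    def set_online(self, online: bool) -> None:
        # An operation on the group affects every resource it contains.
        for resource in self.resources:
            resource.online = online

    def is_online(self) -> bool:
        # The group counts as online only if it is non-empty and all
        # of its resources are online.
        return bool(self.resources) and all(r.online for r in self.resources)


# A group holding everything needed to run and reach one application.
web_group = Group("web", [
    Resource("volume-1", "disk"),
    Resource("192.0.2.5", "ip_address"),
    Resource("web-server", "application"),
])
web_group.set_online(True)
```

Treating the group, rather than the individual resource, as the unit of operation is what later allows an entire application and its address to be moved between systems in one step.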
Clustering may be used for parallel processing or parallel computing to simultaneously use two or more CPUs to execute an application or program. Clustering is a popular strategy for implementing parallel processing applications because it allows system administrators to leverage already existing computers and workstations. Because it is difficult to predict the number of requests that will be issued to a networked server, clustering is also useful for load balancing to distribute processing and communications activity evenly across a network system so that no single server is overwhelmed. If one server is running the risk of being swamped, requests may be forwarded to another clustered server with greater capacity. For example, busy Web sites may employ two or more clustered Web servers to implement a load balancing scheme. Clustering also provides for increased scalability by allowing new components to be added as the system load increases. In addition, clustering simplifies the management of groups of systems and their applications by allowing the system administrator to manage an entire group as a single system. Clustering may also be used to increase the fault tolerance of a network system. If one server suffers an unexpected software or hardware failure, another clustered server may assume the operations of the failed server. Thus, if any hardware or software component in the system fails, the user might experience a performance penalty, but will not lose access to the service.
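As an illustration of forwarding requests to the clustered server with greater capacity, the following sketch selects the server with the most spare capacity. The function name, the server names, and the request counts are assumptions made for this example only; real load balancing schemes may use other criteria.

```python
from typing import Dict


def pick_server(active_requests: Dict[str, int], capacity: Dict[str, int]) -> str:
    """Return the server with the most remaining capacity (hypothetical policy)."""
    # Remaining capacity = what the server can handle minus what it is handling.
    return max(capacity, key=lambda s: capacity[s] - active_requests.get(s, 0))


capacity = {"web-1": 100, "web-2": 100}
active_requests = {"web-1": 95, "web-2": 40}

# web-1 is close to being swamped, so the next request goes to web-2.
target = pick_server(active_requests, capacity)
```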
Current cluster services include Microsoft Cluster Server (MSCS), designed by Microsoft Corporation for clustering its Windows NT 4.0 and Windows 2000 Advanced Server operating systems, and Novell Netware Cluster Services (NWCS), among other examples. For instance, MSCS currently supports the clustering of two NT servers to provide a single highly available server. Generally, Windows NT clusters are “shared nothing” clusters. While several systems in the cluster may have access to a given device or resource, it is effectively owned and managed by a single system at a time. Services in a Windows NT cluster are presented to the user as virtual servers. From the user's standpoint, the user is connecting to an actual physical system. In fact, the user is connecting to a service which may be provided by one of several systems. Users create a TCP/IP session with a service in the cluster using a known IP address. This address appears to the cluster software as a resource in the same group as the application providing the service.
In order to detect system failures, clustered servers may use a heartbeat mechanism to monitor the health of each other. A heartbeat is a periodic signal that is sent by one clustered server to another clustered server. A heartbeat link is typically maintained over a fast Ethernet connection, private LAN or similar network. A system failure is detected when a clustered server is unable to respond to a heartbeat sent by another server. In the event of failure, the cluster service will transfer the entire resource group to another system. Typically, the client application will detect a failure in the session and reconnect in the same manner as the original connection. The IP address is now available on another machine and the connection will be re-established. For example, if two clustered servers that share external storage are connected by a heartbeat link and one of the servers fails, then the other server will assume the failed server's storage, resume network services, take IP addresses, and restart any registered applications.
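The failure detection described above can be sketched as follows. This is a simplified model under stated assumptions: the class name, the timeout value, and the use of explicit timestamps are hypothetical, and a server is treated as failed when no heartbeat response has been recorded from it within the timeout. An actual heartbeat link would exchange signals over a network connection rather than through method calls.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat response seen from each server (hypothetical)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, server: str, now: float) -> None:
        # Called whenever a server responds to a heartbeat.
        self.last_seen[server] = now

    def failed_servers(self, now: float) -> List[str]:
        # A server is considered failed if it has been silent for longer
        # than the timeout.
        return sorted(
            server for server, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat("server-a", now=0.0)
monitor.record_heartbeat("server-b", now=0.0)
monitor.record_heartbeat("server-a", now=2.0)   # server-b stays silent

failed = monitor.failed_servers(now=4.0)
```

Passing the current time in explicitly keeps the sketch deterministic; a deployed monitor would more likely read a monotonic clock.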
Clustering may also be implemented in computer networks utilizing storage area networks (SAN) and similar networking environments. SANs allow storage systems to be shared among multiple clusters and/or servers. The storage devices in a SAN may be structured in a RAID configuration. When a system administrator configures a shared data storage pool into a SAN, the storage devices may be grouped together into one or more RAID volumes and each volume is assigned a SCSI logical unit number (LUN) address. If the storage devices are not grouped into RAID volumes, each storage device will typically be assigned its own target ID or LUN. The system administrator or the operating system for the network will assign a volume or storage device and its corresponding LUN to each server of the computer network. Each server will then have, from a memory management standpoint, logical ownership of a particular LUN and will store the data generated from that server in the volume or storage device corresponding to the LUN owned by the server. In order to avoid the problem of data corruption that results from access conflicts, conventional storage consolidation software manages the LUNs to ensure that each storage device is assigned to a particular server in a manner that does not risk an access conflict. For example, storage consolidation software may utilize LUN masking software to ensure that each server sees only a limited number of available devices on the network.
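The ownership rule enforced by LUN masking can be sketched as a table that maps each LUN to exactly one server. The class and method names below are hypothetical and do not correspond to any specific storage consolidation product; the sketch only shows that a conflicting assignment is refused and that each server sees only the LUNs it owns.

```python
from typing import Dict, List


class LunMaskingTable:
    """Maps each LUN to the single server allowed to access it (hypothetical)."""

    def __init__(self) -> None:
        self._owner: Dict[int, str] = {}

    def assign(self, lun: int, server: str) -> None:
        # Refuse an assignment that would let two servers share one LUN,
        # since that is the situation that risks an access conflict.
        current = self._owner.get(lun)
        if current is not None and current != server:
            raise ValueError(f"LUN {lun} is already owned by {current}")
        self._owner[lun] = server

    def visible_luns(self, server: str) -> List[int]:
        # Masking: a server sees only the LUNs assigned to it.
        return sorted(lun for lun, owner in self._owner.items() if owner == server)


table = LunMaskingTable()
table.assign(0, "server-a")
table.assign(1, "server-a")
table.assign(2, "server-b")

try:
    table.assign(2, "server-a")     # would create an access conflict
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```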
If a server fails, it is desirable to recover from the failure in a fast and economical manner that does not disrupt the other servers connected in the SAN. One method involves designating a spare or hot standby server. Several manual steps are required for integrating the spare server into the SAN in the event of a failure. For example, the IP address and NetBIOS network name of the spare server must generally be reconfigured to match that of the failing server. The spare server is then connected to the SAN and brought online. Next, the storage consolidation software associated with the SAN must be reconfigured to allow the spare server access to the data on the SAN's storage devices. In addition to requiring manual intervention, the use of a spare server also requires an additional server that is not being utilized for useful work. In addition, this method provides only a fair recovery time and the cost of an additional server is somewhat prohibitive. Another approach is to troubleshoot and fix the failure in the field. The recovery time varies depending on the failure and may take a long time. For example, if the boot disk fails, the disk must be replaced and the OS needs to be reinstalled. If there is a hardware failure, the server needs to be offline until the troubleshooting is completed and the faulty component is replaced. As discussed above, another method for providing a fast recovery time from a server failure is to implement MSCS cluster software. Unfortunately, while this method provides an excellent recovery time, this method requires installing Windows NT 4.0 Enterprise Edition or Windows 2000 Advanced Server on every node. Because this software is costly and because conventional networks tend to utilize a large number of nodes, this solution is very expensive.
In accordance with teachings of the present disclosure, a system and method for providing backup server service in a multi-computer environment are disclosed that provide significant advantages over prior developed systems.
The present invention utilizes a cluster in a SAN storage consolidation group consisting of several stand-alone, non-clustered servers, wherein the cluster also serves as the spare server. This cluster will have one standby recovery group for each non-clustered server. Each recovery group contains the IP address and network name of the associated stand-alone server. The recovery groups are preferably in the offline mode during normal operation. The cluster monitors the health of the stand-alone servers, preferably through the use of a heartbeat mechanism. If the cluster detects a failure, it will use the storage consolidation software associated with the SAN to reassign the LUNs owned by the failing server to the cluster. After the cluster has reassigned the LUNs, it will activate the recovery group containing the IP address and network name of the failing server. This will enable the cluster to assume the identity of the failing server and serve its users.
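The recovery sequence described above (reassign the failing server's LUNs to the cluster, then activate the matching recovery group) can be sketched as follows. This is an illustrative model only: the class names, the dictionary-based LUN table, the server names, and the sample addresses are assumptions, and failure detection is taken as already having occurred.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveryGroup:
    """Identity of one stand-alone server, kept offline until needed."""
    ip_address: str
    network_name: str
    online: bool = False


class SpareCluster:
    """Cluster acting as the spare for several stand-alone servers (hypothetical)."""

    def __init__(self, name: str, lun_owner: Dict[int, str],
                 recovery_groups: Dict[str, RecoveryGroup]) -> None:
        self.name = name
        self.lun_owner = lun_owner              # LUN -> owning server
        self.recovery_groups = recovery_groups  # server -> its recovery group

    def handle_failure(self, failed_server: str) -> List[int]:
        group = self.recovery_groups[failed_server]
        # Step 1: reassign the LUNs owned by the failing server to the cluster.
        moved = sorted(lun for lun, owner in self.lun_owner.items()
                       if owner == failed_server)
        for lun in moved:
            self.lun_owner[lun] = self.name
        # Step 2: only then bring the recovery group online, so the cluster
        # takes on the failing server's IP address and network name.
        group.online = True
        return moved


cluster = SpareCluster(
    name="cluster",
    lun_owner={0: "server-a", 1: "server-a", 2: "server-b"},
    recovery_groups={
        "server-a": RecoveryGroup("192.0.2.10", "SERVER-A"),
        "server-b": RecoveryGroup("192.0.2.11", "SERVER-B"),
    },
)
moved_luns = cluster.handle_failure("server-a")
```

The ordering matters in this sketch: the storage is reassigned before the identity is activated, so that clients reconnecting to the address find the data already reachable.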
A technical advantage of the present invention is that Windows NT Enterprise Edition, Windows 2000 Advanced Server, or other expensive cluster software packages need not be installed on every server. As computer networks continue to expand and include more and more servers, the expense of installing cluster software on every server becomes a serious cost issue. As a result, significant cost savings can be realized from the present invention because cluster software need only be installed on one server, regardless of the size of the computer network.
Another advantage of the present invention is that the cluster is able to perform useful work and serve clients while also acting as a spare server for the stand-alone servers in the SAN storage consolidation group. Thus, unlike a hot or spare back-up server, the cluster is an active component in the computer network. As a result, the system administrator can maximize the investment made in the cluster because the cluster can perform several roles. Furthermore, because the recovery time only consists of the time required to detect the error, reassign the LUNs and activate the cluster resource group, the recovery time is excellent. The use of a cluster resource group is a much faster solution than integrating a hot or spare server into the computer network or troubleshooting the problem. In addition, a heartbeat mechanism may be implemented in a network that contains more nodes than a conventional cluster. The present invention allows for the heartbeat mechanism to be coupled to all the servers on the computer network. Other technical advantages should be apparent to one of ordinary skill in the art in view of the specification, claims, and drawings.