As the use of open systems grows, managing hundreds or thousands of servers becomes increasingly complex. In addition, demand for increased availability of the applications running on those servers presents a further challenge. Many information technology (IT) managers are working to move from large numbers of small open systems, many running well below their capacities, to a much smaller number of large-scale enterprise servers running at or near their capacities. This trend in the IT industry is called “server consolidation.”
One early answer to the demand for increased application availability was to provide a one-to-one backup for each server running a critical application. When the critical application failed at the primary server, the application was “failed over” (restarted) on the backup server. However, this solution was very expensive and wasted resources, as the backup servers sat idle. Furthermore, the solution could not handle a cascading failure of both the primary and backup servers.
Another possible solution is “N+1 clustering,” where one enterprise-class server provides redundancy for multiple active servers. N+1 clustering reduces the cost of redundancy for a given set of applications and simplifies the choice of a server for failover, as an application running on a failed server is moved to the one backup server.
However, N+1 clustering is not a complete answer to the need for increased application availability, particularly in a true server consolidation environment. Enterprises require the ability to withstand multiple cascading failures, as well as the ability to take some servers offline for maintenance while maintaining adequate redundancy in the server cluster. Typical cluster management applications provide only limited flexibility in choosing the proper hosts for potentially tens or hundreds of application groups. Examples of commercially available cluster management applications include VERITAS® Global Cluster Manager™, VERITAS® Cluster Server, Hewlett-Packard® MC/Service Guard, and Microsoft® Cluster Server (MSCS).
N-to-N clustering refers to multiple application groups running on multiple servers, with each application group being capable of failing over to different servers in the cluster. For example, a four-node cluster of servers could support three critical database instances. Upon failure of any one of the four nodes, each of the three instances can still run on its own server among the three remaining servers, so that none of the remaining servers is overloaded. N-to-N clustering expands the concept of N+1 clustering from a “backup system” to a requirement for “backup capacity” within the servers forming the cluster.
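The four-node example can be illustrated with a minimal Python sketch. It is not drawn from any of the cluster management products named above. The names `Node`, `fail_over`, and `make_cluster`, the capacity and load figures, and the rule of choosing the surviving server with the most spare capacity are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical server with a fixed capacity in arbitrary units."""
    name: str
    capacity: int
    apps: Dict[str, int] = field(default_factory=dict)  # app name -> load

    @property
    def spare(self) -> int:
        # Capacity not yet consumed by the applications placed on this node.
        return self.capacity - sum(self.apps.values())


def fail_over(nodes: List[Node], failed_name: str) -> List[Node]:
    """Move every application off the failed node onto surviving nodes.

    Assumed rule: each displaced application goes to the survivor with the
    most spare capacity; an error is raised if it would not fit anywhere.
    """
    failed = next(n for n in nodes if n.name == failed_name)
    survivors = [n for n in nodes if n.name != failed_name]
    # Place the heaviest displaced applications first.
    for app, load in sorted(failed.apps.items(), key=lambda kv: -kv[1]):
        target = max(survivors, key=lambda n: n.spare)
        if target.spare < load:
            raise RuntimeError(f"no surviving node has capacity for {app}")
        target.apps[app] = load
    failed.apps.clear()
    return survivors


def make_cluster() -> List[Node]:
    """Four nodes, three database instances, one node's worth of spare capacity."""
    return [
        Node("n1", 100, {"db1": 100}),
        Node("n2", 100, {"db2": 100}),
        Node("n3", 100, {"db3": 100}),
        Node("n4", 100),
    ]


if __name__ == "__main__":
    for node in fail_over(make_cluster(), "n1"):
        print(node.name, sorted(node.apps), "spare:", node.spare)
```

In this sketch no server is designated as the backup. Whichever node fails, the cluster as a whole holds enough spare capacity for each instance to end up on its own server, which is the sense in which N-to-N clustering calls for “backup capacity” rather than a “backup system.”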
What is needed is a business continuity policy that enables critical enterprise applications to survive multiple failures by determining suitable systems for starting applications initially, redistributing applications when systems reach an overloaded condition, and restarting failed applications.