As the use of open systems grows, managing data centers that may have hundreds or thousands of computer systems becomes an increasingly difficult task. Many data centers support large numbers of heterogeneous computer systems, running different operating systems and connected to a variety of networks, such as Storage Area Networks (SANs) and Internet Protocol (IP) networks. Many information technology (IT) managers are working to move from large numbers of small open systems, many running well below their capacities, to a much smaller number of large-scale enterprise servers running at or near their capacities. This trend in the IT industry is called “server consolidation.”
Computer systems in a data center may include large mainframe computers and/or very large servers, such as Hewlett-Packard's Superdome and Sun's Enterprise 10,000 (E10K), providing mainframe-like power using various physical and logical partitioning schemes. Such powerful machines enable server consolidation to be “scaled up” to a small number of powerful servers.
Data centers may also include servers that are symmetric multi-processor (SMP) systems, uniprocessor (UP) systems, and/or blade servers, which include a large number of blades (thin computer cards with one or more microprocessors and memory) that typically share a housing and a common bus. Blade servers enable server consolidation to be “scaled out,” so that the blade server becomes a “compute node” to which blade microprocessors can be allocated upon demand. Similarly, “virtual machines” enable computing power or memory to be provided by a number of processors which are called upon when needed.
Furthermore, the computer systems in a data center may support hundreds of application programs, also referred to as applications. These applications typically have different hardware resource requirements and business priorities, and one application may depend upon other applications. Each of these applications can have respective performance requirements, availability requirements, and disaster recovery requirements. Some application programs may run as batch jobs and have timing constraints (e.g., a batch job computing the price of bonds at a financial firm may need to end an hour before the next trading day begins). Other applications may operate best when resources are allocated as needed, such as stateless web servers and shared disk database applications. Single instance applications may run best on a single large machine with dynamic reconfiguration capabilities.
One early answer to the demand for increased application availability was to provide one-to-one backups for each server running a critical application. When the critical application failed at the primary server, the application was “failed over” (restarted) on the backup server. However, this solution was very expensive and wasted resources, as the backup servers sat idle. Furthermore, the solution could not handle cascading failure of both the primary and backup servers.
Enterprises require the ability to withstand multiple cascading failures, as well as the ability to take some servers offline for maintenance while maintaining adequate redundancy in the server cluster. Clusters of servers became commonplace, with either one server or multiple servers serving as potential failover nodes. Examples of commercially available cluster management applications include VERITAS® Cluster Server, Hewlett-Packard® MC/Service Guard, and Microsoft® Cluster Server (MSCS).
N+1 clustering refers to multiple servers, each typically running one application, plus one additional server acting as a “spare.” When a server fails, the application restarts on the “spare” server. When the original server is repaired, the original server becomes the spare server. In this configuration, there is no longer a need for a second application outage to put the service group back on the “primary node.” Any server can provide redundancy for any other server. Such a configuration allows for clusters having eight or more nodes with one spare server.
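The roaming-spare behavior described above can be pictured with a minimal sketch. The class name, field names, and server and application labels below are hypothetical and chosen only for illustration; they are not taken from any of the cluster management products named earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class NPlusOneCluster:
    """Toy model of N+1 clustering: N active servers plus one roaming spare."""

    assignments: Dict[str, str]   # server name -> application it currently runs
    spare: Optional[str]          # the idle spare server, if one is available
    failed: Set[str] = field(default_factory=set)

    def fail(self, server: str) -> str:
        """Restart the failed server's application on the spare server."""
        if self.spare is None:
            raise RuntimeError("no spare server is available")
        app = self.assignments.pop(server)
        target = self.spare
        self.assignments[target] = app
        self.spare = None
        self.failed.add(server)
        return target

    def repair(self, server: str) -> None:
        """A repaired server rejoins as the new spare; no application moves back."""
        self.failed.remove(server)
        self.spare = server
```

Because `repair` only changes which server is the spare, the application stays where it was restarted, which is why no second outage is needed. The sketch also shows the limit of the scheme: a second failure before any repair finds no spare.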
N-to-N clustering refers to multiple application groups running on multiple servers, with each application group being capable of failing over to different servers in the cluster. For example, a four-node cluster of servers could support three critical database instances. Upon failure of any of the four nodes, each of the three instances can run on a respective server of the three remaining servers, without overloading any of the three remaining servers. N-to-N clustering expands the concept of a cluster having one backup server to a requirement for “backup capacity” within the servers forming the cluster.
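The notion of “backup capacity” can be sketched as a placement calculation. The function below is an illustrative assumption, not a description of any particular product: the capacity units, the greedy largest-first ordering, and all names are hypothetical.

```python
def fail_over(capacity, placement, load, failed):
    """Reassign application groups from a failed server to surviving servers.

    capacity:  server -> total capacity units (hypothetical unit)
    placement: group  -> server currently hosting the group
    load:      group  -> capacity units the group requires
    failed:    name of the server that failed
    Returns a new placement, or raises if backup capacity is insufficient.
    """
    survivors = [s for s in capacity if s != failed]
    used = {s: 0 for s in survivors}
    new_placement = {}

    # Groups on surviving servers stay where they are.
    for group, server in placement.items():
        if server != failed:
            used[server] += load[group]
            new_placement[group] = server

    # Place displaced groups, largest first, on the server with the most free capacity.
    displaced = sorted(
        (g for g, s in placement.items() if s == failed),
        key=lambda g: -load[g],
    )
    for group in displaced:
        target = max(survivors, key=lambda s: capacity[s] - used[s])
        if capacity[target] - used[target] < load[group]:
            raise RuntimeError(f"insufficient backup capacity for {group}")
        used[target] += load[group]
        new_placement[group] = target
    return new_placement
```

Applied to the four-node example, with each node able to host one database instance and the instances initially on three of the nodes, failure of one hosting node moves its instance to the idle fourth node, so each instance still has a server to itself.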
N+1 and N-to-N clustering, however, provide only limited support should multiple servers fail, as there is no generally available method to determine which applications should be allowed to continue to run, and which applications should be shut down to preserve performance of more critical applications. This problem is exacerbated in a disaster recovery (DR) situation. If an entire cluster or site fails, high-priority applications from the failed cluster or site can be started on the DR site, co-existing with applications already running at the DR site.
What is needed is a process for managing information technology that enables enterprise applications to survive multiple failures in accordance with business priorities. An enterprise administrator should be able to define resources, machine characteristics, application requirements, application dependencies, business priorities, load requirements, and other such variables once, rather than several times in different systems that are not integrated. Preferably, resource management software should operate to ensure that high-priority applications are continuously available.
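The decision that is missing after multiple failures, namely which applications keep running and which are shut down, can be illustrated with a simple priority-ordered selection. This is only a sketch under stated assumptions: the function name, the tuple layout, and the convention that a lower priority number means a more critical application are all hypothetical.

```python
def choose_survivors(apps, available_capacity):
    """Decide which applications keep running when capacity is short.

    apps: list of (name, priority, load) tuples; a lower priority number
          is assumed to mean a more critical application.
    available_capacity: capacity units left after the failures.
    Returns (keep, stop): names to keep running and names to shut down.
    """
    keep, stop = [], []
    remaining = available_capacity
    # Consider the most critical applications first.
    for name, priority, load in sorted(apps, key=lambda a: a[1]):
        if load <= remaining:
            keep.append(name)
            remaining -= load
        else:
            stop.append(name)
    return keep, stop
```

For instance, with eight units of capacity left and applications `("reporting", 3, 4)`, `("trading", 1, 5)`, and `("email", 2, 3)`, the sketch keeps `trading` and `email` and stops `reporting`. A fuller treatment would also have to respect application dependencies, which this sketch ignores.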