The usage of the Internet has increased dramatically over the last several years. Many popular websites receive millions of “hits” each day. Consequently, the network servers providing content for these websites have experienced a dramatic increase in their workload. In order to process such substantial amounts of network traffic without subjecting clients (surfers) to annoying delays in retrieving web pages, it is advantageous to distribute the applications (or services) handling this traffic among multiple web server hardware nodes, so that the multiple server nodes can operate in parallel to process the network traffic.
A cluster is a collection of coupled computing nodes that provides a single client view of network services or applications, including databases, web services, and file services. In other words, from the client's point of view, a multinode computer cluster operates to provide network services in exactly the same manner as a single server node. Each cluster node is a standalone server that runs its own processes. These processes can communicate with one another to form what looks like (to a network client) a single system that cooperatively provides applications, system resources, and data to users.
A cluster offers several advantages over traditional single server systems. These advantages include support for highly available and scalable applications, capacity for modular growth, and low entry price compared to traditional hardware fault-tolerant systems.
A service that spreads an application across multiple nodes to create a single, logical service is called a scalable service. Scalable services leverage the number of nodes and processors in the entire cluster on which they run. One node, called the Global Interface Node or GIF node, receives all application requests and dispatches them to multiple nodes on which the application server is running. If this node fails, the global interface fails over to a surviving node. If any of the nodes on which the application is running fails, the application continues to run on the other nodes with some performance degradation until the failed node returns to the cluster.
If any of the aforementioned network server nodes fails, it is desirable that other nodes take over the services provided by the failed node such that the entire system remains operational. High Availability (HA) is the ability of a cluster to keep an application up and running, even though a failure has occurred that would normally make a server system unavailable. Therefore, highly available systems provide nearly continuous access to data and applications.
It is well known in the art that an application, such as a web server, must be specially configured to run as a highly available or scalable application on a computer cluster. Specifically, a special program called a resource type must be provided to start an instance of the application, monitor the application's execution, detect failure of the application, and start another instance of the application if the first instance fails. The term data service is used herein to describe a third-party application, such as the Apache web server, that has been configured to run on a cluster rather than on a single server. A data service includes the application software and an additional container process, called the resource type, that starts, stops, and monitors the application.
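By way of illustration only, the start, stop, and monitor duties of a resource type can be sketched as follows. The class name `ResourceType`, its methods, and the `demo` helper are hypothetical and are not taken from any actual cluster framework. The liveness probe is assumed to be a simple check that the process is still running; a real fault monitor would typically probe the application itself.

```python
import subprocess
import sys


class ResourceType:
    """Hypothetical wrapper that starts, stops, and monitors one application."""

    def __init__(self, command):
        self.command = command   # argv list used to launch the application
        self.process = None      # handle of the current instance, if any
        self.restarts = 0        # number of replacement instances started

    def start(self):
        self.process = subprocess.Popen(self.command)

    def stop(self):
        # Terminate the instance only if it is still running.
        if self.process is not None and self.process.poll() is None:
            self.process.terminate()
            self.process.wait()

    def is_healthy(self):
        # Assumed probe: the instance is healthy while its process is alive.
        return self.process is not None and self.process.poll() is None

    def monitor_once(self):
        """Probe the instance and start another one if it has failed."""
        if not self.is_healthy():
            self.restarts += 1
            self.start()


def demo():
    # Stand-in "application": a child Python process that simply sleeps.
    rt = ResourceType([sys.executable, "-c", "import time; time.sleep(60)"])
    rt.start()
    rt.process.kill()    # simulate an application crash
    rt.process.wait()
    rt.monitor_once()    # detects the failure and starts a new instance
    healthy = rt.is_healthy()
    rt.stop()
    return rt.restarts, healthy
```

In this sketch, a caller would invoke `monitor_once` periodically; `demo` shows a single simulated crash being detected and a replacement instance being started.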
One example of the aforementioned resource type is a failover resource type. Failover is the process by which the cluster automatically relocates an application from a failed primary node to a designated secondary node. Failover services utilize the failover resource type. In other words, the failover resource type is a container for application instance resources.
For failover data services, application instances run only on a single node. If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover), depending on how the data service has been configured. With failover, a clustered computer system provides high availability.
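The restart-or-failover choice described above can be sketched as a small decision function. This is a minimal illustration, assuming that the data service configuration is reduced to a single hypothetical parameter, `max_restarts`, which caps the number of local restart attempts before the instance is relocated. The function name and the returned action labels are likewise assumptions.

```python
def handle_fault(current_node, nodes, restart_attempts, max_restarts):
    """Return (action, node) for a failed instance of a failover data service.

    max_restarts is a hypothetical configuration value: the number of
    restarts allowed on the same node before failing over.
    """
    if restart_attempts < max_restarts:
        # Still within the configured limit: restart on the same node.
        return ("restart", current_node)
    for node in nodes:
        if node != current_node:
            # Relocate the instance to the first other node in the list.
            return ("failover", node)
    # No other node is available to host the instance.
    return ("offline", None)
```

For example, with `nodes = ["node1", "node2"]` and `max_restarts = 2`, the first two faults on `node1` yield a local restart, and the third yields a failover to `node2`.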
When a failover occurs, clients might see a brief interruption in service and might need to reconnect after the failover has finished. However, clients are not aware of the physical server from which they are provided the application and data.
Another example of a data service is a scalable data service. The scalable data service has the potential for running active instances of an application on multiple cluster nodes. Scalable services utilize a scalable resource type to start, stop, and monitor the application. The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running at once.
Service requests come into the cluster through a single network interface (the global interface or GIF) and are distributed to the nodes based on one of several predefined algorithms set by the load-balancing policy. The aforementioned load-balancing policy is a set of rules describing how the network traffic should be distributed among nodes of a clustered computer system. The cluster can use the load-balancing policy to balance the service load between several nodes. Note that there can be multiple GIFs on different nodes hosting other shared addresses.
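As an illustration of how a load-balancing policy might map incoming requests to nodes, the following sketch shows two simple policies. The text above does not name the predefined algorithms, so round-robin distribution and client-affinity hashing are used here purely as assumed examples. The class names and the `dispatch` helper are hypothetical.

```python
import itertools
import zlib


class RoundRobinPolicy:
    """Assumed policy: hand requests to the nodes in turn."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(list(nodes))

    def pick(self, client_ip):
        return next(self._cycle)


class ClientAffinityPolicy:
    """Assumed policy: always send a given client to the same node."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def pick(self, client_ip):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(client_ip.encode("utf-8")) % len(self._nodes)
        return self._nodes[index]


def dispatch(policy, client_ips):
    """Model the global interface: choose a node for each incoming request."""
    return [policy.pick(ip) for ip in client_ips]
```

With three nodes, the round-robin policy returns to the first node on the fourth request, whereas the affinity policy sends every request from one client address to the same node.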
For scalable services, application instances run on several nodes simultaneously. If the node that hosts the global interface fails, the global interface fails over to another node. If a running application instance fails, an attempt is made to restart the instance on the same node.
If an application instance cannot be restarted on the same node, and another unused node is configured to run the service, the service fails over to the unused node. Otherwise, it continues to run on the remaining nodes, possibly causing a degradation of service throughput.
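The recovery behavior of a scalable service when one instance fails can be sketched as follows. This is a simplified model under stated assumptions: the function name, its parameters, and the representation of the cluster as sets of node names are all hypothetical, and the outcome of the local restart attempt is passed in as a flag rather than actually performed.

```python
def recover_instance(failed_node, active_nodes, spare_nodes, restart_ok):
    """Return the set of nodes running the service after one instance fails.

    restart_ok is a hypothetical flag indicating whether the instance could
    be restarted on the node where it failed.
    """
    active = set(active_nodes)
    if restart_ok:
        # The instance was restarted in place; the node set is unchanged.
        return active
    active.discard(failed_node)
    # Nodes configured for the service that are not currently running it.
    unused = [n for n in spare_nodes if n not in active and n != failed_node]
    if unused:
        # Fail over to an unused node, preserving the instance count.
        active.add(unused[0])
    # Otherwise the service continues on fewer nodes, with reduced throughput.
    return active
```

When no unused node is configured, the returned set is one node smaller than before, which corresponds to the degraded throughput case described above.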
In a conventional highly available system, the implementations of resource types, such as the failover resource type or the scalable resource type, are created manually by a developer for each highly available application. Unfortunately, creating appropriate resource types manually requires significant time and effort and is therefore expensive.
Accordingly, it would be highly advantageous to have a tool that automates the creation of the aforementioned resource types based on characteristics of a particular clustered computer system and parameters specified by the user. It would also be advantageous to have a tool that generates utility scripts for starting, stopping, and removing an instance of the resultant resource type.