The present invention relates generally to distributed networks, and in particular to core cluster functions for maintaining consistency of shared data resources in a cluster environment.
As computer systems and networks become increasingly complex, the need for high availability of these systems is becoming correspondingly important. Data networks, and especially the Internet, are uniting the world into a single global marketplace that never closes. Employees, sales representatives, and suppliers in far-flung regions need access to an enterprise's network systems every hour of the day. Furthermore, increasingly sophisticated customers expect twenty-four hour sales and service from a Web site.
As a result, tremendous competitive pressure is placed on companies to keep their systems running continuously, and to be continuously available. With inordinate amounts of downtime, customers would likely take their business elsewhere, costing a company its goodwill and revenue. Furthermore, there are costs associated with lost employee productivity, diverted, canceled, and deferred customer orders, and lost market share. In sum, network server outages can be very costly.
In the past, companies have run on a handful of computers executing relatively simple software. This made it easier to manage the systems and isolate problems.
But in the present networked computing environment, information systems can contain hundreds of interdependent servers and applications. Any failure in one of these components can cause a cascade of failures that could bring down a server and leave a user susceptible to monetary losses.
Generally, there are several levels of availability. The particular use of a software application typically dictates the level of availability needed. There are four general levels of systems availability: base-availability systems, high-availability systems, continuous-operations environments, and continuous-availability environments.
Base-availability systems are ready for immediate use, but will experience both planned and unplanned outages. Such systems are used for application development.
Second, high-availability systems include technologies that sharply reduce the number and duration of unplanned outages. Planned outages still occur, but the servers also include facilities that reduce their impact. High-availability systems are used by stock trading applications.
Third, continuous-operations environments use special technologies to ensure that there are no planned outages for upgrades, backups, or other maintenance activities. Frequently, companies also use high-availability servers in these environments to reduce unplanned outages. Continuous-operations environments are used for Internet applications, such as Internet servers and e-mail applications.
Last, continuous-availability environments seek to ensure that there are no planned or unplanned outages. To achieve this level of availability, companies must use dual servers or clusters of redundant servers in which one server automatically takes over if another server goes down. Continuous-availability environments are used in commerce and mission-critical applications.
As network computing is being integrated more and more into the present commercial environment, the importance of having high availability for distributed systems on clusters of computer processors has been realized, especially for enterprises that run mission-critical applications. Networks with high availability characteristics have procedures within the cluster to deal with failures in the service groups, and make provisions for the failures. High availability means a computing configuration that recovers from failures and provides a better level of protection against system downtime than standard hardware and software alone.
Conventionally, failures have been handled through a failfast or failstop function. A computer module executed on a computer cluster is said to be failfast if it stops execution as soon as it detects a severe enough failure and if it has a small error latency. Such a strategy reduces the possibility of cascaded failures due to a single failure occurrence.
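The failfast behavior described above can be illustrated with a minimal Python sketch. The class names, the numeric severity scale, and the threshold are hypothetical and are introduced only for illustration; they are not taken from any particular cluster product.

```python
class FailFastError(Exception):
    """Raised to halt a module as soon as a severe fault is detected."""


class FailFastModule:
    """Hypothetical module that stops executing on the first severe fault."""

    def __init__(self, severity_threshold=3):
        # Assumption: faults carry an integer severity; values at or above
        # this threshold count as "severe enough" to stop the module.
        self.severity_threshold = severity_threshold
        self.stopped = False
        self.processed = []

    def handle(self, item, fault_severity=0):
        # Once stopped, the module refuses all further work.
        if self.stopped:
            raise FailFastError("module already stopped")
        # The fault check runs before any work is done, which keeps the
        # error latency (time from fault to detection) small.
        if fault_severity >= self.severity_threshold:
            self.stopped = True
            raise FailFastError(f"severe fault (severity {fault_severity})")
        self.processed.append(item)
        return item
```

Because the module halts before acting on a faulty input, a single fault is less likely to propagate into the work of other modules.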
Another strategy for handling system failures is through fault containment. Fault containment endeavors to place barriers between components so that an error or fault in one component would not cause a failure in another.
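As a rough illustration of fault containment, the following Python sketch places a barrier around each component so that an exception raised by one does not prevent the others from running. The function name and the representation of components as callables are assumptions made for this example only.

```python
def run_contained(components):
    """Run each component behind a barrier.

    components: dict mapping a component name to a zero-argument callable.
    Returns (results, failures), both keyed by component name.
    """
    results, failures = {}, {}
    for name, component in components.items():
        try:
            results[name] = component()
        except Exception as exc:
            # Barrier: the fault is recorded here and goes no further,
            # so the remaining components still execute.
            failures[name] = exc
    return results, failures
```

With three components of which the second raises an error, the first and third still return their results and only the second appears in the failure record.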
With respect to clusters, the need for high availability increases as clusters grow ever larger. But growth in the size of these clusters increases the risk of failure within the cluster from many sources, such as hardware failures, program failures, resource exhaustion, operator or end-user errors, or any combination of these.
Up to now, high availability has been limited to hardware recovery in a cluster having only a handful of nodes. But hardware techniques are not enough to ensure high availability: hardware recovery can compensate only for hardware failures, which account for only a fraction of the availability risk factors.
One example of providing high availability has been software application clustering support. This technique has implemented software techniques for shared system resources such as a shared disk and a communication protocol.
Another example of providing high availability has been network systems clustering support. With systems clustering support, failover is initiated in the case of hardware failures such as the failure of a node or a network adapter.
Generally, a need exists for simplified and local management of shared resources such as databases, in which local copies of the resource are maintained at each member node of the cluster. Such efficient administrative functions aid the availability of the cluster and allow processor resources to be used for the execution and operation of software applications for a user.
Thus, provided herein is a method and apparatus for providing a recent set of replicas for a cluster data resource within a cluster having a plurality of nodes, each of the nodes having a group services client with membership and voting services. The method of the present invention concerns broadcasting a data resource open request to the nodes of the cluster, determining the most recent replica of the cluster data resource among the nodes, and distributing the recent replica to the nodes of the cluster.
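The three steps of the method (broadcast an open request, determine the most recent replica, distribute it) can be sketched in Python as follows. This is a simplified single-process model, not the claimed implementation: the `Replica` type, the integer timestamps, the alphabetical tie-break, and the function name are all assumptions introduced for illustration, and the group services broadcast is modeled as a plain dictionary lookup.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    timestamp: int  # assumed: a larger value means a more recent replica
    data: str


def open_data_resource(local_replicas):
    """Bring every node's replica up to the most recent one.

    local_replicas: dict mapping node name -> Replica (modified in place).
    Returns (donor node name, sorted list of nodes that were updated).
    """
    # Step 1: the open request is broadcast and each node reports the
    # timestamp of its local replica (modeled here as a direct read).
    reported = {node: rep.timestamp for node, rep in local_replicas.items()}

    # Step 2: the most recent replica has the highest timestamp; ties go
    # to the alphabetically first node so the choice is deterministic.
    donor = max(sorted(reported), key=lambda node: reported[node])
    newest = local_replicas[donor]

    # Step 3: the recent replica is distributed to nodes holding older ones.
    updated = []
    for node in sorted(local_replicas):
        if local_replicas[node].timestamp < newest.timestamp:
            local_replicas[node] = Replica(newest.timestamp, newest.data)
            updated.append(node)
    return donor, updated
```

Only nodes whose replicas are out of date receive a copy in this sketch, so nodes that already hold the recent replica incur no transfer.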
The apparatus of the present invention is for providing a recent set of replicas for a cluster data resource. The apparatus has a cluster having a plurality of nodes in a peer relationship, each node having an electronic memory for storing a local replica of the cluster data resource. A group services client, which is executable by each node of the cluster, has cluster broadcasting and cluster voting capability. A database conflict resolution protocol ("DCRP"), which is executable by each node of the cluster, interacts with the group services clients such that the DCRP broadcasts to the plurality of nodes a data resource modification request having a data resource identifier and a timestamp. The DCRP determines a recent replica of the cluster data resource among the nodes with respect to the timestamp of the broadcast data resource modification request relative to a local timestamp associated with the data resource identifier, and distributes the recent replica of the cluster data resource to each required node of the plurality of nodes.
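The timestamp comparison performed by the DCRP might look like the following Python sketch. The text above states only that the request timestamp is compared with a local timestamp for the same data resource identifier; the specific approve/reject rule, the default timestamp of zero for an unknown resource, and all names below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModificationRequest:
    resource_id: str  # data resource identifier
    timestamp: int    # timestamp carried by the broadcast request


def vote(request, local_timestamps):
    """Return one node's vote on a broadcast modification request.

    local_timestamps: dict mapping resource_id -> local replica timestamp.
    Assumed rule: approve unless the local replica is newer than the request.
    """
    local = local_timestamps.get(request.resource_id, 0)
    return "approve" if local <= request.timestamp else "reject"


def nodes_needing_update(request, cluster):
    """List nodes whose local replica is older than the request timestamp.

    cluster: dict mapping node name -> that node's local_timestamps dict.
    """
    return sorted(
        node
        for node, stamps in cluster.items()
        if stamps.get(request.resource_id, 0) < request.timestamp
    )
```

Under this assumed rule, a node holding a newer replica than the requester rejects the request, which signals that the requester's copy is stale, while nodes holding older replicas are the ones that require the recent replica.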
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.