The present invention relates generally to distributed networks, and in particular to core cluster functions for tracking access to data resources in a cluster environment.
As computer systems and networks become increasingly complex, the need to have high availability of these systems is becoming correspondingly important. Data networks, and especially the Internet, are uniting the world into a single global marketplace that never closes. Employees, sales representatives, and suppliers in far-flung regions need access to enterprise network systems every hour of the day. Furthermore, increasingly sophisticated customers expect twenty-four hour sales and service from a Web site.
As a result, tremendous competitive pressure is placed on companies to keep their systems running and available continuously. With inordinate amounts of downtime, customers would likely take their business elsewhere, costing a company both goodwill and revenue. Furthermore, there are costs associated with lost employee productivity, diverted, canceled, and deferred customer orders, and lost market share. In sum, network server outages can be very costly.
In the past, companies have run on a handful of computers executing relatively simple software. This made it easier to manage the systems and isolate problems.
But in the present networked computing environment, information systems can contain hundreds of interdependent servers and applications. A failure in any one of these components can cause a cascade of failures that could bring down a server and expose a user to monetary losses.
Generally, there are several levels of availability. The particular use of a software application typically dictates the level of availability needed. There are four general levels of systems availability: base-availability systems, high-availability systems, continuous-operations environments, and continuous-availability environments.
Base-availability systems are ready for immediate use, but will experience both planned and unplanned outages. Such systems are used for application development.
High-availability systems include technologies that significantly reduce the number and duration of unplanned outages. Planned outages still occur, but the systems also include facilities that reduce their impact. As an example, high-availability systems are used by stock trading applications.
Continuous-operations environments use special technologies to ensure that there are no planned outages for upgrades, backups, or other maintenance activities. Frequently, companies also use high-availability servers in these environments to reduce unplanned outages. Continuous-operations environments are used for Internet applications, such as Internet servers and e-mail applications.
Continuous-availability environments seek to ensure that there are no planned or unplanned outages. To achieve this level of availability, companies must use dual servers or clusters of redundant servers in which one server automatically takes over if another server goes down. Continuous-availability environments are used in commerce and mission-critical applications.
As network computing becomes more integrated into the present commercial environment, the importance of high availability for distributed systems on clusters of computer processors has been recognized, especially for enterprises that run mission-critical applications. Networks with high-availability characteristics have procedures within the cluster to deal with failures in the service groups and to make provisions for those failures. High availability means a computing configuration that recovers from failures and provides a better level of protection against system downtime than standard hardware and software alone.
Conventionally, the strategy for handling failures is through a failfast or failstop function. A computer module executed on a computer cluster is said to be failfast if it stops execution as soon as it detects a severe enough failure and if it has a small error latency. Such a strategy has reduced the possibility of cascaded failures due to a single failure occurrence.
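The failfast behavior described above can be illustrated with a minimal sketch. The class and method names below (FailFastModule, FailFastError, process) are hypothetical and chosen only for illustration, and a negative input stands in for a "severe enough" failure; this is not an implementation taken from any particular cluster product.

```python
class FailFastError(RuntimeError):
    """Raised when the module detects a severe failure or has already stopped."""


class FailFastModule:
    """Hypothetical module that halts on the first severe failure it detects."""

    def __init__(self):
        self.stopped = False

    def process(self, value):
        # Once stopped, refuse all further work so bad state cannot spread.
        if self.stopped:
            raise FailFastError("module has stopped after an earlier failure")
        # Assumption for illustration: a negative value is a severe failure.
        if value < 0:
            self.stopped = True  # stop immediately: small error latency
            raise FailFastError(f"severe failure detected on input {value}")
        return value * 2
```

Because the module refuses all work after the first detected failure, a caller sees the error at once instead of receiving results computed from corrupted state.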
Another strategy for handling system failures is through fault containment. Fault containment endeavors to place barriers between components so that an error or fault in one component will not cause a failure in another.
With respect to clusters, the need for high availability increases as clusters grow ever larger. But growth in the size of these clusters increases the risk of failure within the cluster from many sources, such as hardware failures, program failures, resource exhaustion, operator or end-user errors, or any combination of these.
Up to now, high availability has been limited to hardware recovery in a cluster having only a handful of nodes. But hardware techniques are not enough to ensure high availability. Hardware recovery can compensate only for hardware failures, which account for only a fraction of the availability risk factors.
One example of providing high availability has been software applications clustering support. This technique has implemented software techniques for shared system resources such as a shared disk and a communication protocol.
Another example of providing high availability has been network systems clustering support. With systems clustering support, failover is initiated in the case of hardware failures such as the failure of a node or a network adapter.
Another aspect of providing system availability is keeping track of access to data resources such as a database, particularly when the database is distributed across a cluster. For example, an open request for a cluster database causes all of the member nodes to open their respective databases. In the cluster environment, a data resource that remains open for use by clients needs to be closed when the client routine terminates. When a database is open everywhere across a cluster, the client accesses for each database must be accounted for.
A global count has typically been used to serve this function. But a global access count, stored in a single source accessible by the cluster, has been difficult to use because of the processor time needed to gather the information regarding access to a data resource and then process the data to track each of the resources across the cluster. The tracking of this information is further complicated when nodes join or drop from the cluster, requiring further information management by a global access count.
Accordingly, a need exists for tracking the access to cluster data resources with respect to the open or closed state of the resource, and the accesses to the database by a client.
The present invention addresses the foregoing needs by maintaining usage reference counts for replicated databases within a computer cluster using cluster membership and cluster voting services. Such a method includes maintaining a local reference count for all open distributed data resources within a given node, tracking, by a group services client, those nodes that have the open distributed data resources, and using cluster membership services to update the local reference counts for node failures.
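The three steps of this method can be sketched in a simplified, single-process simulation. Everything named below (Node, Cluster, apply_open, on_node_failed, and so on) is hypothetical and not drawn from the text. In particular, the sketch assumes that each node records which node originated each open, so that a membership notification of a node failure can remove that node's contribution from every surviving node's local count. The voting services and real inter-node messaging are omitted.

```python
class Node:
    """Hypothetical cluster node holding local reference counts."""

    def __init__(self, node_id):
        self.node_id = node_id
        # resource name -> {originating node id -> number of opens}
        self.opens = {}

    def local_ref_count(self, resource):
        return sum(self.opens.get(resource, {}).values())

    def apply_open(self, resource, origin):
        per_origin = self.opens.setdefault(resource, {})
        per_origin[origin] = per_origin.get(origin, 0) + 1

    def apply_close(self, resource, origin):
        per_origin = self.opens.get(resource, {})
        if per_origin.get(origin, 0) == 0:
            raise ValueError("close without a matching open")
        per_origin[origin] -= 1
        if per_origin[origin] == 0:
            del per_origin[origin]

    def on_node_failed(self, failed_id):
        # Membership callback: drop every open owned by the failed node.
        for per_origin in self.opens.values():
            per_origin.pop(failed_id, None)


class Cluster:
    """Stand-in for group and membership services (assumed behavior)."""

    def __init__(self, node_ids):
        self.nodes = {n: Node(n) for n in node_ids}

    def open(self, resource, origin):
        # An open request causes every member node to open the resource.
        for node in self.nodes.values():
            node.apply_open(resource, origin)

    def close(self, resource, origin):
        for node in self.nodes.values():
            node.apply_close(resource, origin)

    def nodes_with_open(self, resource):
        # Group-services view: which nodes currently hold the resource open.
        return sorted(n for n, node in self.nodes.items()
                      if node.local_ref_count(resource) > 0)

    def fail(self, failed_id):
        # Membership change: remove the node, then notify the survivors.
        del self.nodes[failed_id]
        for node in self.nodes.values():
            node.on_node_failed(failed_id)
```

In this sketch no global count is gathered: each node answers from its own local state, and a node failure is handled by a local adjustment on each survivor.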
In one embodiment of the present invention, the foregoing method can be implemented within a computer cluster having a plurality of nodes, each having a proxy thread and a service thread, and a reference counter.
In yet another embodiment of the present invention, the method described above can be implemented as a computer program for operation within the computer cluster.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.