Distributed computing systems have found application in a number of different computing environments, particularly those requiring high performance and/or high availability and fault tolerance. In a distributed computing system, multiple computers connected by a network are permitted to communicate and/or share workload. Distributed computing systems support practically all types of computing models, including peer-to-peer and client-server computing.
One particular type of distributed computing system is referred to as a clustered computing system. “Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a client or user, the nodes in a cluster appear collectively as a single computer, or entity. In a client-server computing model, for example, the nodes of a cluster collectively appear as a single server to any clients that attempt to access the cluster.
Clustering is often used in relatively large multi-user computing systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Load balancing is often also used to distribute tasks fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
In many clustered computer systems, the services offered by such systems are implemented as managed resources. Some services, for example, may be singleton services, which are handled at any given time by one particular node, with automatic failover used to move a service to another node whenever the node currently hosting the service encounters a problem. Other services, often referred to as distributed services, enable multiple nodes to provide a service, e.g., to handle requests for a particular type of service from multiple clients.
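The distinction between singleton and distributed services can be illustrated with a minimal sketch. The `Cluster` class, its method names, and the "first healthy node" selection rule are assumptions made purely for illustration and do not correspond to any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster model; every name here is hypothetical."""
    nodes: list[str]
    healthy: set[str] = field(init=False)
    singleton_host: dict[str, str] = field(init=False)

    def __post_init__(self) -> None:
        self.healthy = set(self.nodes)   # all nodes start out healthy
        self.singleton_host = {}         # service name -> hosting node

    def activate_singleton(self, service: str) -> str:
        """Host the service on exactly one healthy node (first in list order)."""
        for node in self.nodes:
            if node in self.healthy:
                self.singleton_host[service] = node
                return node
        raise RuntimeError("no healthy node available")

    def fail_node(self, node: str) -> None:
        """Mark a node as failed and fail over any singleton it was hosting."""
        self.healthy.discard(node)
        for service, host in list(self.singleton_host.items()):
            if host == node:
                self.activate_singleton(service)

    def distributed_hosts(self) -> list[str]:
        """A distributed service may run on every healthy node at once."""
        return [n for n in self.nodes if n in self.healthy]
```

In this sketch, a singleton service activated on a three-node cluster is hosted by one node only; calling `fail_node` on that node moves the service to the next healthy node, while `distributed_hosts` simply shrinks by one entry.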
Resources such as cluster-provided services are typically managed through the use of various types of policies that govern some aspect of a resource's existence. A policy, in general, is any set of rules that may be used to manage the existence and operation of one or more resources, and includes, for example, activation or high availability policies, security policies, rights policies, and other types of management policies. An activation policy may be used, for example, to select a particular node or nodes to use to host a service, and/or to manage how failover occurs in response to a node failure. A security policy may be used, for example, to determine what resources particular users are permitted to access and/or what types of operations those users are permitted to perform. A rights policy may be used, for example, to control access to digital content.
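As a rough illustration of a policy as a set of rules, the following sketch models an activation policy and a security policy. The class names, the ordered `preferred_nodes` list, and the per-user `grants` mapping are hypothetical simplifications, not features of any specific system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivationPolicy:
    """Hypothetical rule: host the service on the first preferred node that is healthy."""
    preferred_nodes: list[str]

    def select_host(self, healthy: set[str]) -> Optional[str]:
        for node in self.preferred_nodes:
            if node in healthy:
                return node
        return None  # no eligible node; the service stays inactive


@dataclass
class SecurityPolicy:
    """Hypothetical rule: each user may perform only the operations granted to that user."""
    grants: dict[str, set[str]]

    def permits(self, user: str, operation: str) -> bool:
        return operation in self.grants.get(user, set())
```

Re-evaluating `select_host` with the updated set of healthy nodes after a failure yields the failover target under this assumed rule.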
In many distributed computer systems, cluster-provided services may represent but one kind of resource. Furthermore, different types of resources may be dependent upon one another, e.g., requiring one type of resource to be managed in a manner that is consistent with the management of another type of resource.
For example, some application server environments manage any cluster-provided services, e.g., transaction services, messaging services, etc., that run within such environments. In many instances, such application server environments incorporate integrated high availability managers that manage the activation of individual instances of a cluster-provided service on each node. The domain of a high availability manager incorporated into an application server environment, however, is typically constrained to those resources that are provided by the environment itself. Such constraints can complicate the management of resources based upon the requirements of other resources that exist externally from the application server environment.
As one example, a transaction service typically maintains a log of transactions to assist in recovery from failures. The log must be stored in persistent storage such as a SAN disk so the data maintained thereby is not lost as a result of a failure. Management of a log may be provided by a distributed resource such as a Journaled File System (JFS), and in many environments, a JFS file system can only be mounted on one node of a distributed computer system at a time. To ensure access to the log, therefore, a transaction service is often required to be active on the same node as, i.e., be collocated with, the JFS file system within which the log is managed.
A separate high availability manager, such as a middleware-based or operating system-based high availability manager, typically manages JFS file systems and other similar resources in a distributed computer system. In conventional designs, however, there is no interaction between the high availability manager of an application server environment and other resource managers, thus precluding one high availability manager from being able to make resource management decisions that are dependent upon the status of resources that are managed outside of that manager's domain.
In the above example, therefore, a conventional high availability manager integrated into an application server environment may be incapable of independently determining where to activate a transaction service, as the manager is typically not aware of the node on which the JFS file system is currently active.
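The collocation requirement in the transaction log example, and the visibility gap that prevents an integrated manager from satisfying it, can be sketched as follows. The resource names, node names, and the `choose_node` function are hypothetical and serve only to illustrate the constraint.

```python
from typing import Optional

# Hypothetical view held by the external high availability manager:
# externally managed resource -> node on which it is currently active.
EXTERNAL_VIEW = {"jfs_txlog": "node2"}

# Hypothetical collocation requirements:
# service -> resource with which it must share a node.
COLLOCATE_WITH = {"transaction_service": "jfs_txlog"}


def choose_node(service: str, visible_locations: dict[str, str]) -> Optional[str]:
    """Return the node on which to activate `service`, or None when the
    resource it must be collocated with is not visible to the caller."""
    dependency = COLLOCATE_WITH[service]
    return visible_locations.get(dependency)


# The integrated manager sees only resources of its own environment, so the
# JFS location is unknown and no placement decision can be made.
integrated_choice = choose_node("transaction_service", {})

# With the external manager's view available, the placement is determined.
informed_choice = choose_node("transaction_service", EXTERNAL_VIEW)
```

Under these assumptions, `integrated_choice` is `None`, whereas `informed_choice` is `"node2"`, the node on which the file system holding the log is mounted.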
Conventional systems have typically addressed this limitation by requiring an application server to be started on a node using a script under the control of an external high availability manager. The script identifies the location of any resource that may need to be collocated with any resources being managed by the application server environment, such that when the application server initializes, its integrated high availability manager can activate, on the proper nodes, any resources that depend on externally managed resources.
Starting an application server, however, is often time-consuming, and may lead to several minutes of downtime before the server can resume activities. Given the goal of continuous accessibility in a distributed computer system, even a few minutes of downtime is highly undesirable.
Therefore, a significant need exists in the art for a faster and more efficient manner of coordinating the management of resources that are dependent upon other, externally managed resources.