1. Technical Field
The present invention relates generally to fault-tolerant computer networks and more particularly to techniques for managing a set of resources in such a network to ensure that the resources remain available as a "group" to provide uninterrupted processing of software applications.
2. Description of the Related Art
It is known in the art to implement certain computer resources in a network in a redundant fashion for fault tolerance. One such technique involves a so-called "cluster" configuration wherein a given node of the network and its associated resources (e.g., its processor and disk drives) are essentially "mirrored" to ensure redundancy. A "highly-available" cluster typically includes a pair of nodes A and B, both of which can be doing active work. In the event of a failure of either node, the other node is capable of taking over some or all of the work of the failed node. A fully-redundant cluster configuration is one in which there are no single points of failure within the cluster configuration because the nodes, the data and the network itself are replicated for availability.
Tools for managing cluster configurations are also known in the art. Typically, a cluster is managed using a construct commonly referred to as a "resource group." A resource group is a set of highly-available resources, such as applications, network addresses and disks, which are expected to be taken over by a backup computer in the event of a failure of the primary computer. These resources are said to be in a "group" because it is necessary that they move together in a coordinated manner to the same backup system. In other words, the resources in a resource group need to "stay together" or be "collected" on the same computer at all times.
Typically, a user is asked to define the contents of a resource group. This approach, however, creates serious problems. One main problem with user management of resource groups is that users do not always define the membership of a particular resource group correctly. It is essential that all applications that share any resource, together with their related resources, be in the same resource group. By way of example, assume that the computer network includes a number of workstations running DB/2™ and Lotus Notes®. If these programs share the same IP address, then they need to be in the same resource group because a particular IP address cannot be taken over by two different computers. This sharing relationship between the applications, however, may be unknown to the user.
A more subtle problem exists when the cluster configuration itself forces some type of artificial sharing that is less evident to the user creating the resource group. An example of this is when the implementation restricts disk takeover such that only a single computer can have control of (and access to) a disk. In this example, the configuration itself mandates that all partitions on that disk can be taken over only by the same computer. As a result, there is "artificial sharing" between those applications that happen to use partitions residing on the same disk. This forces these applications to be collocated even though they do not appear, at first, to be sharing anything. If the administrator defining the resource group is not aware of this requirement, the resource group definition will lead to takeover failures under certain conditions.
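The collocation requirement described above can be illustrated with a short sketch. It is not taken from any particular cluster manager; the function name `required_groups`, the resource labels, and the `partition_to_disk` mapping are hypothetical and chosen only for illustration. The sketch treats each IP address and each whole disk as a single takeover unit, then merges applications that use the same unit, whether directly (a shared IP address) or artificially (partitions on the same disk).

```python
from collections import defaultdict


def required_groups(app_resources, partition_to_disk):
    """Return sorted lists of applications that must be collocated.

    app_resources: hypothetical mapping of application -> set of resources.
    partition_to_disk: hypothetical mapping of partition -> owning disk,
        modelling a configuration where a disk is taken over as a whole.
    """
    parent = {app: app for app in app_resources}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    owner = {}  # takeover unit -> first application seen using it
    for app, resources in app_resources.items():
        for res in resources:
            # A partition is replaced by its disk; other resources stand alone.
            unit = partition_to_disk.get(res, res)
            if unit in owner:
                union(app, owner[unit])
            else:
                owner[unit] = app

    groups = defaultdict(set)
    for app in app_resources:
        groups[find(app)].add(app)
    return sorted(sorted(members) for members in groups.values())


# Illustrative data only: "db2" and "notes" share an IP address, while
# "notes" and "web" use different partitions of the same disk.
app_resources = {
    "db2": {"ip:10.0.0.5", "part:sda1"},
    "notes": {"ip:10.0.0.5", "part:sdb1"},
    "web": {"ip:10.0.0.6", "part:sdb2"},
    "batch": {"ip:10.0.0.7", "part:sdc1"},
}
partition_to_disk = {
    "part:sda1": "disk:sda",
    "part:sdb1": "disk:sdb",
    "part:sdb2": "disk:sdb",
    "part:sdc1": "disk:sdc",
}

print(required_groups(app_resources, partition_to_disk))
```

In this example "web" ends up in the same group as "db2" although the two share nothing directly; the link runs through "notes", which is the kind of relationship a user defining groups by hand could easily miss.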
Maintenance of the various resources in the cluster configuration often exacerbates the problem. For example, if a system administrator reorganizes which partitions reside on which disks, or changes the IP addresses used by various applications, such actions could affect the content of one or more previously-defined resource groups. It may not be apparent to the person carrying out these maintenance tasks that any resource group is being altered. Under certain circumstances, however, a key resource may not be in the appropriate resource group when it is needed, which is unacceptable.
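To make the maintenance hazard concrete, the following sketch audits a user-defined assignment of applications to resource groups before and after a partition is moved. It is a minimal illustration under stated assumptions, not a description of any existing tool: the function `find_split_units`, the group names, and the resource labels are all hypothetical.

```python
from collections import defaultdict


def find_split_units(group_of_app, app_resources, partition_to_disk):
    """Return takeover units that are used by more than one resource group.

    group_of_app: hypothetical user-defined mapping of application -> group.
    app_resources: hypothetical mapping of application -> set of resources.
    partition_to_disk: hypothetical mapping of partition -> owning disk.
    """
    groups_using = defaultdict(set)
    for app, resources in app_resources.items():
        for res in resources:
            # A disk is assumed to be taken over as a whole.
            unit = partition_to_disk.get(res, res)
            groups_using[unit].add(group_of_app[app])
    return {
        unit: sorted(groups)
        for unit, groups in groups_using.items()
        if len(groups) > 1
    }


group_of_app = {"db2": "rg1", "web": "rg2"}
app_resources = {"db2": {"part:p1"}, "web": {"part:p2"}}

# Before maintenance: each partition lives on its own disk.
before = {"part:p1": "disk:sda", "part:p2": "disk:sdb"}
# After maintenance: the administrator moves p2 onto disk sda.
after = {"part:p1": "disk:sda", "part:p2": "disk:sda"}

print(find_split_units(group_of_app, app_resources, before))
print(find_split_units(group_of_app, app_resources, after))
```

The group definitions themselves never change between the two calls; only the partition layout does, yet the second call reports that disk `sda` is now claimed by both `rg1` and `rg2`, which is the silent invalidation the paragraph above describes.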