Clusters are increasingly used in computer networks. FIG. 1 depicts a block diagram of a conventional cluster 10. The conventional cluster 10 includes two computer systems 20 and 30, which are typically servers. Each computer system 20 and 30 is known as a node. Thus, the conventional cluster 10 includes two nodes 20 and 30. However, another cluster (not shown) could have a higher number of nodes. Clusters such as the conventional cluster 10 are typically used for business critical applications because the conventional cluster 10 provides several advantages. The conventional cluster 10 is more reliable than a single server because the workload in the conventional cluster 10 can be distributed between the nodes 20 and 30. Thus, if one of the nodes 20 or 30 fails, the remaining node 30 or 20, respectively, may assume at least a portion of the workload of the failed node. The conventional cluster 10 also provides for greater scalability. Use of multiple servers 20 and 30 allows the workload to be evenly distributed between the nodes 20 and 30. If additional nodes (not shown) are added, the workload can be distributed between all nodes in the conventional cluster 10. Thus, the conventional cluster 10 is scalable. In addition, the conventional cluster 10 is typically cheaper than the alternative. To achieve performance and availability equivalent to those of the conventional cluster 10, a large-scale computer system that is typically proprietary would be used. Such a large-scale computer system is generally expensive. Consequently, the conventional cluster 10 provides substantially the same performance as such a large-scale computer system while costing less.
FIG. 1 also depicts resource groups 22, 24 and 32 residing on the nodes 20 and 30. The resource groups 22, 24 and 32 define the components, both software and hardware, that are necessary to support one or more applications. Thus, the resource groups 22 and 24, and the resource group 32, can be considered to be virtual subsets of the nodes 20 and 30, respectively. The resource groups 22 and 24, and the resource group 32, also consume the resources of the nodes 20 and 30, respectively. Thus, the resource groups 22, 24 and 32 use the CPUs, the memory, the disks, the public network and the interconnects for the nodes 20 and 30. The types of resources used by a resource group could include, for example, file shares, generic applications, generic services, IP addresses, network names, physical disks, print spoolers and real time services. A file share allows sharing of a directory on one of the disks in a configuration to give network clients access to the directory. The file share requires a physical disk and a network name (described below). A generic application allows existing applications that are not aware that they reside in a cluster 10 to operate under the control of cluster software. These existing applications can then fail over and be restarted if a problem occurs. The generic application has no mandatory resource dependencies. A generic service is defined by the user at the creation of the resource and has no resource dependencies. An IP address can be used to assign a static IP address and subnet mask to the network interface selected for the cluster 10. The IP address has no dependencies. The network name gives an identity to a resource group to allow client workstations to view the resource group as a single server. The network name has an IP address dependency. The physical disk resource is a physical disk (not shown) in the conventional cluster 10 and has no dependencies. A print spooler allows a common storage disk (not shown) to store print jobs that will be spooled. 
The print spooler requires a physical disk resource and a network name resource. A real time service maintains the date and time consistency between the nodes 20 and 30 of the conventional cluster 10. A particular resource group 22, 24 or 32 may use one or more of these resources as well as other resources. For example, a particular resource group 22 may include a particular application, a physical disk subsystem, an IP address, a network name resource, a print spooler and a real time service.
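By way of illustration only, the resource dependencies described above can be captured in a small lookup table and checked for a given resource group. The following is a minimal sketch and not part of any conventional cluster software. The identifiers `DEPENDENCIES` and `missing_dependencies` are hypothetical, and the entry for the real time service assumes no dependencies because none are stated above.

```python
# Dependencies of each resource type, as stated in the description above.
# All names are illustrative and do not correspond to a real cluster API.
DEPENDENCIES = {
    "file_share": {"physical_disk", "network_name"},
    "generic_application": set(),   # no mandatory dependencies
    "generic_service": set(),
    "ip_address": set(),
    "network_name": {"ip_address"},
    "physical_disk": set(),
    "print_spooler": {"physical_disk", "network_name"},
    "real_time_service": set(),     # ASSUMPTION: not stated in the text
}


def missing_dependencies(group):
    """Return {resource_type: set of required types absent from the group}.

    `group` is an iterable of resource type names. Types not listed in
    DEPENDENCIES are treated as having no dependencies.
    """
    present = set(group)
    missing = {}
    for resource in sorted(present):
        lacking = DEPENDENCIES.get(resource, set()) - present
        if lacking:
            missing[resource] = lacking
    return missing


if __name__ == "__main__":
    # A file share placed in a group without a network name is incomplete.
    print(missing_dependencies(["file_share", "physical_disk"]))
```

An empty result indicates that every resource in the group has the resources it depends upon within the same group.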
During operation of the conventional cluster 10, the resource groups 22, 24 and 32 may move between nodes 20 and 30. For example, if there is a failure in one of the nodes 20 or 30, the resource groups 22 and 24 or the resource group 32, respectively, move to the remaining node 30 or 20, respectively. This allows the conventional cluster 10 to account for failures of one of the nodes 20 or 30. The resource groups 22, 24 and 32 may also move between the nodes 20 and 30 in order to allow the conventional cluster 10 to balance the load between the nodes 20 and 30.
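By way of illustration only, the movement of resource groups upon failure of a node can be sketched as follows. The function name `fail_over`, the labels such as `"node20"` and `"rg22"`, and the rule of placing each group on the surviving node that currently hosts the fewest groups are all assumptions made for this sketch. With two nodes, that rule reduces to moving every group to the remaining node, as described above.

```python
def fail_over(placement, failed_node):
    """Return a new placement with the failed node's groups moved elsewhere.

    `placement` maps a node name to the list of resource groups it hosts.
    The input mapping is not modified.
    """
    survivors = [node for node in placement if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving node can host the resource groups")

    # Copy the surviving nodes' group lists so the caller's data is untouched.
    new_placement = {node: list(placement[node]) for node in survivors}

    for group in placement[failed_node]:
        # ASSUMPTION: choose the survivor that currently hosts the fewest groups.
        target = min(survivors, key=lambda node: len(new_placement[node]))
        new_placement[target].append(group)
    return new_placement


if __name__ == "__main__":
    cluster10 = {"node20": ["rg22", "rg24"], "node30": ["rg32"]}
    print(fail_over(cluster10, "node20"))
```

Counting groups is only a stand-in for load. Such a sketch has no knowledge of how heavily each resource group uses the hardware of its node.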
Although the conventional cluster 10 provides the above-mentioned benefits, one of ordinary skill in the art will readily realize that it is desirable to monitor performance of the conventional cluster 10 during use. Performance of the conventional cluster 10 could vary throughout its use. For example, the conventional cluster 10 may be one computer system of many in a network. One or more of the nodes 20 or 30 of the conventional cluster 10 may have its memory almost full or may be taking a long time to access its disk. Phenomena such as these result in the nodes 20 and 30 in the conventional cluster 10 having lower than desired performance. Therefore, the performance of the entire network is adversely affected. For example, suppose there is a bottleneck in the conventional cluster 10. A bottleneck in a cluster occurs when a component of a node of the cluster, such as the CPU of the node, has high enough usage to cause delays. For example, the utilization of the CPU of the node, the interconnects coupled to the node, the public network interface of the node, the memory of the node or the disk of the node could be high enough to cause a delay in the node performing some of its tasks. Because of the bottleneck, processing can be greatly slowed due to the time taken to access a node 20 or 30 of the conventional cluster 10. This bottleneck in one or more of the nodes 20 and 30 adversely affects performance of the conventional cluster 10. This bottleneck may also slow performance of the network as a whole, for example because of communication routed through the conventional cluster 10. A user, such as a network administrator, would then typically manually determine the cause of the reduced performance of the network and the conventional cluster 10 and determine what action to take in response. In addition, the performance of the conventional cluster 10 may vary over relatively small time scales. 
For example, a bottleneck could arise in just minutes, then resolve itself or last for several hours. Thus, performance of the conventional cluster 10 could change in a relatively short time.
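By way of illustration only, the notion of a bottleneck described above can be expressed as a comparison of component utilizations against thresholds. The following is a minimal sketch. The threshold values, the component names and the function `find_bottlenecks` are assumptions chosen for the example and are not taken from any conventional monitoring tool.

```python
# ASSUMPTION: utilizations are fractions between 0.0 and 1.0, and the
# thresholds below are arbitrary values chosen only for illustration.
THRESHOLDS = {
    "cpu": 0.90,
    "memory": 0.90,
    "disk": 0.85,
    "interconnect": 0.80,
    "public_network": 0.80,
}


def find_bottlenecks(samples, thresholds=THRESHOLDS):
    """Return a sorted list of (node, component) pairs at or above threshold.

    `samples` maps a node name to a mapping of component name to utilization.
    A component without a configured threshold is flagged only at 1.0.
    """
    return sorted(
        (node, component)
        for node, usage in samples.items()
        for component, value in usage.items()
        if value >= thresholds.get(component, 1.0)
    )


if __name__ == "__main__":
    snapshot = {
        "node20": {"cpu": 0.97, "memory": 0.40},
        "node30": {"cpu": 0.30, "disk": 0.88},
    }
    print(find_bottlenecks(snapshot))
```

Because a bottleneck may arise and resolve within minutes, a single snapshot of this kind would need to be repeated at short intervals to be of use.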
In addition, the resource groups 22, 24 and 32 may, as discussed above, move between the nodes 20 and 30. However, there is no conventional mechanism that allows the utilizations of hardware or other resources of the conventional cluster 10 that are associated with a particular resource group 22, 24 or 32 to be tracked. Thus, the effects of moving a resource group 22, 24 or 32 between the nodes 20 and 30 cannot be determined in advance. Consequently, the performance of the conventional cluster 10 with respect to the resource groups 22, 24 and 32 cannot be analyzed.
Accordingly, what is needed is a system and method for studying and improving performance of a computer system that utilizes resource groups. The present invention addresses such a need.