The invention relates to cluster systems and to the management of a cluster system.
Traditionally, clustered systems of stand-alone server nodes have verified node-to-node connectivity by checking that identified peer nodes can be reached via the data network (cluster interconnect fabric). As long as the list of (potential) member nodes is part of the persistent cluster configuration information, this checking can take place initially at cluster definition time, and then form part of the normal run-time monitoring of the overall state of the cluster and of the individual member nodes.
If communication is not possible for some combination(s) of nodes, this will typically be reported, and an operator will have to verify correct connectivity as well as the state of both nodes and of the interconnect components. At cluster definition (construction) time, this will typically be part of a procedure for verifying correct wiring and configuration.
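The traditional pairwise check described above can be illustrated with a minimal sketch. It assumes a configured member list and a hypothetical probe function `can_reach(a, b)`; the node names and the set of working links are invented for illustration and do not come from any particular cluster product.

```python
from itertools import combinations


def unreachable_pairs(members, can_reach):
    """Return the node pairs for which the probe fails in either direction."""
    failed = []
    for a, b in combinations(members, 2):
        # A pair is reported unless both directions succeed.
        if not (can_reach(a, b) and can_reach(b, a)):
            failed.append((a, b))
    return failed


# Hypothetical stand-in for a real network probe: a set of working directed links.
working = {("n1", "n2"), ("n2", "n1"), ("n1", "n3"), ("n3", "n1")}


def probe(a, b):
    return (a, b) in working


# Pairs an operator would have to investigate.
report = unreachable_pairs(["n1", "n2", "n3"], probe)
print(report)
```

The report only names the failing pairs. It does not say whether the cause lies in a node, a cable or a switch, which is why operator follow-up is still needed.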
A cluster interconnect structure can be complex in terms of the number of cables and connectors as well as the number of individual nodes, adapters and switches. For example, configurations can vary from simple two-node clusters to complex topologies with thousands of end-nodes.
Also, it may be necessary to perform a significant number of operations on a large number of individual components to determine state and/or verify connectivity. Hence, for any such cluster interconnect system, simplified and automated verification of correct connectivity and associated troubleshooting is desirable in order to maximize availability and minimize maintenance costs. This applies when a physical configuration is initially constructed, when dynamic connectivity changes take place due to component faults or operator-generated errors, and when a configuration is being expanded or upgraded.
The ability to communicate between any relevant pairs of nodes does not inherently imply that the correct connectivity in terms of optimal topologies and path independence has been achieved. Hence, more sophisticated verification of the interconnect structure is required in order to make sure that the cluster is configured (physically and logically) according to rules of “best practice” for the relevant set of nodes.
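The gap between plain reachability and path independence can be shown with a small sketch. It assumes the interconnect is modelled as an adjacency map and checks whether the loss of any single switch disconnects a pair of hosts; the topology, the host and switch names, and both helper functions are hypothetical.

```python
from collections import deque


def reachable(links, src, dst, excluded=frozenset()):
    """Breadth-first search from src to dst, skipping any excluded components."""
    if src in excluded or dst in excluded:
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for peer in links.get(node, ()):
            if peer not in seen and peer not in excluded:
                seen.add(peer)
                queue.append(peer)
    return False


def single_points_of_failure(links, a, b, switches):
    """Return the switches whose individual loss would disconnect a from b."""
    return [s for s in switches if not reachable(links, a, b, excluded={s})]


# Hypothetical wiring: hostA is cabled to both switches, hostB only to sw1.
links = {
    "hostA": ["sw1", "sw2"],
    "hostB": ["sw1"],
    "sw1": ["hostA", "hostB"],
    "sw2": ["hostA"],
}

print(reachable(links, "hostA", "hostB"))
print(single_points_of_failure(links, "hostA", "hostB", ["sw1", "sw2"]))
```

In this example the two hosts can communicate, so a pure reachability check would pass, yet the pair depends on a single switch. A fuller verification would also consider cables, adapters and the topology rules that apply to the configuration.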
In addition, new techniques for logical partitioning and virtualization, at the physical server node level as well as within the interconnect fabric, add significantly to the complexity associated with constructing and maintaining cluster configurations. Among other things, this imposes a need to distinguish between physical (“mechanical”) connectivity, in terms of cables and connectors between switches and (physical) cluster nodes, and logical connectivity, in terms of which hosts (logical nodes) are able to communicate.
The present invention seeks to provide for improved management and diagnosis of cluster-based systems.