To provide high throughput of work or nearly continuous availability, distributed computing systems are often utilized. A distributed computing system typically includes two or more computing devices which frequently operate somewhat autonomously and communicate with each other over a network or other communication path.
A computing device of a distributed computing system that has the capability of sharing resources is often referred to as a cluster which has two or more nodes, each node having a processor or at least a processor resource, and typically, a separate operating system. One example of a distributed computing system utilizing one or more clusters is a storage area network (SAN) which includes a storage controller.
A storage area network is frequently used to couple computer storage devices such as disk arrays, tape libraries, optical jukeboxes or other storage devices, to hosts in a manner which permits the storage devices to appear to the operating systems of the hosts as locally attached to the hosts. In operation, a host may request data from a storage controller which may in turn retrieve the data from one or more storage devices. The host may also transmit data to the storage controller to be written to one or more storage devices.
Each host communicates with the storage controller through a channel or communication path of the storage area network. Each communication path typically includes one or more physical hardware communication channels such as a digital electronic communication bus, a digital optical communication bus, or a similar communication channel. In addition, each communication path may include one or more logical control blocks, addresses, communication devices, digital switches, and the like for coordinating the transmission of digital messages between the host and the storage controller. Fibre Channel (FC) is often used in storage area networks and is a high speed networking technology in which signals may be transmitted over various transmission media including fiber optic cable or twisted pair copper cables, for example.
A storage controller may have multiple servers which are assigned input/output (I/O) tasks by the hosts. The servers are typically interconnected as nodes of one or more clusters in a distributed computing system, in which each node includes a server often referred to as a central electronics complex (CEC) server.
The I/O tasks may be directed to specific volumes in the storage. The storage controller may further have multiple input/output (I/O) adapters such as host adapters which enable the servers to communicate with the hosts, and device adapters which enable the servers of the storage controller to communicate with the storage devices. Switches may be used to couple selected servers to selected I/O adapters of the storage controller.
A distributed computing system is often referred to as a multi-node environment in which the various nodes communicate with each other by communication paths which link the various nodes together. Thus, in a cloud environment, the nodes of the distributed computing system may include hosts; in a network communication environment, the nodes may include servers; in a storage environment, the nodes may include storage facilities and embedded devices; and so on. Each pair of nodes, together with the communication path linking the two nodes of the pair to each other, is referred to herein as a communication link.
In these environments, each node is typically a computing device installed with an operating system running software applications, including communication applications by which a node can learn the status of some or all of the communication links in the distributed computing system. For example, a node may transmit a “heartbeat” message to another node and wait to receive a corresponding heartbeat message from that node in return. If nodes fail to communicate with each other, there could be a bad node or a bad communication path linking the nodes. In some distributed computing systems, all nodes of the system report the good or bad status of each communication link monitored by the nodes to a common node, which may perform a communication failure isolation process to identify the particular node or communication path whose failure caused the communication failure.
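By way of illustration only, the reporting arrangement described above might be sketched as follows. The text does not specify how the common node performs isolation, so the rule used here (suspect a node when every link it participates in is reported bad, otherwise suspect the individual paths) is an assumption made for this sketch, as are the function name isolate_failure and the layout of the link_status mapping.

```python
from collections import Counter


def isolate_failure(link_status):
    """Hypothetical isolation step run at the common node.

    link_status maps a link, given as a (node_a, node_b) tuple, to True
    (good) or False (bad). The decision rule below is an assumption of
    this sketch, not a method taken from the text.
    """
    bad_links = [link for link, ok in link_status.items() if not ok]
    if not bad_links:
        return None  # nothing to isolate

    # Count, per node, how many links it participates in and how many are bad.
    total = Counter(node for link in link_status for node in link)
    bad = Counter(node for link in bad_links for node in link)

    # Assumed rule: a node with more than one link, all of them bad, is suspect.
    suspects = sorted(n for n in bad if bad[n] == total[n] and total[n] > 1)
    if suspects:
        return ("node", suspects)

    # Otherwise blame the individual communication paths.
    return ("path", sorted(tuple(sorted(link)) for link in bad_links))


# Example report set for three fully connected nodes.
reports = {("A", "B"): False, ("A", "C"): False, ("B", "C"): True}
result = isolate_failure(reports)
```

In this example both links involving node A are reported bad while the link between B and C is good, so the sketch attributes the failure to node A rather than to two separate paths.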
Various techniques have been proposed for identifying the particular node or communication path which is the cause of the communication failure. For example, in one technique, a thread generated by a monitor function on one node may loop through all nodes that it is monitoring to detect “node timeouts” which occur if the difference between the current time and the time of the last heartbeat message received from a particular node by the monitoring node is greater than a threshold value assigned to the particular node by the monitoring node. If the threshold is exceeded for a particular node being monitored, the monitoring node declares that particular node to be “dead” or failed.
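The timeout check described above may be sketched as follows, purely as an illustration. The class name MonitoredNode, the function name check_timeouts, and the use of seconds as the time unit are assumptions of this sketch and do not come from any particular system.

```python
import time
from dataclasses import dataclass


@dataclass
class MonitoredNode:
    name: str
    threshold_s: float     # threshold assigned to this node by the monitoring node
    last_heartbeat: float  # time the last heartbeat from this node was received
    dead: bool = False


def check_timeouts(nodes, now=None):
    """One pass of the monitor loop; returns names newly declared dead."""
    if now is None:
        now = time.monotonic()
    newly_dead = []
    for node in nodes:
        # A node timeout occurs when the elapsed time since the last
        # heartbeat is greater than that node's threshold.
        if not node.dead and now - node.last_heartbeat > node.threshold_s:
            node.dead = True
            newly_dead.append(node.name)
    return newly_dead


# Two monitored nodes with different thresholds, last heard from at t=100.
monitored = [
    MonitoredNode("A", threshold_s=5.0, last_heartbeat=100.0),
    MonitoredNode("B", threshold_s=2.0, last_heartbeat=100.0),
]
first_pass = check_timeouts(monitored, now=103.0)
```

A monitor thread would typically call check_timeouts repeatedly and update last_heartbeat whenever a heartbeat message arrives. Note that this check alone cannot tell a failed node from a failed communication path, since both produce the same missing heartbeats.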