1. Field of Invention
This invention relates to fault detection in a distributed network, and more specifically, to a mechanism for peer-to-peer fault detection.
2. Discussion of Prior Art
In modern systems, a distributed network is defined as a group of systems or nodes, of equal processing capabilities, that can intercommunicate in a networking manner. Each system has its own tasks and processes. Nonetheless, a particular task may require resources available in remote systems, involving the use of a networking environment.
Two clear examples of a distributed network are local area networks (LANs) and distributed control networks. A LAN comprises a group of personal computers, generally controlled by users, that intercommunicate through shared physical media to transfer information among them. In a distributed control network, several similar controllers are programmed to execute a control strategy or algorithm to control a specific process. A complex process is divided into simpler processes suitable for simple control devices, and each controller is responsible for a fraction of the control strategy. The control network coordinates the whole control strategy, since actions performed by a given controller may affect actions performed by a remote controller present in the same control network.
In these examples, close relationships are usually established between nodes, and any node may depend on one or more other nodes. For example, a local area network may include a personal computer responsible for storing information in the form of a database accessible from remote computers. A distributed control network may include a controller responsible for checking the on/off state of an element of the controlled process and for the propagation of said information to other controllers in the network. In both examples, failure of the provider node (i.e., the database server or the sensing controller) affects all other nodes that depend on it.
Finding and fixing an inoperative node is a time-consuming task. In many cases, the entire system must be stopped to replace or fix the defective node. There is a need for a mechanism that minimizes these situations.
Corrective and/or preventive measures can be taken to keep the network in operation. Corrective measures are usually implemented as redundancy. A redundant system involves the use of duplicate equipment (e.g., primary and secondary equipment), such as duplicate physical network lines, power supplies, I/O peripherals, etc. The primary equipment is always in use. If the primary equipment fails, the secondary equipment may become active, and an alarm may be sent to the user or operator indicating failure of the primary equipment. By this mechanism, defective devices can be replaced without stopping the system.
Although redundancy solves the problem, it involves high equipment costs due to duplication, even though the secondary equipment is used infrequently.
A preventive mechanism periodically monitors every node present in the network to keep track of the status of each node. A failure may then be anticipated and fault-correction actions taken in advance, with little or no impact from network failure. The majority of network failures may be avoided by adding special monitoring devices or special diagnostic tests on each node. Compared to redundant measures, this preventive approach represents a cost-effective solution. The preventive approach is the subject of the present invention.
In the recent past, there have been many attempts to solve the monitoring and status report problem. One proposed solution is the addition of special network devices that may ask all or a group of network nodes for their status. Any irregular status, or the absence of a response by a node, is reported to the user or to another station or system capable of handling such a situation. Yet adding specialized devices leads to a situation similar to the physical redundancy solution above, i.e., increasing the cost of establishing a reliable network. Furthermore, the failure of the specialized monitoring device represents the failure of the monitoring mechanism itself, since the fault and status detection functions are centralized in such devices.
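The centralized polling approach described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art device; the names `poll_nodes` and `query_status`, the status strings, and the simulated network are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical status query: returns a status string for a node,
# or None when the node does not respond at all.
StatusQuery = Callable[[str], Optional[str]]


def poll_nodes(node_ids: List[str], query_status: StatusQuery) -> Dict[str, str]:
    """Ask each node for its status once and collect every irregular result."""
    faults: Dict[str, str] = {}
    for node_id in node_ids:
        status = query_status(node_id)
        if status is None:
            # Absence of a response is itself treated as a fault.
            faults[node_id] = "no response"
        elif status != "ok":
            # Any status other than the assumed "ok" value is irregular.
            faults[node_id] = status
    return faults


# Simulated network (assumed data): node B is silent, node C reports a problem.
simulated_network = {"A": "ok", "B": None, "C": "overheated"}
faults = poll_nodes(list(simulated_network), simulated_network.get)
```

In this sketch all detection logic lives in the single caller of `poll_nodes`, which makes the drawback noted above visible: if that one monitoring device fails, no node's faults are detected.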
Rather than using centralized monitoring devices, the fault detection functions may be integrated into the network nodes themselves, thus eliminating the need for external monitoring devices. Every node may have the capability of checking the operation of all other nodes present in the network, maintaining a dynamic table of existing nodes. In a computer network, for example, this mechanism can be easily supported. However, controllers used in a control network do not always possess enough memory to store a dynamic table. This problem worsens as the number of nodes increases, since the size of the dynamic table must increase too. Furthermore, the computation time needed to check the status of all other nodes increases proportionally with the number of nodes.
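The scaling problem of the dynamic-table approach can be seen in a short sketch. The class `FullTableNode`, its method names, and the heartbeat and timeout scheme are hypothetical choices made for illustration only; the text above does not specify how a node checks its peers.

```python
from typing import Dict, List


class FullTableNode:
    """A node that tracks every other node in a dynamic table (illustrative only)."""

    def __init__(self, node_id: str, timeout_s: float) -> None:
        self.node_id = node_id
        self.timeout_s = timeout_s
        # One table entry per known peer: memory grows with the node count.
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, peer_id: str, now: float) -> None:
        """Add or refresh the table entry for a peer."""
        self.last_seen[peer_id] = now

    def find_silent_peers(self, now: float) -> List[str]:
        """Scan the whole table: checking time grows with the node count."""
        return sorted(
            peer_id
            for peer_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Assumed scenario: 100 peers are heard at t=0, only n1 is heard again at t=10.
node = FullTableNode("n0", timeout_s=5.0)
for i in range(1, 101):
    node.record_heartbeat(f"n{i}", now=0.0)
node.record_heartbeat("n1", now=10.0)
silent = node.find_silent_peers(now=12.0)
```

Both the table held in `last_seen` and the scan in `find_silent_peers` grow linearly with the number of peers, which is the memory and computation burden that a small controller may be unable to carry.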
This invention proposes a cost-efficient fault and status detection method suitable for different types of networks, involving low computation time and memory requirements, regardless of the number of nodes in the network.