In a mesh connected local area network, there are usually redundant interconnections between system components so that messages can be routed between any two network members in multiple ways. The network's switches monitor directly connected links and neighboring switches and set up appropriate tables so that messages are routed only through links and switches that are known to be available (i.e., which appear to be working properly). If any switch or link in the network is "not available" (e.g., not working, or disconnected), the network is configured to ignore the existence of these non-working components.
Whenever the network's switches detect a change in what is working, i.e., a component stops working or an additional component becomes available, a distributed "reconfiguration" process is triggered, by which the network redetermines the network topology and recalculates routing information. The processes for monitoring the status of links and switches, and the distributed process for reconfiguring the network, are described at length in U.S. patent application Ser. No. 07/370,285, filed Jun. 22, 1989, entitled High-Speed Mesh Connected Local Area Network.
In general, each switch in the network includes hardware and software for automatically testing the status of the links connected to that switch. Like any self-diagnostic tool, this testing is not perfect: it cannot detect every type of failure, especially intermittent failures. Thus, as in most systems, the ultimate test of whether a component is working is actual use.
It is a premise of the present invention that every change in the status of a component imposes a certain amount of overhead on the system, such as requiring that the system reconfigure itself. Thus, it is often worse to repeatedly cycle back and forth between accepting a system component as working and then learning that it is broken than it would be to simply treat the component as broken. A component with a history of frequent, intermittent failure should only be reinstated when it has demonstrated that it can continuously remain in working condition for a period of time. Attempting to use a component that is broken can be harmful if it causes system users to lose information, or unnecessarily delays their work.
One prior art technique for avoiding interruptions caused by intermittently failing components is to allow only a limited number of failures during a specified amount of time. For instance, one could allow any component to fail no more than ten times per hour. That is, it will be allowed to change from "working" to "broken" status no more than ten times per hour. After ten transitions from "working" to "broken" during any one hour period, the component is simply treated as being "broken" until the end of the one hour period. Then the process starts all over again. Thus, if the component is fixed in the middle of the one hour period, its recovery will be delayed, but the system will be spared possibly hundreds or thousands of failures by the component.
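The prior art technique described above can be sketched as follows. This is only an illustrative sketch, not an implementation taken from any actual system: the class name `FixedWindowFailureLimiter`, its method names, and the use of back-to-back fixed one hour windows measured in seconds are all assumptions made for the example.

```python
class FixedWindowFailureLimiter:
    """Hypothetical sketch: allow at most `max_failures`
    working-to-broken transitions in each fixed time window."""

    def __init__(self, max_failures=10, window_seconds=3600.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.window_start = 0.0   # assumed: first window starts at time 0
        self.failures = 0         # failures counted in the current window

    def _roll_window(self, now):
        # Once the current window has expired, start over with a clean count.
        if now - self.window_start >= self.window_seconds:
            elapsed = int((now - self.window_start) // self.window_seconds)
            self.window_start += elapsed * self.window_seconds
            self.failures = 0

    def record_failure(self, now):
        # Called on each transition from "working" to "broken".
        self._roll_window(now)
        self.failures += 1

    def may_use(self, now):
        # The component is treated as "broken" for the rest of the window
        # once it has used up its allowance of failures.
        self._roll_window(now)
        return self.failures < self.max_failures
```

Under these assumptions, a component that fails ten times in the first ten minutes of a window is held out of service for the remaining fifty minutes, even if it is repaired in the meantime, and is then given a fresh allowance of ten failures.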
Four criteria for properly limiting the failure rate of a component are as follows. (1) A component with a good history must be allowed to fail and recover several times without significant penalty. (2) In the worst case, a component's average long term failure rate must not be allowed to exceed some predetermined low rate. (3) Common behaviors shown by bad components should result in exceedingly low average long-term failure rates. (4) A component that stops being bad must eventually be forgiven its bad history.
The above described prior art "ten failures per hour" mechanism meets requirements 1, 2 and 4. Requirement 1 is met because a low number of failures (e.g., fewer than ten) does not result in the component being unused for a long period of time. Requirement 2 is met because in the worst case, the long term failure rate cannot exceed a specified number of failures per hour. Requirement 4 is met because once a broken component is fixed, and any remaining recovery time period left over from when it was broken expires, its use is no longer prevented.
Requirement 3 distinguishes the present invention from the prior art "ten failures per hour" mechanism. Regardless of the failure mechanism, this prior art technique will still allow a specified number of failures per hour.
The present invention does better than this by providing for a recovery period that increases every time that a component is allowed to be used by the system and then fails. Thus, each time that a component "fools" the monitoring mechanism into allowing the component to be used, only to fail soon thereafter, the recovery time period is automatically increased (up to a predetermined maximum). When a component has worked reliably for a long period of time, the recovery period is decreased for subsequent failures.
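The adjustment of the recovery period can be illustrated with a minimal sketch. The text above states only that the period increases after a failure that soon follows reinstatement, up to a predetermined maximum, and decreases after a long period of reliable operation. The doubling and halving factor, the specific thresholds, and the names `AdaptiveRecoveryPeriod`, `on_failure`, and `on_reinstated` are assumptions chosen for the example, not details taken from the invention.

```python
class AdaptiveRecoveryPeriod:
    """Hypothetical sketch of a recovery period that grows after early
    failures and shrinks after a long run of reliable operation."""

    def __init__(self, min_period=1.0, max_period=1024.0,
                 reliable_uptime=600.0, factor=2.0):
        self.min_period = min_period            # assumed lower bound
        self.max_period = max_period            # predetermined maximum
        self.reliable_uptime = reliable_uptime  # assumed "long period" threshold
        self.factor = factor                    # assumed growth/shrink factor
        self.period = min_period
        self.in_service_since = 0.0             # time of last reinstatement

    def on_reinstated(self, now):
        # The component has been accepted as working again.
        self.in_service_since = now

    def on_failure(self, now):
        """Update the recovery period and return the earliest time at
        which the component may be reinstated."""
        uptime = now - self.in_service_since
        if uptime >= self.reliable_uptime:
            # Worked reliably for a long time: forgive some bad history.
            self.period = max(self.min_period, self.period / self.factor)
        else:
            # Failed soon after being allowed back: lengthen the penalty.
            self.period = min(self.max_period, self.period * self.factor)
        return now + self.period
```

With a multiplicative increase of this kind, a component that fails immediately after every reinstatement quickly reaches the maximum recovery period, so its long term failure rate falls to roughly one failure per maximum period rather than a fixed number of failures per hour.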