Standards such as the Spanning Tree Protocol (STP) and Rapid Spanning Tree Protocol (RSTP) automatically disable and re-enable links to manage traffic flow (e.g., to prevent undesired loops).
In prior efforts, platforms used STP, RSTP, Virtual Router Redundancy Protocol (VRRP), or other Layer 2 (L2) management protocols to detect a fault and then control traffic flow recovery in a switch network attached to one or more processing elements. Such detection is typically applied at the switch level, where local link faults can be observed, usually via an Internet Control Message Protocol (ICMP) heartbeat mechanism over a link or via a link integrity failure. These approaches control traffic flow by disabling unneeded links and re-enabling them when needed. However, the recovery is slow, involves outages, and is limited to link control on the switches only.
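As a rough illustration of the heartbeat-based detection described above, the following Python sketch flags a link as faulted once several consecutive heartbeats have been missed. The class name `HeartbeatMonitor`, the one-second interval, and the three-miss limit are hypothetical choices made for this example and do not come from STP, RSTP, VRRP, or ICMP. No ICMP traffic is actually sent; the caller supplies timestamps, which keeps the sketch deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen per link and flags links that go quiet.

    All names and default values here are illustrative assumptions.
    """

    interval_s: float = 1.0   # assumed heartbeat period, in seconds
    missed_limit: int = 3     # assumed number of missed beats before a fault
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, link: str, now: float) -> None:
        """Note that a heartbeat reply arrived on `link` at time `now`."""
        self.last_seen[link] = now

    def faulted_links(self, now: float) -> list:
        """Return links whose last heartbeat is older than the allowed gap."""
        deadline = self.interval_s * self.missed_limit
        return sorted(
            link
            for link, seen in self.last_seen.items()
            if now - seen > deadline
        )


monitor = HeartbeatMonitor()
monitor.record_heartbeat("port-a", now=0.0)
monitor.record_heartbeat("port-b", now=0.0)
monitor.record_heartbeat("port-b", now=2.5)  # port-b keeps answering
stale = monitor.faulted_links(now=3.5)       # only port-a has gone quiet
```

Because a fault is only declared after the full timeout has elapsed, a scheme of this kind inherently delays recovery, which is consistent with the slowness noted above.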
In other approaches, a single central function (e.g., a master instance) collects and counts local link events and compares the counts against thresholds in order to perform traffic flow recovery on a pair of switches.
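The collect, count, and threshold pattern of such a central function can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class `CentralLinkMonitor`, its methods, and the threshold value are hypothetical and are not taken from any particular prior system.

```python
from collections import Counter


class CentralLinkMonitor:
    """Single master instance that counts link events reported by switches.

    Hypothetical sketch: names and the default threshold are assumptions.
    """

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts = Counter()  # (switch, port) -> number of events seen

    def report_event(self, switch: str, port: str) -> bool:
        """Count one link event; return True when recovery should start."""
        key = (switch, port)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold

    def clear(self, switch: str, port: str) -> None:
        """Reset the count for a port, e.g. after recovery has completed."""
        self.counts.pop((switch, port), None)


master = CentralLinkMonitor(threshold=2)
first = master.report_event("switch-1", "port-4")   # below threshold
second = master.report_event("switch-1", "port-4")  # threshold reached
```

In this arrangement all event state lives in the one master instance, so every switch depends on that instance to trigger recovery.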
Thus, a redundant monitoring technique is needed that operates across rack-based or shelf-based processing communication elements to monitor link paths and to issue notifications that trigger self-healing (auto repair) of local ports on all processing nodes in the system, thereby maximizing system availability.