Network health is key to application availability. Typically, network health is maintained through appropriate investment in infrastructure, e.g., capacity, scalability, redundancy, and performance. In a healthy, well-prepared network, hardware can fail without warning and traffic can still be rerouted through redundant paths. For example, in a typical data center, a router hierarchy (or fabric) consists of a multi-level or multi-layer set of hundreds or thousands of routers that work in conjunction to provide higher availability and redundancy.
Monitors and device traps, such as those based on Simple Network Management Protocol (SNMP), along with Border Gateway Protocol (BGP), can identify when a network device fails completely, i.e., a black failure. Typically, black failures are easily mitigated without significant impact on the underlying applications. However, additional complexity arises when a network device drops some, but not all, packets, i.e., a gray failure. SNMP device traps often cannot detect these packet drops, which are therefore referred to as silent packet drops. For example, if a router is dropping packets, neighboring routers in the hierarchy are not aware of the drops and continue to send traffic to the faulty router. With Transmission Control Protocol (TCP), the sending endpoint retransmits the dropped packets, retrying multiple times before the request is eventually abandoned. This results in unnecessary latency and inefficiency.
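The distinction between black and gray failures can be illustrated with a minimal sketch. It assumes a hypothetical active probe that reports how many packets were sent through a device and how many were delivered. The names `Health` and `classify_device` are illustrative only and do not correspond to any particular monitoring system.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"  # every probe packet was delivered
    GRAY = "gray"        # some, but not all, packets were dropped
    BLACK = "black"      # no packets were delivered (complete failure)


def classify_device(sent: int, delivered: int) -> Health:
    """Classify a device from hypothetical probe counters.

    `sent` and `delivered` are assumed to come from an active probe;
    a trap-based monitor would typically only surface the BLACK case.
    """
    if sent <= 0:
        raise ValueError("at least one probe packet must be sent")
    if not 0 <= delivered <= sent:
        raise ValueError("delivered must be between 0 and sent")
    if delivered == 0:
        return Health.BLACK
    if delivered < sent:
        return Health.GRAY
    return Health.HEALTHY
```

In this sketch, a device that delivers 97 of 100 probe packets is classified as a gray failure, even though it would still appear reachable to its neighbors.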
Unfortunately, because SNMP device traps and other known monitors cannot detect these packet drops, and because a typical hierarchy contains a large number of possible paths, faults can take on the order of tens of hours to detect and localize.
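To give a rough sense of why the number of possible paths grows quickly, the following sketch counts paths through a layered hierarchy. It assumes, purely for illustration, that every router in one layer connects to every router in the next layer, so the path count is the product of the number of candidate routers at each hop. The function name `count_paths` and the layer widths used below are hypothetical.

```python
from math import prod


def count_paths(candidates_per_hop: list[int]) -> int:
    """Count distinct paths through a layered hierarchy.

    Assumes full connectivity between adjacent layers, so each hop
    may pick any router in the next layer independently.
    """
    if any(n <= 0 for n in candidates_per_hop):
        raise ValueError("each hop needs at least one candidate router")
    return prod(candidates_per_hop)
```

Under these assumptions, a three-hop path with 8, 16, and 8 candidate routers at successive hops yields `count_paths([8, 16, 8]) == 1024` distinct paths, any one of which may traverse a single faulty router.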
Overall, the examples herein of some prior or related systems and their associated limitations are intended to be illustrative and not exclusive. Upon reading the following, other limitations of existing or prior systems will become apparent to those of skill in the art.