Computer networks (within a cloud-computing environment, data-center environment, or other environment) generally comprise various interconnected computing devices that can communicate with each other via network packets to exchange data. When a small number of devices is interconnected, the devices can be directly connected to each other. For example, one device can be directly connected to another device via a network link and the devices can communicate by sending packets to one another over the network link. However, directly connecting every pair of devices does not scale to large numbers of devices. Thus, large numbers of devices are typically connected indirectly. For example, one device can be connected to another device via an interconnection network comprising one or more routers.
An interconnection network can be created from a small number of large routers. However, large routers can be expensive, and a small number of them may provide limited redundancy. Instead, an interconnection network can be constructed from lower-cost commodity equipment interconnected as a network fabric. A network fabric can include multiple nodes interconnected by multiple network links. A node can include a networking device that can originate, transmit, receive, forward, and/or consume information within the network. For example, a node can be a router, a switch, a bridge, an endpoint, or a host computer. The network fabric can be architected or organized as a topology of the nodes and links of the communication system. For example, the network fabric can be organized as a multi-tier network fabric such that a packet traversing the network fabric passes through multiple intermediary nodes associated with the different tiers of the multi-tier network.
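As an illustrative sketch only, the idea of nodes and links organized into tiers can be modeled as a small graph. The two-tier "leaf" and "spine" layout, the node names, and the helper functions `build_leaf_spine` and `paths_between` are assumptions chosen for the example and are not taken from the description above; a real fabric may have more tiers and a different wiring pattern.

```python
from itertools import product


def build_leaf_spine(num_leaves, num_spines):
    """Build an adjacency map for a hypothetical two-tier fabric.

    Every lower-tier node ("leaf") has a link to every upper-tier
    node ("spine"). This full-mesh wiring between tiers is an
    assumption made for the example.
    """
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    spines = [f"spine{j}" for j in range(num_spines)]
    links = {node: set() for node in leaves + spines}
    for leaf, spine in product(leaves, spines):
        # Links are bidirectional, so record both directions.
        links[leaf].add(spine)
        links[spine].add(leaf)
    return links


def paths_between(links, src, dst):
    """List the three-node paths from src to dst through one intermediary."""
    return [[src, mid, dst] for mid in sorted(links[src]) if dst in links[mid]]


fabric = build_leaf_spine(num_leaves=4, num_spines=2)
for path in paths_between(fabric, "leaf0", "leaf3"):
    print(" -> ".join(path))
```

With two upper-tier nodes, the sketch finds two distinct paths between the same pair of lower-tier nodes, which is the kind of redundancy a fabric built from many smaller devices can provide.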
Typically, such networks provide a variety of options for redundancy and can tolerate faults while continuing to run. However, quick detection of failing components in a repeatable and reliable way is desirable to maximize system throughput. Gray failures often defy detection because they are, by definition, caused by failing components that produce infrequent or irregular errors. Examples of gray failures include random packet loss, non-fatal errors, and random I/O errors. Gray failures are more difficult to detect because the device has not completely failed. Thus, traditional failure-detection models, which assume that a device either has completely failed or is operating correctly, can often leave gray failures largely undetected.
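The gap between the two models can be shown with a minimal sketch. It compares a binary check, which reports a failure only when every probe is lost, with a check that also looks at the fraction of lost probes. The function names, the probe representation (a list of booleans), and the 1% loss threshold are all hypothetical choices made for illustration; they do not represent any particular detection technique.

```python
def binary_health(probe_results):
    """Traditional two-state model: 'failed' only if no probe succeeded."""
    return "healthy" if any(probe_results) else "failed"


def loss_rate_health(probe_results, loss_threshold=0.01):
    """Three-state model that also flags partial loss as 'gray'.

    loss_threshold is an assumed tuning value, not a recommended setting.
    """
    if not probe_results:
        raise ValueError("at least one probe result is required")
    loss_rate = probe_results.count(False) / len(probe_results)
    if loss_rate == 1.0:
        return "failed"
    if loss_rate > loss_threshold:
        return "gray"
    return "healthy"


# Hypothetical device that drops every 50th probe (2% loss) but is never fully down.
probes = [(i % 50) != 0 for i in range(1000)]

print(binary_health(probes))     # the binary model sees a working device
print(loss_rate_health(probes))  # the loss-rate model flags a gray failure
```

In this example the binary model reports the device as healthy even though it drops 2% of probes, which illustrates why a model that assumes only two states can miss gray failures.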