Datacenter networks often comprise tens of thousands of components including servers, links, switches and routers. To reduce capital expenses, many datacenters are being built with inexpensive commodity hardware. As a result, network failures are relatively frequent, as commodity devices are often unreliable.
Diagnosing and repairing datacenter networks failures in a timely manner is a challenging datacenter management task. Traditionally, network operators follow a three-step procedure to react to network failures, namely detection, diagnosis and repair. Diagnosis and repair are often time-consuming, because the sources of failures vary widely, from faulty hardware components to software bugs to configuration errors. Operators need to consider many possibilities just to narrow down potential root causes.
Although some automated tools exist to help localize a failure to a set of suspected components, operators still have to manually diagnose the root cause and repair the failure. Some of these diagnoses and repairs need third-party device vendors' assistance, further lengthening the failure recovery time. Because of the above challenges, it can take a long time to recover from disruptive failures, even in well-managed networks.