A data center is a collection of computing devices that communicate with one another over a network and operate in conjunction to provide computing services and/or data storage services to one or more end users, where an end user can be an individual, an enterprise, or the like. The data center therefore includes numerous computing devices; numerous network infrastructure devices, such as routers, re-routers, switches, gateways, firewalls, virtual private networks (VPNs), bridges, etc.; communications links between computing devices and network infrastructure devices; and communications links between network infrastructure devices. When the aforementioned services are provided, data is transmitted through the network and between computing devices in the data center. The network infrastructure devices are configured to direct traffic through the network.
In conventional data centers, the network infrastructure devices include high-end devices, which tend to be relatively expensive. Recently, however, data centers have been configured to include numerous commodity (e.g., off-the-shelf) network infrastructure devices to decrease capital costs associated with the data center. While these commodity devices cost less than the high-end devices, they tend to be somewhat less reliable, resulting in an increased burden on data center operators to ensure uninterrupted service. Resolving network failures, however, can be complex and thus time-consuming for several reasons: network infrastructure devices in a data center can be manufactured by numerous different manufacturers, computing and/or network devices in the data center may have different operating systems installed thereon, a manufacturer may produce different models of the same type of device, etc. Thus, there is a significant amount of heterogeneity in conventional data centers.
In relatively large data centers, an operations team is employed to ensure that the computing services and storage services promised to end users (e.g., in Service Level Agreements) are being met. Accordingly, when a network device (e.g., a computing device or a network infrastructure device) generates an alarm, the alarm is directed to an operator console monitored by an operator on the operations team. The operator reviews the alarm and, based upon personal knowledge and experience (and possibly some static guidelines), performs troubleshooting and debugging to try either to mitigate the failure indicated by the alarm (without diagnosing it) or to fix the failure (by diagnosing its root cause). While this approach may be suitable for relatively small data centers, such an approach does not scale. For example, data centers are scaling to include hundreds of thousands of computing devices and several thousand network infrastructure devices. When particular events occur, a large number of alarms can be generated by devices in the data center in a relatively short amount of time. The operator must parse through the alarms to prioritize which alarms are to be addressed first, and then typically uses a trial-and-error approach (potentially driven by pre-defined human-generated guidelines) to address alarms believed to be high priority. Due to the relatively high complexity of potential network problems, the operator may require a prolonged troubleshooting time window, which may result in service downtime.