Network reliability and security are top concerns for operational networks. To ensure the performance of a network, operators conventionally utilize a wide range of measurement tools that continuously monitor the behavior or network parameters of various network elements. A large set of network data may be collected for troubleshooting performance issues. For example, the network data may include router configuration, simple network management protocol (SNMP) performance statistics (e.g., central processing unit (CPU)/memory utilization, packet/byte counts, etc.), router command logs and error logs, routing update trees, end-to-end latency and loss measurements, traffic traces, etc. These data sources may contain information relating to the health of a network. The ability to detect unusual events (i.e., anomalies) in the data sources may serve as a basis for troubleshooting network performance issues. Operators often conduct further investigation of anomalous events to obtain details for network diagnosis and planning, to provide guidelines for service provisioning and billing, to gain insights for future network architectural design, etc.
Conventional anomaly detection systems focus on analyzing a single data source (e.g., traffic volumes, routing updates, etc.) in isolation. However, this approach has major drawbacks that prevent it from being widely used in network operations. For example, such a system typically must be specifically designed or manually tuned, based on the available data and on domain knowledge, in order to achieve the anomaly detection performance usually required in operational practice. In another example, even with finely tuned parameters, this approach may still generate false alarms that require further manual examination. The lack of a scalable and automated network anomaly detection system forces operators either to rely on naïve approaches (e.g., simple thresholding) or to manually conduct visual anomaly detection on a small scale. As a result, the ability of operators to detect and diagnose large-scale network events is greatly limited.
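By way of illustration only, the simple thresholding mentioned above may be sketched as follows. The function name `threshold_anomalies`, the parameter `k`, and the sample CPU utilization values are hypothetical and are not taken from any particular system; the sketch assumes a single data source examined in isolation and flags samples that deviate from the mean by more than `k` standard deviations.

```python
from statistics import mean, stdev


def threshold_anomalies(samples, k=3.0):
    """Return indices of samples more than k standard deviations from the mean.

    Hypothetical helper for illustration; k is a manually chosen threshold.
    """
    if len(samples) < 2:
        return []  # a standard deviation needs at least two samples
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation
    if sigma == 0:
        return []  # all samples identical, nothing to flag
    return [i for i, x in enumerate(samples) if abs(x - mu) > k * sigma]


# Hypothetical CPU utilization readings (percent) from one router.
cpu_utilization = [41, 43, 40, 42, 44, 41, 97, 42, 43, 40]

print(threshold_anomalies(cpu_utilization, k=2.0))  # flags the spike at index 6
print(threshold_anomalies(cpu_utilization, k=3.0))  # flags nothing
```

In this sketch the same spike is reported with `k=2.0` but missed with `k=3.0`, because the spike itself inflates the standard deviation. This sensitivity to a manually chosen parameter is one instance of the tuning burden described above.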