A data center is a facility used to house computing hardware that can include servers, storage systems and telecommunications equipment. Typically a data center comprises multiple servers. Servers can be physical, virtual, or cloud-based machines. A data center, or more generally a computing system, can comprise software such as application software, system software and the like. The software can run on the servers.
A data center can comprise large clusters of related servers. As data center infrastructure grows, it becomes important to monitor the configuration and health of the servers automatically, and to alert an operator when anomalies occur. The configuration of the data center can include the configuration of hardware and software. The software configuration can include, for example, the configuration of application software and system software running on the servers.
A data center is a dynamic environment, and anomalies can occur frequently in data center operations. Anomalies can be associated with software, hardware, or both. Software is commonly under continuous deployment, so the software environment changes frequently, for example when software is updated to a new version. Hardware changes are also frequent, with machines being spun up and down, especially in virtual or cloud environments.
The environment can be chaotic and can overwhelm manual detection methods. It is not practical for an operator to watch operations continuously for anomalies. There is a need for methods and systems for automated anomaly detection in data center operations.
Anomalies can be hard to detect, and they can cause significant disruption in computer systems and networks. Considerable effort can be spent trying to find them. It is advantageous to have efficient, automated ways to find anomalies in a timely fashion.
Examples of the detection of anomalies by monitoring, analysis, or data mining of system event logs have been discussed in earlier work. The system and method described herein relate to anomaly detection through static and dynamic analysis of files, packages (such as installed software applications), and metadata.
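To make the idea of analyzing packages across related servers concrete, the following is a minimal sketch and not the method described herein. It assumes a hypothetical inventory that maps each server name to its installed packages and versions, and it flags any server whose version of a package differs from the version most common in the cluster. The function name `find_package_outliers`, the `<missing>` marker, and the majority-vote rule are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical marker for a package that a server does not have installed.
MISSING = "<missing>"


def find_package_outliers(
    inventory: Dict[str, Dict[str, str]],
) -> List[Tuple[str, str, str, str]]:
    """Return (server, package, observed_version, majority_version) tuples.

    `inventory` maps server name -> {package name -> version string}.
    A server is flagged for a package when its version differs from the
    version held by the largest number of servers. On a tie, the version
    seen first wins, which is an arbitrary choice made for this sketch.
    """
    if not inventory:
        return []

    # Every package seen on at least one server.
    packages = sorted({pkg for pkgs in inventory.values() for pkg in pkgs})

    outliers: List[Tuple[str, str, str, str]] = []
    for package in packages:
        # Count how many servers hold each version of this package.
        versions = Counter(
            pkgs.get(package, MISSING) for pkgs in inventory.values()
        )
        majority_version, _count = versions.most_common(1)[0]

        for server in sorted(inventory):
            observed = inventory[server].get(package, MISSING)
            if observed != majority_version:
                outliers.append((server, package, observed, majority_version))
    return outliers


if __name__ == "__main__":
    # Illustrative data only; server and package names are made up.
    cluster = {
        "web1": {"openssl": "3.0.2", "nginx": "1.24"},
        "web2": {"openssl": "3.0.2", "nginx": "1.24"},
        "web3": {"openssl": "1.1.1", "nginx": "1.24"},
    }
    for row in find_package_outliers(cluster):
        print(row)
```

A majority vote is only one possible baseline. It suits clusters of servers that are expected to be configured identically, and it says nothing about which version is actually correct.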
Earlier work also discloses threshold-based approaches, for example alerting an operator when a disk is 90% full. The system and method described herein can identify and measure trends, and can anticipate problems before thresholds are triggered.
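The difference between a threshold alert and trend-based anticipation can be illustrated with a short sketch. This is not the method described herein. It assumes disk usage is sampled as hypothetical (hour, percent-used) pairs, fits a straight line by ordinary least squares, and projects how long until the 90% threshold would be reached. The function names and the linear-growth assumption are illustrative only.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in hours, disk usage in percent)


def threshold_alert(latest_usage: float, threshold: float = 90.0) -> bool:
    """Classic threshold check: fires only once usage has reached the limit."""
    return latest_usage >= threshold


def fit_linear_trend(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares line through the samples: (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    slope = cov / var_t
    intercept = mean_u - slope * mean_t
    return slope, intercept


def hours_until_threshold(
    samples: Sequence[Sample], threshold: float = 90.0
) -> Optional[float]:
    """Projected hours from the last sample until usage reaches `threshold`.

    Returns 0.0 if the fitted line is already at or above the threshold,
    and None if usage is flat or falling, so no crossing is projected.
    """
    slope, intercept = fit_linear_trend(samples)
    last_t = samples[-1][0]
    fitted_now = slope * last_t + intercept
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope


if __name__ == "__main__":
    # Illustrative readings only: usage grows by 2 points per hour.
    readings = [(0.0, 50.0), (1.0, 52.0), (2.0, 54.0), (3.0, 56.0)]
    print("threshold alert now:", threshold_alert(readings[-1][1]))
    print("projected hours to 90%:", hours_until_threshold(readings))
```

With these readings the threshold check stays silent because usage is only 56%, while the trend projection already reports a finite time until the limit is reached. A straight-line fit is the simplest possible trend model, and real usage patterns may call for something more robust.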