Electronic computing has evolved from primitive, vacuum-tube-based computer systems, initially developed during the 1940s, to modern electronic computing systems in which large numbers of multi-processor computer systems, such as server computers, workstations, and other individual computing systems, are networked together with large-capacity data-storage devices and other electronic devices to produce geographically distributed computing systems with hundreds of thousands, millions, or more components that provide enormous computational bandwidths and data-storage capacities. These large, distributed computing systems are made possible by advances in computer networking, distributed operating systems and applications, data-storage appliances, computer hardware, and software technologies.
In order to proactively manage a distributed computing system, system administrators are interested in detecting anomalous behavior in the operation of the distributed computing system. Management servers have been developed to collect thousands of different metrics from the numerous and varied resources of a distributed computing system, and to collect event messages from the numerous and varied event sources running in the distributed computing system. Examples of resources include virtual and physical resources, such as CPU, memory, data storage, and network. Examples of the types of metric data include CPU usage, memory usage, data-storage usage, and network traffic of a virtual or a physical object. An event source can be an application program, an operating system, a virtual machine, or a container. Each event message describes an event, which can be a status report, input, output, warning, fault, or error in the execution of the event source. However, metric data and event messages are recorded by management servers at a high frequency, such as a sub-second frequency, creating high-density data sets. As a result, the data sets can become extremely large, which increases the cost of data storage and processing. In addition, processing the extremely large data sets pushes the limits of the memory, CPU, and input/output capacity of the server computers on which the management servers run. This drastically slows the determination of behavior patterns, detection of anomalies, identification of problems, and characterization of the data, and in turn slows the implementation of responses to those patterns, anomalies, and problems. System administrators seek methods and systems to analyze the enormous amounts of metric data and event messages.
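To give a rough sense of scale, the following minimal sketch estimates how many metric samples a management server might record in a day. The object count, metrics per object, sampling interval, and bytes per sample are all hypothetical values chosen for illustration and are not taken from any particular system; the function names are likewise illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(num_objects: int, metrics_per_object: int,
                    interval_seconds: float) -> int:
    """Estimate the number of metric samples recorded in one day.

    Every argument is a hypothetical input; real deployments vary widely.
    """
    samples_per_series = SECONDS_PER_DAY / interval_seconds
    return round(num_objects * metrics_per_object * samples_per_series)


def storage_bytes_per_day(samples: int, bytes_per_sample: int = 16) -> int:
    """Estimate raw storage, assuming a fixed (hypothetical) size per sample,
    for example an 8-byte time stamp plus an 8-byte value."""
    return samples * bytes_per_sample


# Hypothetical scenario: 10,000 objects, 100 metrics each, sampled every 0.5 s.
daily_samples = samples_per_day(10_000, 100, 0.5)
daily_bytes = storage_bytes_per_day(daily_samples)
print(f"{daily_samples:,} samples/day, about {daily_bytes / 1e12:.2f} TB/day raw")
```

Under these assumed values, the estimate is on the order of hundreds of billions of samples and a few terabytes of raw data per day, before any event messages are counted.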