The complexity of current computing systems and the applications provided therein is quickly outgrowing the human ability to manage them at an economic cost. For example, it is common to find data centers with thousands of host computing systems servicing hundreds to thousands of applications and components that provide web, computation, and other services. In such distributed environments, diagnosis of failures and performance problems is an extremely difficult task for human operators. To facilitate diagnosis, commercial and open source management tools have been developed to measure and collect data from systems, networks, and applications in the form of system metrics (i.e., data measurements), application metrics, and system and application event logs. However, given the large amounts of data collected, the operator is faced with the daunting task of manually going through the data, a task that is becoming unmanageable. These challenges have led researchers to propose the use of automated machine learning and statistical learning theory methods to aid in the detection, diagnosis, and repair of problems in distributed systems and applications.
As referred herein, system and application event logs (hereinafter, “event logs” or “logs”) are records of system (both hardware and software) and application (software) events that have taken place in a system. Examples of logged events include, but are not limited to, failures to start a component or complete an action, system or application performance reaching predetermined thresholds, system or application errors, security events, and network connection events. Each event entry typically includes a date stamp, a time stamp, and a message detailing the event. Unlike system metrics and application metrics, which contain structured numeric data, event logs are semi-structured and typically contain free-text information. Event logs are essentially text messages written by the developers of the system and application, so there are potentially many different messages. For example, it was found that there were more than 280,000 distinct event messages (after removing timestamps and fields containing numerical symbols only) in the event logs collected on one instance of an Information Technology (IT) system in a 9-month period.
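By way of illustration only, the counting of distinct event messages described above (after removing timestamps and fields containing numerical symbols only) may be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure actually used to obtain the figure cited above: the entry layout (a date stamp, a time stamp, then a free-text message separated by whitespace), the regular expressions, the function names `normalize_message` and `count_distinct_messages`, and the sample entries are all hypothetical.

```python
import re
from collections import Counter

# Assumed entry layout: "<date stamp> <time stamp> <free-text message>".
# Real event logs vary widely; this pattern is illustrative only.
ENTRY_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(.*)$")

# Assumed definition of a "numerical symbols only" field: at least one digit,
# and nothing other than digits and common numeric punctuation.
NUMERIC_FIELD_RE = re.compile(r"^(?=.*\d)[\d.,:+\-/%]+$")


def normalize_message(entry):
    """Strip the date and time stamps, then drop purely numeric fields."""
    text = entry.strip()
    match = ENTRY_RE.match(text)
    # Fall back to the whole entry when the assumed layout does not match.
    message = match.group(3) if match else text
    fields = [f for f in message.split() if not NUMERIC_FIELD_RE.match(f)]
    return " ".join(fields)


def count_distinct_messages(entries):
    """Map each normalized message to its number of occurrences."""
    return Counter(normalize_message(e) for e in entries if e.strip())


# Hypothetical sample entries, invented for this sketch.
sample_entries = [
    "2008-03-01 12:00:01 Failed to start component scheduler after 3 attempts",
    "2008-03-01 12:05:17 Failed to start component scheduler after 5 attempts",
    "2008-03-02 08:30:00 Connection from 10.0.0.7 closed",
]

message_counts = count_distinct_messages(sample_entries)
print(len(message_counts), "distinct messages")
```

In this sketch, the two "Failed to start" entries collapse into a single distinct message because they differ only in a numeric field, so the three sample entries yield two distinct messages. Fields that mix letters and digits are deliberately left intact, which is one reason the number of distinct messages in a real log can remain very large.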
Some prior solutions for diagnosing and repairing distributed systems and applications involve the use of search engines (e.g., as available from the Splunk Company of San Francisco, Splunk.com) or analysis modules (e.g., as available from LogLogic, Inc. of San Jose, Calif., loglogic.com) to perform indexing and parsing of the logs, whereby users have to provide adequate search queries to find desired information about the system or application health in the logs. Other prior solutions simply provide analyses of logs without correlating them with a defined application or system health, and typically require a priori knowledge of the log structures and the types of log messages. As a result, such solutions find many types of data patterns in the logs that may not be important for diagnosing or forecasting system or application behavior.