The amount of data generated by various machines (e.g., appliances, servers, software tools, etc.) connected in an organization is enormous. The machine-generated data may be in a structured textual format, an unstructured textual format, or a combination thereof. Examples of such machine-generated textual data include logs, metrics, configuration files, messages, spreadsheets, events, alerts, sensory signals, audit records, database tables, and so on.
The vast amount of machine-generated textual data requires information technology (IT) personnel to review, analyze, and respond to countless emails, messages, notifications, alerts, tickets, and the like in order to identify a specific malfunction. The ability of a person (e.g., an IT administrator) to react to such high volumes of data is limited by the rate of processing, skills, and experience of the person. Further, responding to malfunction incidents requires knowledge not only of the type of malfunction, but also of the type of response that is suitable for resolving the incident. For simple, frequent problems, it may be possible to build a “playbook” for responding to incidents. However, such playbooks would require countless escalation directives, many of which may be invoked frequently. As a result, determining responses manually becomes impractical, since the number of combinations of machines, types of malfunctions, and appropriate responses grows exponentially as the number and complexity of the machines increase.
More specifically, human operators face several significant challenges that prevent them from effectively identifying and addressing incidents. First, a problem is often reported as an open-ended symptom that may be the result of a long, branching series of events. For example, a user report indicating that the user interface of an application is unusually slow may have multiple branching potential causes. Second, problems are often indicated by metrics (e.g., CPU, RAM, disk, network, etc.) that provide little information about the problem itself. Third, the point at which a human operator begins investigating an incident is often so far removed from the root cause that determining the root cause is practically impossible. Due to these and other challenges, playbooks for resolving incidents usually focus on gathering data and improving the starting point of the investigation rather than on solving the problem that caused the incident.
As an example, when a human operator reviews a ticket stating that the “workstation takes very long to log in,” that text provides little context as to the cause of the slow login. Although the human operator may have access to a knowledge management system, the operator likely does not know which information to request in order to find a past incident with a relevant resolution. As another example, when the human operator investigates an incident that is rich with information (e.g., an incident resulting in 20 different anomalies, with some anomalies occurring in a database and others occurring in webservers), the incident may be too complex for the human operator to parse, let alone to respond to sufficiently quickly.
Moreover, a user who needs to process such large volumes of data may wish to gain visibility into the performance of the enterprise's entire IT systems and to determine a root cause for each reported malfunction. To determine the causality between reported alerts, data received from different domains (e.g., network, infrastructure, and application) must be processed. Each such domain has its own domain experts, which further amplifies the challenge of determining the root cause of each incident. For example, the machine-generated textual data may include readings indicative of high CPU utilization and security logs indicative of new viruses. Currently, IT personnel have no effective way to determine any causality between these reported inputs.
Existing solutions cannot resolve the deficiencies noted above because such solutions operate in silos. That is, the creation of machine-generated textual data and the reading of such data are performed by different solutions (components), which are not necessarily developed by the same vendors. Furthermore, some existing solutions for digital event ingestion merely aggregate machine-generated data and provide search capabilities across the aggregated data. Other solutions are limited to processing a specific set of textual data generated by common tools. However, such solutions typically do not cover the entire spectrum of machines installed in an organization and are not adapted to cover the entire set of logs, events, and other data generated by the machines. Therefore, meaningful and important information may not be detected or otherwise analyzed by such solutions.
Existing solutions fail to detect a root cause of a malfunction and, therefore, cannot determine appropriate recommendations for addressing root causes. A malfunction can be a system error or failure, an application error, and the like. This deficiency is largely due to at least the following challenges: the need to simultaneously query multiple data sources storing data in different structures; the structure of machine-generated data not being standardized; the data being formatted with the intention that it be ingested by a human rather than a computer; the machine-generated data including a mixture of the original events wrapped with unrelated additional information (e.g., Syslog transport headers added by relay servers); and the same data being serialized in several formats (e.g., JSON, XML).
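A minimal sketch may help illustrate the serialization challenge described above. The event contents, field names, and header pattern below are hypothetical (not taken from any particular system): the same login-failure event arrives once wrapped in a Syslog transport header added by a relay server, once as JSON, and once as XML, and must be reduced to one common structure before any cross-source analysis is possible.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical example: one event, three serializations.
SYSLOG_LINE = "<34>Oct 11 22:14:15 relay01 app: user=alice action=login status=failed"
JSON_LINE = '{"user": "alice", "action": "login", "status": "failed"}'
XML_LINE = "<event><user>alice</user><action>login</action><status>failed</status></event>"

def normalize(raw: str) -> dict:
    """Best-effort parse of one machine-generated line into a flat dict."""
    raw = raw.strip()
    # Strip a BSD-style Syslog priority/header prefix, if one is present.
    m = re.match(r"^<\d+>\w{3}\s+\d+\s+[\d:]+\s+\S+\s+\S+:\s*(.*)$", raw)
    if m:
        raw = m.group(1)
    if raw.startswith("{"):   # JSON serialization
        return json.loads(raw)
    if raw.startswith("<"):   # XML serialization
        return {child.tag: child.text for child in ET.fromstring(raw)}
    # Fall back to key=value pairs common in plain-text logs.
    return dict(kv.split("=", 1) for kv in raw.split())

# All three inputs reduce to the same structured record.
records = [normalize(line) for line in (SYSLOG_LINE, JSON_LINE, XML_LINE)]
```

Even this toy normalizer must encode format detection, header stripping, and per-format parsing; real machine-generated data, with its many undocumented variants, makes such manual normalization far harder, which is the deficiency the passage above describes.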
Moreover, some solutions utilize manually provided descriptions of incidents in an attempt to better categorize incidents and, consequently, mitigating actions. However, these descriptions are frequently inaccurate, particularly when users confuse such descriptions and recommended mitigation actions with feedback. This leads to misclassification of incidents and, therefore, of the corresponding recommendations. Further, different users may provide variations on the same descriptions, which may result in descriptions that are essentially the same being applied differently. Thus, any relationships between incidents and descriptions thereof ultimately rely on manual inputs and are therefore often inaccurate.
As a result of the deficiencies of existing solutions, machine-generated textual data is often analyzed by humans. Any such manual analysis is prolonged, requires human resources, and affects the overall performance of the enterprise. A major drawback of this approach is that the amount of data that can be processed by users such as IT personnel is limited by constraints on human resources. Given the size, variety, retention requirements, and dynamic, ever-growing nature of machine-generated data, a manual approach to the above-noted tasks is inefficient.
It would therefore be advantageous to provide a solution that would overcome the deficiencies of the prior art.