Modern data centers often include thousands of hosts that operate collectively to service requests from even larger numbers of remote clients. During operation, components of these data centers can produce significant volumes of machine-generated data. In order to reduce the size of the data, it is typically pre-processed before it is stored. In some instances, the pre-processing includes extracting and storing some of the data, but discarding the remainder of the data. Although this may save storage space in the short term, it can be undesirable in the long term. For example, if the discarded data is later determined to be of use, it may no longer be available.
In some instances, techniques have been developed to apply minimal processing to the data in an attempt to preserve more of the data for later use. For example, the data may be maintained in a relatively unstructured form to reduce the loss of relevant data. Unfortunately, the unstructured nature of much of this data has made it challenging to perform indexing and searching operations because of the difficulty of applying semantic meaning to unstructured data. As the number of hosts and clients associated with a data center continues to grow, processing large volumes of machine-generated data in an intelligent manner and effectively presenting the results of such processing continue to be priorities. Moreover, processing of the data may return a large amount of information that can be difficult for a user to interpret. For example, if a user submits a search of the data, the user may be provided with a large set of search results but may not know how the search results relate to the data itself or to one another. As a result, the user may have a difficult time deciphering which portions of the data or the search results are relevant to the user's inquiry.
Determining which results (specifically events, as explained in greater detail below) are anomalous and presenting that information to the user is likely to help the user focus on the more relevant data in the search result. This is because an anomalous event is more likely to contain clues regarding solutions to a problem that the user may be attempting to address in conducting an inquiry. Existing systems provide a method, such as the one using the AnomalousValues command, to determine which events are anomalous. This existing method does not involve determining or calculating the probability of occurrence of the events themselves. Instead, it determines the probability of occurrence of individual field values and designates as anomalous an event containing the field value with the lowest probability of occurrence. Alternatively, it may designate a set of events as anomalous, where each event in the set contains a field value with one of the lowest probabilities of occurrence. However, the presence of the field value with the lowest probability of occurrence is not an accurate indicator of the probability of occurrence of the event as a whole. In other words, an event that contains the field value with the lowest probability of occurrence is not necessarily an anomalous event. For example, a first event containing the field value with the lowest probability of occurrence may also contain many other field values with high probabilities of occurrence, whereas a second event that does not contain that field value may contain many field values with low probabilities of occurrence. In such an example, the second event may be more anomalous than the first event. Accordingly, the existing method described above may inaccurately designate an event as anomalous.
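The shortcoming described above can be illustrated with a minimal sketch. The event data, field names, and function names below are hypothetical and chosen only for illustration. The first function mimics the field-value-based designation in simplified form and is not the actual AnomalousValues implementation. The second function uses a per-event score, the product of the event's field-value probabilities under a naive independence assumption; this is merely one possible event-level measure and is not taken from the text above.

```python
from collections import Counter
from math import prod

# Hypothetical events: index 0 holds the single rarest field value (status=500),
# index 1 holds no rarest value but two fairly rare ones (host=web9, user=zed).
events = [
    {"status": "500", "host": "web1", "user": "alice"},  # first event
    {"status": "200", "host": "web9", "user": "zed"},    # second event
    {"status": "200", "host": "web9", "user": "alice"},
    {"status": "200", "host": "web1", "user": "zed"},
] + [{"status": "200", "host": "web1", "user": "alice"} for _ in range(6)]


def field_value_probabilities(events):
    """Estimate P(field=value) as the fraction of events carrying that value."""
    counts = Counter((f, v) for e in events for f, v in e.items())
    return {fv: c / len(events) for fv, c in counts.items()}


def rarest_value_event(events, probs):
    """Simplified field-value approach: pick the event whose rarest
    field value has the lowest probability of occurrence."""
    return min(
        range(len(events)),
        key=lambda i: min(probs[(f, v)] for f, v in events[i].items()),
    )


def lowest_joint_score_event(events, probs):
    """Illustrative event-level score (an assumption, not the source's method):
    product of the event's field-value probabilities, treating fields as independent."""
    return min(
        range(len(events)),
        key=lambda i: prod(probs[(f, v)] for f, v in events[i].items()),
    )


probs = field_value_probabilities(events)
by_value = rarest_value_event(events, probs)
by_event = lowest_joint_score_event(events, probs)
print("flagged by rarest field value:", by_value)
print("flagged by event-level score: ", by_event)
```

In this hypothetical data set, the two approaches designate different events: the field-value approach selects the first event because of its single rare status value, while the event-level score selects the second event, whose field values are each less rare but are rare in combination.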