The rapid increase in the production and collection of machine-generated data has created relatively large data sets that are difficult to search. The machine data can include sequences of time-stamped records that may occur in one or more, usually continuous, streams. Further, machine data often represents some type of activity made up of discrete events.
Searching machine data requires different ways to express searches. Search engines today allow users to search by the most frequently occurring terms or keywords within the data and generally have little notion of event-based searching. Given the large volume and repetitive characteristics of machine data, users often need to start by narrowing the set of potential search results using event-based search mechanisms and then, through examination of the results, choose one or more keywords to add to their search parameters. Timeframes and event-based metadata such as frequency, distribution, and likelihood of occurrence are especially important when searching machine data, but are difficult to obtain with current search engine approaches.
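By way of illustration only, the two-stage workflow described above (narrow by timeframe, examine the results, then add a keyword) might be sketched as follows. The record contents and the helper names `in_window`, `term_frequencies`, and `keyword_filter` are hypothetical and are not drawn from any particular search engine.

```python
from collections import Counter
from datetime import datetime

# Hypothetical time-stamped records: (ISO timestamp, raw event text).
RECORDS = [
    ("2024-01-01T10:00:00", "login ok user=alice"),
    ("2024-01-01T10:00:05", "disk error device=sda"),
    ("2024-01-01T10:00:09", "disk error device=sdb"),
    ("2024-01-01T11:30:00", "login ok user=bob"),
]


def in_window(records, start, end):
    """Stage 1: keep only events whose timestamp falls in [start, end)."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    return [(ts, text) for ts, text in records
            if lo <= datetime.fromisoformat(ts) < hi]


def term_frequencies(records):
    """Count whitespace-separated terms so a user can pick a keyword."""
    counts = Counter()
    for _, text in records:
        counts.update(text.split())
    return counts


def keyword_filter(records, keyword):
    """Stage 2: keep only events that contain the chosen keyword."""
    return [(ts, text) for ts, text in records if keyword in text.split()]


window = in_window(RECORDS, "2024-01-01T10:00:00", "2024-01-01T10:01:00")
freq = term_frequencies(window)
hits = keyword_filter(window, "error")
```

In this sketch the time window reduces the candidate set before any keyword is chosen, and the term counts over the narrowed set stand in for the event-based metadata a user would examine.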
Also, users often generate arbitrary queries to produce statistics and metrics about selected data fields that may be included in the data. Indexing may enable raw data records to be identified quickly, but operations that examine or scan the individual data records may become prohibitively expensive as the size of the data set grows. Certain Extract, Transform, Load (ETL) based database systems in use today allow for data to be transformed during the data ingestion process so that it is stored in a format and structure suitable for querying and analysis. The shortcoming of such database systems is that certain information, e.g., select data fields not designated for extraction by a user at data ingestion time, is discarded and, therefore, cannot be retrieved if required at a later time. Consequently, a user needs to pre-specify the data fields that need to be extracted from the raw data records at data ingestion time, which makes these database systems rather inflexible. As storage capacity becomes cheaper, there are fewer incentives to discard the unused portions of the raw data records. Thus, systems that can search relatively large sets of data are the subject of considerable innovation.
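As a non-limiting illustration of the inflexibility described above, the following sketch contrasts extraction at ingestion time with extraction from retained raw records at query time. The record layout, the `key=value` field syntax, and the function names `etl_ingest` and `extract_at_search_time` are assumptions made for the example only.

```python
import re

# Hypothetical raw records using an assumed key=value field syntax.
RAW = [
    "2024-01-01T10:00:00 status=200 bytes=512 user=alice",
    "2024-01-01T10:00:05 status=500 bytes=128 user=bob",
    "2024-01-01T10:00:09 status=200 bytes=256 user=alice",
]

FIELD_RE = re.compile(r"(\w+)=(\S+)")


def etl_ingest(raw_records, wanted):
    """Ingestion-time extraction: keep only pre-specified fields.

    Everything not named in `wanted` is discarded and is not recoverable
    from the returned store.
    """
    return [
        {k: v for k, v in FIELD_RE.findall(line) if k in wanted}
        for line in raw_records
    ]


def extract_at_search_time(raw_records, field):
    """Query-time extraction: pull any field from the retained raw text."""
    values = []
    for line in raw_records:
        fields = dict(FIELD_RE.findall(line))
        if field in fields:
            values.append(fields[field])
    return values


# The user designated only "status" at ingestion time.
etl_store = etl_ingest(RAW, wanted={"status"})

# A later query for "bytes" finds nothing in the ETL store...
late_etl = [row.get("bytes") for row in etl_store]

# ...but still succeeds when the raw records were kept.
late_raw = extract_at_search_time(RAW, "bytes")
total_bytes = sum(int(v) for v in late_raw)
```

The trade-off shown is the one noted above: the retained-raw path scans each record at query time, which may become expensive as the data set grows, while the ETL path is cheap to query but cannot answer questions about fields that were not designated in advance.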