Information systems generate vast amounts and wide varieties of machine data (e.g., activity logs, configuration files, messages, database records). This data can be useful in troubleshooting systems, detecting operational trends, catching security problems, and measuring business performance. However, the challenge lies in organizing, searching, and reporting on the data in a manner that allows a person to understand and use it.
A conventional method for searching machine data involves storing the data in a database and then executing a search on the database. For example, existing large-scale search engines like those by Google and Yahoo are designed to crawl the Internet in order to build a repository of hyperlinks. Once this information has been stored, it can be searched by a remote user. This process of building a repository can take hours or even days to complete, depending on the size of the data set.
While conventional database-oriented searches are appropriate for some situations, they are ill-suited for handling real-time searches. Real-time searches find information as soon as it is produced. With real-time searches, it is preferable to reduce the delay between the collection of data and the searching of the data. In conventional search systems, this delay is unavoidable and may be caused by a number of factors. For example, it is generally not efficient to continuously write data to a database as it is being collected. Thus, some conventional search systems wait until a sufficient amount of data has been collected before accessing the database to store the data so that it is searchable. Such a delay may not seem significant, but for extremely time-sensitive applications, even a 30-second delay can be important. For example, an IT administrator may want to understand patterns of machine data behavior from network devices in order to identify potential security threats. Time is of the essence when responding to security threats. Even a short delay in the processing of the machine data may result in vital information being compromised before the administrator can halt the attack.
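The batching delay described above can be illustrated with a minimal Python sketch. It is not taken from any particular search system: the `BatchingWriter` class, the `events` table, and the batch size of 3 are hypothetical names and values chosen for illustration, and an in-memory SQLite database stands in for the conventional database. The point is only that records collected into a buffer are invisible to a search until the batch is written.

```python
import sqlite3


class BatchingWriter:
    """Hypothetical writer that stores records only once a full batch is collected."""

    def __init__(self, conn, batch_size):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []

    def collect(self, timestamp, message):
        # Newly collected data sits in memory and is not yet searchable.
        self.buffer.append((timestamp, message))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Only at this point does the data become visible to database searches.
        self.conn.executemany(
            "INSERT INTO events (ts, message) VALUES (?, ?)", self.buffer
        )
        self.conn.commit()
        self.buffer.clear()


def searchable_count(conn):
    """Number of records a database search could currently find."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


batch_conn = sqlite3.connect(":memory:")
batch_conn.execute("CREATE TABLE events (ts REAL, message TEXT)")
writer = BatchingWriter(batch_conn, batch_size=3)

writer.collect(0.0, "login failed")
writer.collect(1.0, "login failed")
count_before_flush = searchable_count(batch_conn)  # two records collected, none searchable

writer.collect(2.0, "login ok")  # third record fills the batch and triggers the write
count_after_flush = searchable_count(batch_conn)
```

In this sketch the first two records remain unsearchable until the third arrives, so the time between collection and searchability depends on how quickly the batch fills rather than on when the data was produced.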
Further, conventional search systems are inefficient at handling real-time searches. Suppose a user wants to generate a continuous report of machine data as it is being collected. With the conventional approach, a system would have to periodically (e.g., every few seconds) search the database for new machine data. However, modern databases can be multiple terabytes in size. Periodically searching such a huge database may consume a non-trivial amount of processing power and other resources that could be put to better use elsewhere.
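The periodic-search approach described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not a description of any specific system: the `events` table, the `poll_new_events` function, and the use of an auto-incrementing `id` to detect new rows are all hypothetical, and an in-memory SQLite database stands in for a much larger store. The fixed three-iteration loop replaces what would, in practice, be an endless loop with a sleep between polls.

```python
import sqlite3


def poll_new_events(conn, last_seen_id):
    """Search the database for rows added since the previous poll (hypothetical helper)."""
    rows = conn.execute(
        "SELECT id, message FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    if rows:
        last_seen_id = rows[-1][0]
    return rows, last_seen_id


poll_conn = sqlite3.connect(":memory:")
poll_conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, message TEXT)")
poll_conn.executemany(
    "INSERT INTO events (message) VALUES (?)",
    [("disk full",), ("disk ok",)],
)

last_seen = 0
queries_issued = 0
report = []

# Stand-in for "every few seconds, forever": three polling rounds.
for _ in range(3):
    rows, last_seen = poll_new_events(poll_conn, last_seen)
    queries_issued += 1  # a query is issued even when nothing new has arrived
    report.extend(message for _, message in rows)
```

Here only the first poll returns data, yet every round still issues a query against the database. With a database of realistic size and many concurrent reports, that repeated work is the cost the paragraph above refers to.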