Field of the Invention
This invention relates generally to information organization, search, and retrieval, and more particularly to time series data organization, search, and retrieval.
Description of the Related Art
Time series data are sequences of time-stamped records occurring in one or more usually continuous streams, representing some type of activity made up of discrete events. Examples include information processing logs, market transactions, and sensor data from real-time monitors (supply chains, military operation networks, or security systems). The ability to index, search, and present relevant search results is important to understanding and working with systems emitting large quantities of time series data.
Existing large scale search engines (e.g., Google and Yahoo web search) are designed to address the needs of less time sensitive types of data and are built on the assumption that only one state of the data needs to be stored in the index repository, for example, URLs in a Web search index, records in a customer database, or documents as part of a file system. Searches for information generally retrieve only a single copy of information based on keyword search terms: a collection of URLs from a Website indexed a few days ago, customer records from close of business yesterday, or a specific version of a document.
In contrast, consider an example of time series data from a typical information processing environment, shown in FIG. 1. Firewalls, routers, web servers, application servers, and databases constantly generate streams of data in the form of events occurring perhaps hundreds or thousands of times per second. Here, historical data values and the patterns of data behavior over time are generally as important as current data values. Existing search solutions generally have little notion of time-based indexing, searching, or relevancy in the presentation of results and do not meet the needs of time series data.
Compared to full text search engines, which organize their indices so that retrieving documents with the highest relevance scores is most efficient, an engine for searching time series data preferably would organize the index so that access to various time ranges, including less recent time ranges, is efficient. For example, unlike for many modern search engines, there may be significantly less benefit for a time series search engine to cache the top 1000 results for a particular keyword.
On the other hand, given the repetitive nature of time series data, opportunities for efficiency of index construction and search optimization are available. However, indexing time series data is further complicated because the data can be collected from multiple, different sources asynchronously and out of order. Data from one source may be seconds old, while data from another source may be interleaved with other sources or may be days, weeks, or months older than data from other sources. Moreover, data source times may not be in sync with each other, requiring adjustments in time offsets post-indexing. Furthermore, time stamps can have an almost unlimited number of formats, making identification and interpretation difficult. Time stamps within the data can be hard to locate, with no standard for location, format, or temporal granularity (e.g., day, hour, minute, second, sub-second).
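By way of illustration only, the difficulty of locating and interpreting time stamps can be sketched as follows. This is a minimal sketch, not a method described in this document: the `CANDIDATE_FORMATS` table and the `extract_timestamp` function are hypothetical names, only three formats are covered, and time zone offsets are ignored for simplicity.

```python
import re
from datetime import datetime, timezone

# Hypothetical table of (locator pattern, strptime formats) pairs.
# Real data sources would require many more entries than these three.
CANDIDATE_FORMATS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"),
     ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"]),
    (re.compile(r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}"),
     ["%d/%b/%Y:%H:%M:%S"]),
    (re.compile(r"\b\d{10}\b"), None),  # assumed to be epoch seconds (UTC)
]


def extract_timestamp(line):
    """Return the first recognizable time stamp in `line`, or None.

    Simplifying assumption: any time zone offset in the text is ignored
    and a naive datetime is returned.
    """
    for pattern, formats in CANDIDATE_FORMATS:
        match = pattern.search(line)
        if match is None:
            continue
        text = match.group(0)
        if formats is None:
            utc = datetime.fromtimestamp(int(text), tz=timezone.utc)
            return utc.replace(tzinfo=None)
        for fmt in formats:
            try:
                return datetime.strptime(text, fmt)
            except ValueError:
                continue  # located a candidate, but not in this format
    return None


if __name__ == "__main__":
    samples = [
        "2005-05-10 10:15:00 fw1 DROP tcp",
        '10.0.0.1 - - [10/May/2005:10:30:12 -0700] "GET /"',
        "1115719200 sensor=7 reading=41",
        "no time stamp here",
    ]
    for sample in samples:
        print(extract_timestamp(sample), "<-", sample)
```

Even in this small sketch, each source needs its own locator and format, and a line with no recognizable time stamp must still be handled.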
Searching time series data typically involves the ability to restrict search results efficiently to specified time windows and other time-based metadata such as frequency, distribution of inter-arrival time, and total number of occurrences or class of result. Keyword-based searching is generally secondary in importance but can be powerful when combined with time-based search mechanisms. Searching time series data requires a whole new way to express searches. Search engines today allow users to search by the most frequently occurring terms or keywords within the data and generally have little notion of time-based searching. Given the large volume and repetitive characteristics of time series data, users often need to start by narrowing the set of potential search results using time-based search mechanisms and then, through examination of the results, choose one or more keywords to add to their search parameters. Timeframes and time-based metadata like frequency, distribution, and likelihood of occurrence are especially important when searching time series data, but difficult to achieve with current search engine approaches. Try to find, for example, all stories referring to the “Space Shuttle” between the hours of 10 AM and 11 AM on May 10, 2005, or the average number of “Space Shuttle” stories per hour the same day, with a Web-based search engine of news sites. With a focus on when data happens, time-based search mechanisms and queries can be useful for searching time series data.
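For illustration, the two example queries above (stories within a one-hour window, and the average number of stories per hour over a day) can be sketched as follows. This is a minimal sketch under stated assumptions: events are held in memory as (time stamp, text) pairs, the `search_window` and `average_per_hour` functions are hypothetical names, and the sample stories are invented for the example.

```python
from datetime import datetime


def search_window(events, start, end, keyword=None):
    """Return events with start <= time stamp < end.

    If `keyword` is given, keep only events whose text contains it
    (case-insensitive substring match, a simplifying assumption).
    """
    hits = []
    for ts, text in events:
        if not (start <= ts < end):
            continue
        if keyword is not None and keyword.lower() not in text.lower():
            continue
        hits.append((ts, text))
    return hits


def average_per_hour(events, start, end, keyword=None):
    """Average number of matching events per hour over [start, end)."""
    hours = (end - start).total_seconds() / 3600.0
    return len(search_window(events, start, end, keyword)) / hours


# Hypothetical sample data, invented for this sketch.
EVENTS = [
    (datetime(2005, 5, 10, 9, 55), "Weather delays launch"),
    (datetime(2005, 5, 10, 10, 5), "Space Shuttle rollout begins"),
    (datetime(2005, 5, 10, 10, 40), "space shuttle crew briefing"),
    (datetime(2005, 5, 10, 11, 0), "Space Shuttle fueling test"),
    (datetime(2005, 5, 10, 14, 20), "Markets close higher"),
]

if __name__ == "__main__":
    ten = datetime(2005, 5, 10, 10)
    eleven = datetime(2005, 5, 10, 11)
    for ts, text in search_window(EVENTS, ten, eleven, "Space Shuttle"):
        print(ts, text)
    day_start = datetime(2005, 5, 10)
    day_end = datetime(2005, 5, 11)
    print(average_per_hour(EVENTS, day_start, day_end, "Space Shuttle"))
```

The window is half-open, so a story stamped exactly 11 AM falls outside the 10 AM to 11 AM query. The sketch scans every event, which is the cost a time-organized index would be expected to avoid.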
Some limited applications of time-based search exist in specific small-scale domains. For example, email search is available today in many mainstream email programs and web-based email services. However, searches are limited to simple time functions like before, after, or time ranges; the data sets are generally small scale and highly structured from a single domain; and the real-time indexing mechanisms are append only, usually requiring the rebuilding of the entire index to interleave new data.
Also unique to the cyclicality of time series data is the challenge of presenting useful results. Traditional search engines typically present results ranked by popularity and commonality. Contrary to this, for time series data, the ability to focus on data patterns and infrequently occurring or unusual results may be important. To be useful, time series search results preferably would have the ability to be organized and presented by time-based patterns and behaviors. Users need the ability to see results at multiple levels of granularity (e.g., seconds, minutes, hours, days) and distribution (e.g., unexpected or least frequently occurring) and to view summary information reflecting patterns and behaviors across the result set. Existing search engines, on the other hand, generally return text results sorted by keyword density, usage statistics, or links to or from documents and Web pages in attempts to display the most popular results first.
In one class of time series search engine, it would be desirable for the engine to index and allow for the searching of data in real-time. Any delay between the time data is collected and the time it is available to be searched is to be minimized. Enabling real-time operation against large, frequently changing data sets can be difficult with traditional large-scale search engines that optimize for small search response times at the expense of rapid data availability. For example, Web and document search engines typically start with a seed and crawl to collect data until a certain amount of time elapses or a collection size is reached. A snapshot of the collection is saved and an index is built, optimized, and stored. Frequently accessed indices are then loaded into a caching mechanism to optimize search response time. This process can take hours or even days to complete depending on the size of the data set and density of the index. Contrast this with a real-time time series indexing mechanism designed to minimize the time between when data is collected and when the data is available to be searched. The ability to insert, delete and reorganize indices, on the fly as data is collected, without rebuilding the index structure is essential to indexing time series data and providing real-time search results for this class of time series search engines.
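The contrast with a snapshot-and-rebuild index can be illustrated with a small sketch. This is not the indexing mechanism of any particular engine: the `TimeBucketIndex` class, its method names, and the one-hour bucket size are assumptions made for the example. Events are placed into time buckets as they arrive, so late or out-of-order data is interleaved without rebuilding the whole structure, and old data can be dropped a bucket at a time.

```python
import bisect
from collections import defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


class TimeBucketIndex:
    """Hypothetical in-memory index keyed by fixed-width time buckets."""

    def __init__(self, bucket_seconds=3600):
        self.bucket_seconds = bucket_seconds
        # bucket id -> list of (time stamp, text), kept sorted by time stamp
        self.buckets = defaultdict(list)

    def _bucket_id(self, ts):
        return int((ts - EPOCH).total_seconds()) // self.bucket_seconds

    def insert(self, ts, text):
        """Insert one event; only its own bucket is touched."""
        bisect.insort(self.buckets[self._bucket_id(ts)], (ts, text))

    def delete_before(self, cutoff):
        """Drop every bucket that lies entirely before the cutoff's bucket."""
        limit = self._bucket_id(cutoff)
        stale = [b for b in self.buckets if b < limit]
        removed = sum(len(self.buckets.pop(b)) for b in stale)
        return removed

    def search(self, start, end):
        """Return events with start <= time stamp < end, in time order."""
        results = []
        for b in range(self._bucket_id(start), self._bucket_id(end) + 1):
            for ts, text in self.buckets.get(b, []):
                if start <= ts < end:
                    results.append((ts, text))
        return results


if __name__ == "__main__":
    index = TimeBucketIndex()
    # Events arrive out of order; each is searchable immediately after insert.
    index.insert(datetime(2005, 5, 10, 10, 40), "router: link up")
    index.insert(datetime(2005, 5, 10, 10, 5), "firewall: DROP tcp")
    index.insert(datetime(2005, 5, 9, 23, 59), "db: checkpoint complete")
    print(index.search(datetime(2005, 5, 10, 10), datetime(2005, 5, 10, 11)))
    print(index.delete_before(datetime(2005, 5, 10)))
```

A search visits only the buckets that overlap the requested time range, which is one simple way to make less recent ranges as cheap to reach as recent ones.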
Other software that is focused on time series, e.g., log event analyzers such as Sawmill or Google's Sawzall, can provide real-time analysis capabilities but are not search engines per se because they do not provide for ad hoc searches. Reports must be defined and built in advance of any analysis. Additionally, no general keyword-based or time-based search mechanisms are available. Other streaming data research projects (including the Stanford Streams project and products from companies like StreamBase Systems) can also produce analysis and alerting of streaming data but do not provide any persistence of data, indexing, or time-based or keyword-based searching.
There exists, therefore, a need to develop other techniques for indexing, searching, and presenting search results from time series data.