Experts tasked with analyzing large pools of data on a routine basis often find themselves overwhelmed by the volume, format, sources, and content of the data they must process. In a typical scenario, analysts identify events and incidents through manual analysis and piecemeal processing of data from tools such as search engines or medium-specific applications. This process is slow at best, and can miss many relevant data points simply due to the overwhelming volume of data available for a given event or trend. Two problems in particular commonly plague the analysis process: data heterogeneity and data overload.
The problem of data heterogeneity arises when data from multiple sources must be analyzed, each source utilizing a different data format or organizational scheme. For example, in the world of law enforcement, information analysts receive data from a large variety of sources. The data may be structured or unstructured and may be in a wide variety of file formats, such as documents, web pages, databases, data feeds, police reports, etc. Although individual law enforcement data centers may be able to process data from individual sources, it may be difficult to draw useful inferences or conclusions across multiple heterogeneous sources. For example, search tools may need to be separately configured for each individual data source in order to cull only the relevant materials from the data source.
In response to the problem of data heterogeneity, analysts have traditionally employed what is broadly known as a “collect and search” process. The collect and search process attempts to avoid the need to tailor individual search tools for each individual data source by simply collecting all available data from all data sources, whether relevant or not, and then conducting searches across the collected data to determine relevancy after the fact.
However, the simple collect and search process often results in the second common problem, i.e., data overload. First, the failure to make relevancy determinations during the collection process (a necessary byproduct of the decision not to employ data source-specific search tools during the collection process) often results in enormous amounts of data that must be stored, placing burdens on system memory. Second, the system must also search across the entire collection of stored data, most of which may not be relevant to the topic of the search. Not only does the overinclusiveness of the stored data place large burdens on search system performance, but it may result in large numbers of irrelevant data items being included in search results, placing a burden on human analysts to try to separate useful search results from statistical noise.
One variation on the traditional collect and search process is to index the data as it is being collected. Such indexing allows analysts to search the index to locate relevant information rather than searching the entire raw data set. However, this approach becomes problematic when applied to large volumes of data. Creating an index both increases latency and requires considerable disk space to store the index. Furthermore, the effectiveness of the search is limited to the content or keywords appearing in the index, which necessarily excludes large portions of the content. Finally, because indexing may capture non-relevant data items just as easily as raw searching over non-indexed data, indexing may do little to reduce the number of non-relevant search results, thus failing to relieve the burden placed on human analysts.
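By way of illustration only, the index-while-collecting variation described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the class name `CollectAndIndexStore`, the `tokenize` helper, the source labels, and the sample texts are all hypothetical and chosen purely for demonstration. The sketch stores every collected item without any relevancy determination and builds a keyword index as items arrive.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CollectAndIndexStore:
    """Hypothetical store that indexes every item at collection time."""

    def __init__(self):
        self.items = []                # every collected item, relevant or not
        self.index = defaultdict(set)  # keyword -> ids of items containing it

    def collect(self, source, text):
        """Store the item and index its keywords; no relevancy filtering."""
        item_id = len(self.items)
        self.items.append((source, text))
        for word in set(tokenize(text)):
            self.index[word].add(item_id)
        return item_id

    def search(self, query):
        """Return the items that contain every keyword in the query."""
        words = tokenize(query)
        if not words:
            return []
        # .get() avoids creating empty index entries for unseen keywords.
        matches = set.intersection(*(self.index.get(w, set()) for w in words))
        return [self.items[i] for i in sorted(matches)]


# Illustrative data only; the sources and texts are invented for this sketch.
store = CollectAndIndexStore()
store.collect("police_report", "Vehicle theft reported near the harbor warehouse")
store.collect("news_feed", "Harbor festival draws record crowds this weekend")
store.collect("web_page", "Warehouse clearance sale: furniture and appliances")

results = store.search("harbor")
for source, text in results:
    print(source, "|", text)
```

In this sketch, a search for "harbor" returns both the police report and the unrelated festival item, because the index records keyword occurrence rather than relevance. The index also grows with every collected item, which reflects the storage cost noted above.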
There is therefore a need for methods and systems for searching large volumes of data in near real-time that overcome the foregoing problems, among others.