The present invention relates generally to computer-based data analysis. In particular, the present invention relates to computer systems and methods for investigating and analyzing large amounts of data such as, for example, bank transaction logs, call data records (CDRs), computer network access logs, e-mail messages of a corporation, or other potentially high-volume data that may contain billions or even trillions of records.
Today, corporations, businesses, governmental agencies, and other organizations collect huge amounts of data, covering everything from e-mail messages and fine-grained web traffic logs to blogs, forums, and wikis. At the same time, organizations have discovered the risks associated with the constantly evolving cyber security threat. These risks take many forms, including exfiltration, cyber fraud, money laundering, and damage to reputations. In an attempt to reduce these risks, organizations have invested in custom information technology projects costing hundreds of millions of dollars to manage and analyze collected data. These projects typically involve the creation of a data warehouse system for aggregating and analyzing the data.
Data warehousing systems have existed for a number of years, but current systems are ill-suited to today's investigation challenges for several reasons. These include:
1. Scale: inability to accommodate petabyte-scale data sets that include billions or trillions of data records.
2. High-latency searches: search results for investigative queries take hours or days to return, when they should be returned in a matter of seconds.
3. Data silo-ing: lack of consolidation of an organization's relevant data. Data collected by the organization is instead distributed across multiple disparate database systems that cannot interoperate with one another. An investigative search for information requires submitting a sub-search to each of the separate systems and aggregating the search results, which may be in different data formats, and this in turn requires the development of time-consuming and expensive custom information technology components.
4. Loss of original data: data cannot be accessed in its original form; instead, transformed versions of the data are presented during analysis, potentially causing loss of valuable context.
The present invention attempts to address these problems and others by facilitating low-latency searches of very large and possibly dynamic data sets, in which search results present matching data in its original form.