Many enterprises are challenged by a shift toward sophisticated and evolving cyber security threats. Attackers increasingly apply stealthy techniques to hide their presence or, at least, to reduce the probability of being detected, e.g., by concealing their attack steps across multiple machines, exploiting different application protocols, or spreading their activities over long periods of time. Many of these threats are referred to as advanced persistent threats (APTs).
Detecting and investigating such complex attack patterns requires the collection, storage, and analysis of events from a variety of vantage points, different data sources, and multiple abstraction layers. The monitoring data, often exported at rates of many thousands of events per second, needs to be collected, stored, and made available for real-time and historical analysis. Given this load, the variety of relevant data types, and the varying collection delays, cyber security threat investigation has turned into a significant data problem. Many collected events only become meaningful when they are put into context across different data sources over potentially large time windows (such as weeks or months). Only then do they form a big picture of ongoing and past activities in the network and allow false alarms, or anomalies having little or no impact, to be filtered out.
Timely responses to such security incidents require near real-time analysis of the data, while investigations require access to historical data spanning large time windows. Existing solutions, however, either process data in real time with a relatively small time window, or support only historical data and require sequential access to the stored data. Input/Output (IO) limits then become the dominating factor, and existing solutions work around them by distributing the IO across large clusters of machines, at an increasing cost for setting up the cluster and recombining the results.
A need exists for improved techniques for obtaining and processing raw data. A further need exists for a data processing system that permits (i) substantially real-time analysis of the data to provide a timely response to an incident; and (ii) access to historical data spanning large time windows to permit investigations.