Networked systems continue to grow in size and complexity as the number of networked devices and users of these devices increases. Communications over these networked systems generate massive amounts of data that may be useful for a variety of different purposes. Use of such data, however, requires the ability to store the data, and, in some cases, the data must be stored for extended periods of time. But, for many enterprises, it is not feasible to store the increasingly large amounts of data, as storage of such data may require tens, if not hundreds, of servers or more. Additionally, as the amount of data increases, the difficulty in effectively using the data for a desired purpose may also increase.
One such data source is proxy server data. While proxy server data may be beneficial for an enterprise to capture and evaluate, capturing it can also produce extremely large amounts of data requiring very large storage repositories. The proxy server data may generally include a log of any traffic into and out of a dedicated network. Such proxy server data may be useful for evaluating overall network traffic, as well as identifying specific user communications and potentially harmful traffic on the dedicated network.
In some cases, proxy server log data for a single day may include over 725 million log lines. Storing such data for one month could require over 38 terabytes of storage space, which may require the capacity of almost 6.5 servers, each configured with 6 terabytes of storage space. Storing that much data for an entire year would require almost 80 servers. And those servers would hold only proxy server log data. Thus, for many enterprises it is not practicable to store such historical records of proxy server log data, despite the fact that this data may include valuable information.
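The storage figures above can be checked with a short back-of-the-envelope calculation. The following is a minimal sketch, not part of any described system. It takes the 725 million lines per day, 38 terabytes per month, and 6 terabytes per server from the text as lower bounds, and it assumes a 30-day month and decimal terabytes (10^12 bytes) in order to estimate an average log line size. All names are hypothetical.

```python
import math

# Figures taken from the text (each stated there as "over" this amount).
LINES_PER_DAY = 725_000_000
TB_PER_MONTH = 38
TB_PER_SERVER = 6

# Assumptions made only for this illustration.
DAYS_PER_MONTH = 30
BYTES_PER_TB = 10**12  # decimal terabytes


def servers_needed(total_tb, tb_per_server=TB_PER_SERVER):
    """Return the (fractional) number of servers needed to hold total_tb."""
    return total_tb / tb_per_server


# One month of proxy server log data.
monthly_servers = servers_needed(TB_PER_MONTH)

# One year, taken as twelve such months.
yearly_tb = TB_PER_MONTH * 12
yearly_servers = servers_needed(yearly_tb)

# Implied average size of a single log line under the assumptions above.
bytes_per_line = (TB_PER_MONTH * BYTES_PER_TB) / (LINES_PER_DAY * DAYS_PER_MONTH)

print(f"servers for one month: {monthly_servers:.2f} "
      f"(whole servers: {math.ceil(monthly_servers)})")
print(f"storage for one year:  {yearly_tb} TB")
print(f"servers for one year:  {yearly_servers:.0f}")
print(f"average bytes per log line: {bytes_per_line:.0f}")
```

Because the source figures are stated as minimums, the results are lower bounds as well, which is consistent with the approximate server counts given above.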
Moreover, for many enterprises desiring to store historical data records of such magnitude, searching the raw proxy server log records for useful information could take an impractically long time (e.g., upwards of one week in some instances) to return a beneficial result. For cybersecurity applications that require real-time results to minimize or prevent the effects of an attack, for example, such a delay in searching the log data is unacceptable.
The above challenges are not limited to proxy server data. Similar challenges arise with respect to maintaining historical records of firewall data and other network data including e-mail metadata, as well as general information system data such as Active Directory™ data. Any system for storing or maintaining increasingly large amounts of historical data must confront the challenges and costs associated with maintaining such data.
Thus, there is a need for new systems and methods to address the specific storage and analytical requirements for effectively managing large amounts of data, including proxy server data. Additionally, new systems are needed for improving the usefulness of large data records.