The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
“Big data” refers to any collection of data sets so large and complex that they become difficult to process using traditional data processing applications. Such data is often difficult to work with using traditional relational database management systems, desktop statistics, and/or visualization packages. Instead, massively parallel software running on tens, hundreds, or even thousands of servers is often required.
For example, network flows (aka “netflows”, “s-flows”) have been a prevalent accounting record of network traffic for over a decade now. A netflow record provides information about a communication on a network via the following fields: source Internet Protocol (IP) address, destination IP address, protocol, start time, number of packets, and byte count. Netflows have historically been used in the enterprise for network capacity planning and application performance troubleshooting. Over time, they have also been recognized as a reasonable means of identifying information security threats and attacks. However, as networks grow larger and the number of IP-addressable devices increases, the volume of enterprise or inter-enterprise data has become so large that it is impractical to analyze with traditional data systems/tools. To give an idea of the scale of possible netflow records in an enterprise, it is not uncommon for a Fortune 100 enterprise to generate over 3 billion netflow records per day at the Internet Service Provider (ISP) layer of the enterprise's network, and the number grows dramatically larger if local area network (intranet) data are included. The problem is further complicated by the fact that netflows (and other big data applications) typically have a skewed distribution of values (across the IP addresses), due to the nature of the records.
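By way of illustration only, the netflow fields listed above and the notion of a skewed distribution across IP addresses can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular netflow format or tool: the `FlowRecord` type, its field names, the `top_source_share` helper, and the sample values are all hypothetical and chosen for readability.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical in-memory form of one netflow record."""
    src_ip: str
    dst_ip: str
    protocol: str
    start_time: float  # assumed representation: seconds since the epoch
    packets: int
    byte_count: int


def top_source_share(records, k=1):
    """Return the fraction of all records contributed by the k busiest source IPs."""
    counts = Counter(r.src_ip for r in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Made-up sample in which a single source IP accounts for 8 of 10 records.
sample = [FlowRecord("10.0.0.1", "192.0.2.7", "TCP", 0.0, 10, 1500) for _ in range(8)]
sample += [
    FlowRecord("10.0.0.2", "192.0.2.7", "UDP", 1.0, 1, 80),
    FlowRecord("10.0.0.3", "192.0.2.8", "TCP", 2.0, 2, 120),
]

share = top_source_share(sample)
```

A share close to 1 for a small `k` indicates that a handful of addresses dominate the record set, which is the kind of skew that can make evenly partitioning such data across many servers difficult.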