1. Field
Embodiments of the invention generally relate to network traffic analysis. More particularly, examples of the invention are directed to methods, systems, and/or computer programs for capturing and analyzing network flow data.
2. Description of the Related Art
Network traffic usage data is of interest to network administrators for a number of reasons, including analyzing the impact of a new application on the network, troubleshooting network pain points, detecting heavy users of bandwidth, and securing networks. Network usage data does not include the actual information exchanged in a communications session between parties, but rather includes numerous usage detail records, known as “flow records,” each containing one or more types of metadata. The primary protocol associated with traffic flow data is NetFlow, which was developed by Cisco Systems®. There are also several other flow protocols, such as sFlow, IPFIX, J-Flow, NetStream, and Cflowd. All of these protocols support flows that are similar to NetFlow flows and contain similar types of information, such as source Internet Protocol (IP) address, destination IP address, source port, destination port, IP protocol, ingress interface, IP Type of Service, start and finish times, number of bytes, and next hop.
In general, a flow record provides detailed usage information about a particular event or communications connection between parties, such as the connection start time and stop time, source (or originator) of the data being transported, the destination or receiver of the data, and the amount of data transferred. A flow record can summarize usage information for very short periods of time (from milliseconds to seconds, occasionally minutes). Depending on the type of service and network involved, a flow record may also include information about the transfer protocol, the type of data transferred, the type of service (ToS) provided, etc.
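By way of illustration only, a flow record carrying the kinds of fields listed above might be modeled as in the following minimal Python sketch. The class name FlowRecord, the field names, and the sample values are hypothetical; they are chosen for readability and do not reflect the wire format of NetFlow or any other flow protocol.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical in-memory view of one flow record (metadata only, no payload)."""
    src_ip: str             # source IP address
    dst_ip: str             # destination IP address
    src_port: int           # source port
    dst_port: int           # destination port
    ip_protocol: int        # IP protocol number (6 is TCP)
    ingress_interface: int  # index of the interface the flow arrived on
    tos: int                # IP Type of Service value
    start: datetime         # flow start time
    finish: datetime        # flow finish time
    byte_count: int         # number of bytes transferred
    next_hop: str           # next-hop IP address

    @property
    def duration(self) -> timedelta:
        """Length of time covered by the record."""
        return self.finish - self.start


# Illustrative values only (addresses are from documentation ranges).
record = FlowRecord(
    src_ip="192.0.2.10",
    dst_ip="198.51.100.7",
    src_port=51234,
    dst_port=443,
    ip_protocol=6,
    ingress_interface=3,
    tos=0,
    start=datetime(2024, 1, 1, 12, 0, 0),
    finish=datetime(2024, 1, 1, 12, 0, 0, 250_000),
    byte_count=18_432,
    next_hop="203.0.113.1",
)
```

In this sketch the record holds only metadata about the connection, consistent with the description above, and its duration property yields the interval between the start and finish times (a quarter of a second for the sample values).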
As networks become larger and more complex, systems that analyze and report on traffic flow data must become more efficient at handling the increasing amount of information generated about network traffic. Aggregating data from many network devices can result in datasets that contain billions of entries, or flows. Such a large number of entries can create a bottleneck in the system because writing to storage can be time consuming. Additionally, running reporting queries on such a large dataset can be taxing on the storage system or database. Traditional approaches to this data overflow problem have been either to increase the quantity or quality of the hardware that hosts the storage system or to randomly drop whatever information cannot be handled.