This invention relates to automatic profiling of network event sequences.
The behavior of network objects, such as traffic flows, sessions, hosts, and end users, can often be described by sequences of communication events in the time domain. Understanding this behavior is essential in many applications of network measurement and monitoring. Here, an event sequence refers to a series of events in which i) all events are affiliated with the same entity, ii) each event is identified by a symbol, and iii) the symbols take a limited number of discrete values. While event sequences can be found in a variety of networking scenarios, three example cases include:
TCP SYN/FIN/RST sequences: The TCP protocol signals the start and the end of a TCP connection with packets that are distinguished by flags in the header. The first packet has a SYN flag set; the last usually has the FIN flag set. A TCP connection can also be terminated by a packet with the RST flag set in the header. The arrival of SYN, FIN and RST in a TCP connection forms an event sequence.
SIP-based VoIP call sessions: SIP (Session Initiation Protocol) is the de facto signaling protocol for VoIP services. Because VoIP relies on SIP to set up and tear down call sessions, each session contains SIP control messages such as INVITE, ACK and BYE. The control messages in each session form an event sequence.
Wi-Fi user sessions: In Wi-Fi wireless networks, a user needs to establish a wireless connection with the nearby access point (AP) to access the Internet. In this case, a user session can be defined as the duration between the user joining and leaving the wireless network. A user session is established, secured and terminated through control message exchange between the user and AP. All such messages within a user session become an event sequence.
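The three-part definition above (same entity, one symbol per event, a small discrete alphabet) can be illustrated with a minimal sketch for the TCP case. The record layout, the `Event` type, the `build_sequences` helper, and the connection keys are all hypothetical names chosen for illustration; they are not taken from the described system.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    entity: str       # hypothetical connection key, e.g. a 5-tuple rendered as text
    timestamp: float  # arrival time in seconds
    symbol: str       # the TCP flag that identifies the event


# The limited set of discrete values a symbol may take in the TCP example.
ALPHABET = {"SYN", "FIN", "RST"}


def build_sequences(events):
    """Group events by entity and order each group by arrival time."""
    sequences = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.symbol not in ALPHABET:
            continue  # ignore packets that carry none of the tracked flags
        sequences[ev.entity].append((ev.timestamp, ev.symbol))
    return dict(sequences)


# Made-up sample records for two connections, deliberately out of order.
events = [
    Event("10.0.0.1:5000->10.0.0.2:80", 0.00, "SYN"),
    Event("10.0.0.3:5001->10.0.0.2:80", 0.10, "SYN"),
    Event("10.0.0.1:5000->10.0.0.2:80", 1.25, "FIN"),
    Event("10.0.0.3:5001->10.0.0.2:80", 0.40, "RST"),
    Event("10.0.0.1:5000->10.0.0.2:80", 0.50, "ACK"),  # not in the alphabet
]

sequences = build_sequences(events)
for entity, seq in sequences.items():
    print(entity, [symbol for _, symbol in seq])
```

The same grouping would apply to the SIP and Wi-Fi cases by swapping the alphabet and the entity key (a call session or a user session, respectively).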
A common problem in network measurement and monitoring is that operators seek simple and effective ways to understand the diverse behavior hidden in a large number of event sequences, including both the bulk of normal behavior and, potentially, the behavior of a small proportion of anomalies.
The system provides complete profiling of a massive number of event sequences. Accurate yet information-compact profiling of such event sequences is critical for many network measurement and monitoring tasks. The system can handle the multi-dimensional behavior exhibited by event sequences. That is, the sequence behavior cannot be fully described by a single variable or distribution. Instead, it possesses at least two types of important properties: the sequential patterns formed by the symbols, and the durations between events. In practice, even one type of property can be difficult to describe precisely. For example, although the aforementioned TCP SYN/FIN/RST sequence has only three discrete symbols, the system can handle a large number of patterns, with the longest pattern having in excess of 100 symbols (based on a trace collected at a gateway router). For duration-related applications, the system can provide precise profiling. Taking the VoIP sequence in FIG. 1 as an example, the duration between ACK and BYE is the actual call duration. This duration is heavy-tailed and ranges from 0 seconds to more than 2 hours. When both sequential patterns and durations are considered, the complexity is conceivably much higher.
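As a rough illustration of the two property types named above, the following sketch extracts a sequential pattern and the inter-event durations from each sequence. It is only an assumption-laden example, not the profiling method of the described system: the helpers `pattern_of` and `gaps_of`, the session identifiers, and the timestamps are hypothetical.

```python
from collections import Counter


def pattern_of(sequence):
    """Sequential pattern: the ordered symbols of one sequence, joined by '-'."""
    return "-".join(symbol for _, symbol in sequence)


def gaps_of(sequence):
    """Durations between consecutive events as (from_symbol, to_symbol, seconds)."""
    return [
        (a_sym, b_sym, b_time - a_time)
        for (a_time, a_sym), (b_time, b_sym) in zip(sequence, sequence[1:])
    ]


# Made-up VoIP sessions; each event is a (timestamp_seconds, symbol) pair.
sessions = {
    "call-1": [(0.0, "INVITE"), (2.0, "ACK"), (185.0, "BYE")],
    "call-2": [(10.0, "INVITE"), (11.5, "ACK"), (11.5, "BYE")],
    "call-3": [(20.0, "INVITE"), (50.0, "CANCEL")],
}

# Property 1: how often each sequential pattern occurs.
pattern_counts = Counter(pattern_of(seq) for seq in sessions.values())

# Property 2: the ACK-to-BYE gap, which the text identifies as the call duration.
call_durations = [
    gap
    for seq in sessions.values()
    for a_sym, b_sym, gap in gaps_of(seq)
    if (a_sym, b_sym) == ("ACK", "BYE")
]

print(pattern_counts)
print(call_durations)
```

Even in this toy form, the two outputs are of different kinds (a discrete count table and a list of real-valued durations), which is why a single variable or distribution does not capture the behavior.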
There is a large body of prior work on modeling or mining sequence-like data in areas as diverse as speech recognition, bioinformatics, databases, data mining, and systems. In the speech recognition field, hidden Markov models (HMMs) have been applied to cluster strings of acoustic units. In the database and data mining fields, mining of sequential patterns has been studied intensively in past years. Much of this work focuses on discovering rules rather than on modeling. In one modeling example, Internet users' navigation records are represented by a mixture model consisting of first-order Markov chains. In the systems area, a relevant research direction is detecting anomalies (or intrusions) by mining sequential system states such as system calls.
In the networking area, there are few studies on the analysis and application of sequence-like data. Some recent efforts are devoted to inferring properties of traffic flows. Other recent work applies various data mining techniques to identify significant patterns or insignificant anomalies in traffic entities such as flows, Internet backbone traffic, and host communication patterns.