Gaining an in-depth understanding of the data profiles of Internet traffic is a challenging task for researchers, and is a requirement for most Internet Service Providers (ISPs). Applying Deep Packet Inspection (DPI) to Internet data traffic yields in-depth information about that traffic. This information is valuable for profiling networked applications, for instance by ISPs. With this information, ISPs may apply differentiated charging policies and traffic shaping, and may offer differentiated quality of service guarantees to selected users or applications.
Critical network services may rely on inspection of the payload of data packets. Since payload inspection is time-consuming, it may not be well suited for real-time data flows.
Looking at structured information found in packet headers provides a fast alternative to payload inspection, and may be well suited for certain use cases, for instance real-time data flows.
Clustering of data within machine learning may be considered to comprise two phases, one training phase and one testing phase.
FIG. 1 schematically presents a known training phase of a clustering method within machine learning. The training phase determines one or more clustering models based on known data traffic (e.g., labeled data traffic).
The input of the training phase of FIG. 1 is labeled data traffic 102, and the output of said training phase is clustering models 110. The labeled data traffic 102 typically comprises data traffic of known categories, such as peer-to-peer (P2P) and Voice over Internet Protocol (VoIP), to mention two examples only. At 104 descriptors of said data are calculated. Examples of descriptors of said data are average payload size of a data flow and a measure of the distribution of the payload size, such as the deviation of payload size.
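By way of illustration only, the calculation of descriptors at 104 may be sketched as follows. The sketch assumes that a data flow is available as a list of payload sizes in bytes and that the deviation of payload size is taken as the population standard deviation; the function name `flow_descriptors` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def flow_descriptors(payload_sizes):
    """Return (average payload size, deviation of payload size) for one flow.

    Assumption: the deviation is the population standard deviation.
    """
    if not payload_sizes:
        raise ValueError("a flow needs at least one packet")
    return mean(payload_sizes), pstdev(payload_sizes)


# Hypothetical payload sizes in bytes for two labeled flows.
voip_flow = [160, 160, 160, 160]
p2p_flow = [1400, 200, 1400, 600]

voip_descriptors = flow_descriptors(voip_flow)
p2p_descriptors = flow_descriptors(p2p_flow)
```

In this sketch a flow of constant packet size yields a deviation of zero, whereas a flow of mixed packet sizes yields a large deviation, which is the kind of difference the descriptors are intended to capture.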
From said data descriptors 106, a model creation is then performed at 108, whereby the clustering models 110 are obtained.
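A minimal sketch of the model creation at 108 is given below for illustration. It assumes, purely as an example, that each clustering model is the centroid of the descriptor vectors of one traffic category; known methods may create models differently. The name `create_models` and the training values are hypothetical.

```python
from collections import defaultdict


def create_models(labeled_descriptors):
    """Map each traffic category to the centroid of its descriptor vectors.

    Assumption: a centroid per category stands in for a clustering model.
    """
    grouped = defaultdict(list)
    for label, descriptor in labeled_descriptors:
        grouped[label].append(descriptor)

    models = {}
    for label, vectors in grouped.items():
        count = len(vectors)
        # zip(*vectors) iterates over the descriptor components column by column.
        models[label] = tuple(sum(column) / count for column in zip(*vectors))
    return models


# Hypothetical labeled descriptors: (category, (average size, deviation)).
training_data = [
    ("VoIP", (160.0, 0.0)),
    ("VoIP", (180.0, 10.0)),
    ("P2P", (900.0, 500.0)),
    ("P2P", (1100.0, 300.0)),
]
models = create_models(training_data)
```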
These models have thus been calculated to be able to identify the category of data that is input to the training phase of the data clustering. Subsequently, in the testing phase these models will be tested by using unlabeled data traffic.
FIG. 2 schematically presents a known testing phase of a clustering method within machine learning. Input to this testing phase is unlabeled data traffic 202, that is, data traffic of unknown categories. Output of this testing phase is the models that best fit the unlabeled data traffic. The best fitting models will provide a reliable description of the unlabeled data traffic.
At 204 data flows are identified and data descriptors 206 of said flows are calculated. Based on said data descriptors 206 and by using available models 208 as obtained from the training phase and loaded into the testing phase, each model is evaluated 210, whereby fitting models 212 may be obtained. Evaluation of each tested model may comprise determination of values of fitting parameters, as a measure of how well each tested model fits the unlabeled input data traffic.
For instance, the fitting parameters may comprise five fitting parameters, each determined as precisely as possible, together with a confidence interval.
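The evaluation at 210 may be illustrated by the following sketch. It assumes that each available model is a centroid in descriptor space and that a single fitting value, the mean Euclidean distance between the descriptors of the unlabeled flows and the centroid, is used, a lower value indicating a better fit. This single value is a simplification of the fitting parameters and confidence interval mentioned above; the name `evaluate_models` and all numerical values are hypothetical.

```python
import math


def evaluate_models(models, descriptors):
    """Return (model name, fit value) pairs sorted from best to worst fit.

    Assumption: the fit value is the mean Euclidean distance to the centroid.
    """
    if not descriptors:
        raise ValueError("at least one flow descriptor is required")
    scores = []
    for name, centroid in models.items():
        distances = [math.dist(descriptor, centroid) for descriptor in descriptors]
        scores.append((name, sum(distances) / len(distances)))
    return sorted(scores, key=lambda item: item[1])


# Hypothetical models loaded from a training phase.
available_models = {"VoIP": (170.0, 5.0), "P2P": (1000.0, 400.0)}

# Hypothetical descriptors of flows identified in unlabeled traffic.
unlabeled_descriptors = [(170.0, 5.0), (173.0, 9.0)]

ranking = evaluate_models(available_models, unlabeled_descriptors)
best_model, best_fit = ranking[0]
```

Here the first entry of `ranking` corresponds to the best fitting model for the unlabeled flows under the stated assumptions.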
In the field of data clustering or data classification, a large number of publications exist. Most publications relate to algorithms that are applied on a flow level, and relatively few relate to algorithms that are applied on a packet level.
Bar-Yanai, R., et al., “Real-time classification for encrypted traffic”, in SEA, 2010, pp. 373-385, presents a hybrid clustering method for applications clustered in overlapping clusters by using a k-means measure and a k-nearest neighbor measure.
State-of-the-art model creation methods that operate during data traffic clustering often rely on data clusters that are determined within said method.
Feature reduction algorithms are also known. These are, however, solely focused on gaining more information.
Although a number of methods have been published, they suffer from different limitations and/or drawbacks.
There is hence a need to overcome said limitations and/or drawbacks of known methods.