Malware routinely infects computer networks, and because the variability of malware samples has increased rapidly in recent years, existing signature-based security devices, firewalls, and anti-virus solutions provide only partial protection against these threats. The ability to detect new variants and modifications of existing malware is therefore becoming increasingly important. Machine learning is beginning to be applied successfully to complement signature-based devices.
In statistical machine learning, real-valued features extracted from data are used to construct representations on which data-driven classifiers are trained. For example, when classifying network traffic, features can be extracted from individual connections (flows) or from groups of flows, where a group is defined by the communication of a user with a domain within a predefined time window. Data-driven classifiers have traditionally relied on a manually predefined representation (i.e., feature vectors representing legitimate and malicious communication). Since the accuracy of the classifiers depends directly on these feature vectors, manually predefining the representation is not optimal.
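To make the two levels of feature extraction concrete, the following is a minimal sketch of how flows might be grouped by user, domain, and time window, and then summarized by a single manually predefined feature vector. The flow field names (`user`, `domain`, `timestamp`, `bytes_sent`, `bytes_received`, `duration`), the 300-second window, and the use of a per-feature mean are all illustrative assumptions, not choices taken from the text.

```python
from collections import defaultdict
from statistics import mean

# Assumed window length in seconds; the text only says "predefined".
WINDOW_SECONDS = 300


def flow_features(flow):
    """Real-valued features of one flow (a hypothetical feature choice)."""
    return [
        float(flow["bytes_sent"]),
        float(flow["bytes_received"]),
        float(flow["duration"]),
    ]


def group_flows(flows, window=WINDOW_SECONDS):
    """Group flow feature vectors by (user, domain, time-window index)."""
    groups = defaultdict(list)
    for flow in flows:
        key = (flow["user"], flow["domain"], int(flow["timestamp"] // window))
        groups[key].append(flow_features(flow))
    return dict(groups)


def group_representation(group):
    """Summarize a group of flows by the mean of each feature.

    This fixed aggregation stands in for a manually predefined representation.
    """
    return [mean(column) for column in zip(*group)]


# Toy flow records, invented purely for illustration.
flows = [
    {"user": "alice", "domain": "example.com", "timestamp": 10,
     "bytes_sent": 100, "bytes_received": 900, "duration": 1.0},
    {"user": "alice", "domain": "example.com", "timestamp": 200,
     "bytes_sent": 300, "bytes_received": 1100, "duration": 3.0},
    {"user": "alice", "domain": "example.com", "timestamp": 400,
     "bytes_sent": 50, "bytes_received": 50, "duration": 0.5},
    {"user": "bob", "domain": "example.org", "timestamp": 20,
     "bytes_sent": 10, "bytes_received": 20, "duration": 0.1},
]

groups = group_flows(flows)
vectors = {key: group_representation(group) for key, group in groups.items()}
```

Each entry of `vectors` is one feature vector that a classifier could be trained on. The aggregation in `group_representation` is fixed by hand, which is exactly the kind of design decision on which classifier accuracy then depends.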