The problem of anomaly detection has great practical significance, manifesting in a wide variety of application domains, including detection of suspicious/anomalous events in Internet traffic, in human behavior and/or entity tracking, host-based computer intrusion detection, detection of equipment or complex system failures, as well as of anomalous measurements in scientific experiments, which may indicate either equipment glitches or interesting phenomena. There is substantial prior literature on anomaly detection for network intrusion detection, e.g. [25][9], based on numerous proposed statistical tests and heuristic criteria. Most such approaches will only be effective in detecting specific types of anomalies, within particular networking domains.
The scenario addressed here is the detection of anomalies amongst the T samples in a collected data batch, $X=\{x_i : i=1,\ldots,T,\ x_i \in \mathbb{R}^N\}$. The batch may consist, for example, of the samples collected over a fixed window of time, during which, in the absence of anomalies, these “normal” (known nominal and/or known attack) samples are all expected to follow the same probability law, even if this law is not explicitly known. We assume there is a database exclusively containing “normal” examples that can potentially be leveraged both for learning the “normal” probability model (either on the full N-dimensional space or on lower-dimensional subspaces) and possibly for assessing the statistical significance of detected anomalies by measuring empirical p-values. By a p-value, we mean the probability of making an observation more extreme than a given observation, under an assumed probability law.
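As a minimal sketch of how an empirical p-value might be measured from the database of “normal” examples, the following Python function counts how many database scores are at least as extreme as the score of a test sample. The function name, the convention that a larger score means a more atypical sample, and the add-one correction (which keeps the estimate strictly positive) are assumptions made for illustration and are not taken from the text above.

```python
def empirical_p_value(score, normal_scores):
    """Estimate P(score of a normal sample >= `score`) from a normal database.

    Assumed convention: a larger score means a more atypical sample.
    The +1 in numerator and denominator is a common finite-sample
    correction; it is an illustrative choice, not part of the source text.
    """
    n_extreme = sum(1 for s in normal_scores if s >= score)
    return (n_extreme + 1) / (len(normal_scores) + 1)


# Hypothetical atypicality scores for five "normal" database samples.
database_scores = [0.1, 0.4, 0.2, 0.9, 0.3]
print(empirical_p_value(0.8, database_scores))  # one database score is >= 0.8
```

With only a small database, the smallest achievable p-value is 1/(n+1), so very small p-values can only be resolved with a large set of normal examples.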
There are several reasons why N may be large. First, some applications are inherently high dimensional, with many (raw) features measured. Second, large N may enable greater anomaly detection power. In supervised classification, it may be possible to discriminate known classes using a small number of (judiciously chosen) features that have good (collective) discrimination power. However, anomaly detection is inherently unsupervised: there are generally no anomalous “examples” and no prior knowledge of which subset of raw (and/or derived) features may best elicit anomalies. This suggests that using more features increases the likelihood that a sample will manifest a detectable effect. Moreover, for malicious network traffic packet flows that mimic Web application flows to evade detection, more (rather than fewer) features may typically be required to detect such “evasive” anomalies.
Two standard anomaly detection strategies are: 1) Applying a single test, based on the joint density function defined on the full N-dimensional feature space; 2) Applying multiple tests, e.g. tests on all single-feature and all pairwise-feature densities, with the (highest priority) detected anomaly being the sample yielding the smallest p-value over all the tests. There are two problems with 1). First, if N is large relative to the size of the database of “normal” exemplars, the estimation of the joint density function will be inaccurate (i.e., there is a curse of dimensionality). Second, suppose that the features are statistically independent and that the anomaly only manifests in one (or a small number) of the features. In this case, the joint log-likelihood is the sum of the marginal (single feature) log-likelihoods, and the effect of a single (anomalous) feature on the joint log-likelihood diminishes with increasing N.
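The dilution effect described above can be illustrated numerically under a simplifying assumption that is not in the text: all N features are independent standard normal under the null, and the anomaly places a single feature at a hypothetical value a. In that case the anomaly lowers the joint log-likelihood by (a² − 1)/2 relative to its null expectation, while the null standard deviation of the joint log-likelihood is sqrt(N/2), so the shift measured in null standard deviations shrinks like 1/sqrt(N). The function name below is illustrative.

```python
import math


def standardized_shift(a, n_features):
    """Shift of the joint log-likelihood caused by one anomalous feature,
    in units of the null standard deviation.

    Assumes independent standard normal features (an illustrative model):
      - each feature's log-density has null variance Var(X^2 / 2) = 1/2,
        so the sum over N features has standard deviation sqrt(N / 2);
      - a feature observed at value `a` lowers the log-likelihood by
        (a^2 - 1) / 2 relative to its null expectation.
    """
    shift = (a * a - 1.0) / 2.0
    spread = math.sqrt(n_features / 2.0)
    return shift / spread


for n in (1, 100, 10_000):
    # Roughly 10.6, 1.06 and 0.106 null standard deviations, respectively.
    print(n, round(standardized_shift(4.0, n), 3))
```

A feature value four standard deviations from its mean is glaring when tested alone, yet it moves the joint statistic by only about a tenth of a standard deviation when N = 10,000.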
There are also difficulties with 2). First, there is the complexity associated with using a number of tests that is combinatoric (e.g., quadratic) in N. However, even ignoring complexity, use of many tests may unduly increase the number of false alarms. To give a hint of this, suppose that there is a single anomaly in the batch, detected by only one of the $K=N+N(N-1)/2$ single and pairwise tests, with p-value p. Under the null hypothesis, each of the $K(T-1)$ p-values computed for the other samples exceeds p with probability $1-p$. Assuming that the tests are independent, the probability that no other sample will have a smaller p-value (and thus be falsely detected first, prior to detecting the anomaly), given p, is therefore $(1-p)^{K(T-1)}$, i.e., it is exponentially decreasing in KT. Supposing $p=10^{-5}$, this probability is about 0.9 for $KT=10^{4}$, and it is vanishing by $KT=10^{6}$.
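The figures quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic under the independence assumption already stated; the function names and the particular (N, T) pairs are chosen for illustration so that K(T−1) equals 10^4 and 10^6.

```python
def num_low_order_tests(n_features):
    """K = N + N(N-1)/2: all single-feature plus all pairwise tests."""
    return n_features + n_features * (n_features - 1) // 2


def prob_anomaly_ranked_first(p, n_features, batch_size):
    """Probability that none of the other T-1 samples yields a p-value
    below p in any of the K tests, assuming independent tests."""
    k = num_low_order_tests(n_features)
    return (1.0 - p) ** (k * (batch_size - 1))


# N = 4 gives K = 10; choosing T so that K(T-1) is 1e4 and then 1e6.
print(prob_anomaly_ranked_first(1e-5, 4, 1_001))    # about 0.905
print(prob_anomaly_ranked_first(1e-5, 4, 100_001))  # about 4.5e-05
```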
For high (N)-dimensional feature spaces, we consider detection of anomalous samples amongst a batch of collected samples (of size T), under the null hypothesis that all samples follow the same probability law. Since the features which will best identify possible anomalies are a priori unknown, two common detection strategies are: 1) evaluating atypicality of a sample (its p-value) based on the null distribution defined on the full N-dimensional feature space; and 2) considering a (combinatoric) set of low order distributions, e.g. all singletons and all feature pairs, with detections made based on the smallest p-value yielded over all such low order tests. Approach 1) relies (in some cases, unrealistically) on accurate knowledge/estimation of the joint distribution, while 2) suffers from increased false alarm rates as N and T grow.
Messages over the Internet, e.g., emails or web content, are segmented into Internet Protocol (IP) packets by software in the end-systems. A packet (or datagram) consists of a payload (message) and a header, where the header has the necessary addressing information so that Internet devices (routers, switches) can forward the packet to its intended destination. The 32-bit binary IP addresses are obtained from the familiar domain name through the Internet's DNS name resolution system, e.g., my mail server mail.engr.psu.edu has IP address 130.203.201.3 where, e.g., (decimal) 130 = (binary) 10000010. The payloads of IP packets, containing the message segments, are typically no longer than about 1500 bytes each. IP packets have addressing information in their headers, including the IP five-tuple:
        32-bit source (return) and destination IP addresses;
        16-bit source and destination port numbers, e.g., one could indicate that the packet is part of an HTTP (web) message (port=80), and the other would identify the specific application on the client end-system that initiated the web session;
        the type of layer-4 protocol, e.g., for a web application it would be TCP.
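The correspondence between the dotted-decimal and 32-bit binary forms of an IP address can be checked with Python's standard `ipaddress` module. The helper name below is illustrative; the address is the one quoted above.

```python
import ipaddress


def ipv4_to_binary_octets(dotted):
    """Return the four octets of an IPv4 address as 8-bit binary strings."""
    packed = ipaddress.IPv4Address(dotted).packed  # 4 raw bytes
    return [format(octet, "08b") for octet in packed]


octets = ipv4_to_binary_octets("130.203.201.3")
print(octets)                # first octet is '10000010', i.e. decimal 130
print(len("".join(octets)))  # 32 bits in total
```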
Software (TCP) in the end-systems adapts messaging to the Internet, including message segmentation into packets at the sender and reassembly at the receiver. Reassembly is based on sequence numbers in the TCP (layer 4) portion of the packet headers. Cisco's NetFlow software will process a trace of observed packets (in tcpdump format) to extract flow (or session) level features, where a flow is a group of packets with a common IP 5-tuple (sIP:sport, dIP:dport, protocol) that are proximal in time (such that a time-out threshold is not tripped). For example, if an email is sent to a person's Yahoo email account from my Penn State engineering webmail account, the packets constituting my email would be proximal in time with:
        sIP=130.203.201.3 (my mail server's 32-bit IP address)
        dIP=217.146.187.123 (that of mail.yahoo.com)
        sport=21443 (a random number for the session chosen between $2^{10}$ and $2^{16}$)
        dport=25 (SMTP, meaning email)
        protocol=TCP
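The notion of a flow as a time-proximal group of packets sharing a 5-tuple can be sketched as follows. This is a simplified illustration, not a description of how NetFlow is implemented: the `Packet` record, the function name, and the 60-second default time-out are all assumptions, and each direction of a session is treated as a separate flow.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record: timestamp plus the IP 5-tuple."""
    ts: float      # capture time in seconds
    sip: str
    sport: int
    dip: str
    dport: int
    proto: str
    size: int      # bytes


def group_into_flows(packets, timeout=60.0):
    """Group packets with a common 5-tuple into flows, starting a new flow
    whenever the gap since the previous packet of that 5-tuple exceeds
    `timeout` seconds."""
    flows = []        # list of lists of packets
    open_flow = {}    # 5-tuple -> index of its currently open flow
    last_seen = {}    # 5-tuple -> timestamp of its most recent packet
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = (pkt.sip, pkt.sport, pkt.dip, pkt.dport, pkt.proto)
        if key in open_flow and pkt.ts - last_seen[key] <= timeout:
            flows[open_flow[key]].append(pkt)
        else:
            open_flow[key] = len(flows)
            flows.append([pkt])
        last_seen[key] = pkt.ts
    return flows


email = ("130.203.201.3", 21443, "217.146.187.123", 25, "TCP")
trace = [Packet(0.0, *email, 1500), Packet(0.2, *email, 900),
         Packet(300.0, *email, 400)]  # third packet arrives after the time-out
print([len(f) for f in group_into_flows(trace)])  # [2, 1]
```

Flow-level features such as packet count, total bytes, or duration can then be computed per group.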
Companies and government agencies are interested in monitoring the on-line activities of their employees to protect against:
        Intrusions seeking to steal/exfiltrate sensitive data (e.g., WikiLeaks), damage the targeted network, or deny access to critical resources (denial of service/access to bandwidth, data or computational resources).
        Insider attacks with similar goals targeting unauthorized handling, integrity, and/or availability of network services/resources, data or documents.
        Loss of productivity via potentially frivolous network activity while at work, such as Facebook, BitTorrent, YouTube, etc.
In the West, such monitoring is common in IP networking contexts that are not public commodity networks (e.g., private enterprise networks). Such monitoring, achieved through direct examination of transmitted packets in flight in the network as part of a network-based intrusion detection system (NIDS), extends the monitoring achieved by firewalls using “stateful” (inter-packet) deterministic signatures, and complements packet-level signature checking, host-based intrusion-detection systems (HIDS), protocol anomaly detection (PAD) systems, etc.
Deterministic signatures are based on known attacks. Even deterministic signatures are reluctantly disclosed because of a desire for covert defense, i.e., one does not want a continuing attacker to learn the signature and then modify the attack to evade it. Complex nominal background and attack traffic is difficult to characterize succinctly in a very high-dimensional feature space; it is also time-varying (beyond time-of-day effects and the like) and domain dependent. Potentially threatening anomalies are therefore difficult to characterize with confidence.
In the past, the type of session (e.g., web, VoIP, email) was largely conveyed by the (“well known”, <1024) server port number (=dport of client-transmitted packet in a client-server session). Such port numbers are easily spoofed. Alternatively, deep packet inspection (DPI) of the payloads can convey flow-type information, e.g., a URL (web address) in an HTTP header indicates a web app. Increasingly, these methods are too crude and unreliable in the presence of flow-type obfuscation such as payload encryption, evolving attack vectors, and the running of custom flow/congestion control over UDP (e.g., utorrent), which obfuscates the “protocol” type.
The usability/security trade-off of the target operating environment is often not clearly defined. The IDS needs to be initially calibrated to its operating environment upon deployment (which can be done inexpensively and accurately via our anomaly detection and domain adaptation systems, as we describe below). The IDS then needs to be continuously recalibrated to track evolving attacks and changes to nominal background traffic (in addition to known adaptations to time-of-day, day-of-week, etc.). Also, there is a basic need for anomaly detection of “unknown unknowns” and to explore mock attacks that are variations of known attacks, i.e., “known unknowns”.
The need for domain adaptation is motivated by the unsurprising fact that a classifier (or anomaly detection system) trained at one physical port may perform poorly if tested at a different one (and similarly for time-of-day differences between training and test sets, even at the same physical port). This phenomenon has been observed between different domains with the same traffic mix, thus motivating the need for inexpensive domain adaptation, instead of expensive retraining from scratch (requiring a large pool of labeled training examples) on the target domain.
A statistical classifier is a function or a rule which takes an input pattern called a feature vector and returns one of K possible outputs, called classes. The classifier's decision rule is learned using examples of the (input pattern, class label) pair, such that the learned decision rule is able to classify the labeled examples accurately, and also to generalize well to new patterns (feature vectors) which are generated from the same underlying probability distribution. The labeled examples are called training data. Determining these (ground-truth) training labels usually requires effort, expense, and potentially human labor and expertise. The learning of a classifier in this case is called supervised learning because all the examples used for learning have class labels.
Another common framework called unsupervised learning aims to learn the underlying structure of the data (input patterns) in the absence of any supervision information (usually class labels). An important problem in unsupervised learning is clustering, where the data is partitioned into groups or clusters whose members are similar to each other, and different from the members of other clusters. There are many popular methods for clustering such as the K-means algorithm, hierarchical clustering, and mixture model based clustering. At this point it is useful to make the distinction between classes, clusters, and mixture components. The clusters or groups of data obtained from a clustering method do not take class labels into consideration, and hence a cluster may contain data points from multiple classes, in different proportions. In mixture model based clustering the joint probability distribution of the features is modeled with a mixture of parametric distributions (e.g. Gaussian, Bernoulli, exponential). The individual components of the mixture model are conceptually similar to clusters, with (data) points probabilistically assigned to individual components based on how well the components explain their stochastic generation. When a cluster or component has all of its data from the same class, it is said to be class-pure, whereas in general there is a probability distribution of the class within each mixture component or cluster.
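The probabilistic assignment of a data point to mixture components can be sketched for a one-dimensional Gaussian mixture as follows. The function names and the two-component parameters are illustrative assumptions; the computation is the standard posterior (Bayes rule) over components given fixed mixture parameters, and fitting those parameters (e.g., by expectation-maximization) is not shown.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def component_posteriors(x, weights, means, stds):
    """Posterior probability of each mixture component given the point x:
    weight_k * density_k(x), normalized over all components."""
    joint = [w * gaussian_pdf(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture with equal weights.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
print(component_posteriors(0.0, weights, means, stds))  # equidistant: [0.5, 0.5]
print(component_posteriors(3.0, weights, means, stds))  # mostly the second component
```

Unlike K-means, which assigns each point to exactly one cluster, this soft assignment retains the uncertainty for points lying between components.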
In between the above two frameworks is semisupervised learning, in which only a small portion of the data examples possess class labels, while the rest are unlabeled. In many practical applications, unlabeled samples (data) can be easily collected, while only a limited number of labeled samples are available. Since labeled data may be difficult or time-consuming to obtain, machine learning techniques which make effective use of both labeled and unlabeled data have received considerable recent attention. In the most common objective for semisupervised learning, a small set of labeled samples is augmented with a large number of unlabeled samples, with the aim of learning a better classifier than that learnable using the labeled samples alone. There is potential for this because the unlabeled samples can help to learn a more accurate model for the data distribution, which in turn improves the model of the class posterior distribution of the classifier.
In both supervised and semisupervised classification frameworks, it is assumed that the distribution of the data on which the classifier is trained is the same as the distribution of the data on which it is deployed to make predictions. This assumption may not be true in some situations, and we may be required to train a classifier on labeled data from one domain and apply it to predict on data from a different domain, where the underlying data distribution is different, but not drastically different. Also, obtaining labeled data in this new domain may be difficult and/or expensive, whereas in general there will be plenty of unlabeled data available. This problem arises when there is a contextual difference in the way the data is generated in the two domains. For example, we may have a classifier trained on labeled data obtained at a particular time, or at a particular location, and we may want to be able to make accurate predictions on data obtained at different times or at different locations, for which labeled examples are not available. Also, there could be situations where the distribution of features in the two domains changes conditioned on the value of a latent variable and the class. For example, in a network scenario, the demand profile for a certain type of traffic flow could be different at two sites, domains, or time-of-day, such that the training data was captured when the demand (total bytes or packets) was low, while the test data is captured when the demand is high. In this case the latent variable is a binary valued indicator taking on values “high” or “low” (which may signify different times of day, with different traffic demands).
In such cases, directly porting a classifier may give poor results. At the same time, we want to be able to make use of the labeled data and the classifier available from the different, but related (training) domain. This is known as the classifier domain adaptation problem, which has recently received a lot of attention, particularly from the text, natural language processing, and remote sensing communities. Following the terminology in the literature, we refer to the domain where the classifier is trained using labeled data as the source or training domain, and the domain to which we want to adapt the classifier as the target or test domain. Domain adaptation methods can be categorized as semisupervised or unsupervised depending on whether or not a small amount of labeled data from the target domain is available.