Intrusion detection for cyber defense of networks (defense of computer networks against cyber attacks) has two main pitfalls that result in malicious penetration: computer networks are always evolving, resulting in new, unknown vulnerabilities that are subject to new attacks, and hackers are always obfuscating known (legacy) attack delivery methods to bypass security mechanisms. Intrusion detection sensors that utilize machine learning models are now being deployed to identify these new attacks and these obfuscated legacy attack delivery methods (Peter, M., Sabu, T., Zakirul, B., Ryan, K., Robin, D., & Jose, C., “Security in Computing and Communications”, 4th International Symposium, SSCC 2016 (p. 400), Jaipur, India: Springer (2016), incorporated herein by reference, and U.S. Pat. No. 8,887,285 to Resurgo entitled “Heterogeneous Sensors for Network Defense”, incorporated herein by reference).
State-of-the-art machine learning model evaluation uses statistical analysis to determine model fit for a particular set of data. This works well for typical uses of machine learning (e.g. speech recognition, weather forecasting), but fails to meet cyber defense standards when using machine learning models to detect cyber attacks on networks. Currently, statistical analysis of cyber defense models (machine learning models for defense against cyber attacks) does not test for obfuscated attacks, and is only applied to archived data sets, not to (1) real-time, evolving network traffic; or (2) real-time attack detection.
A. Background of Obfuscation to Thwart Cyber Defense
Cyber attacks, such as malware, should not be thought of as a single unit based on result. Instead, cyber attacks can be broken into their functional components: propagation (i.e. attack delivery) method, exploit, and payload (Berr, T., “PrEP: A Framework for Malware & Cyber Weapons”, George Washington University, Political Science Department & Cyber Security and Policy Research Institute, Washington D.C. (2014), incorporated herein by reference). The propagation (attack delivery) method is the means of transporting malicious code from origin to target. The exploit component takes advantage of vulnerabilities in the target system to enable infection and the operation of the payload. The payload is code written to achieve some desired malicious end, such as deleting data. This three-part component framework becomes important when trying to detect and protect against cyber attacks. Many signatures (for signature-based intrusion detection systems) are based on the payload portion of the attack, because the propagation method and exploit components can vary substantially from target to target.
U.S. Pat. No. 8,887,285 to Resurgo, incorporated herein by reference, discloses how to combine signature-based and machine learning sensors for a more comprehensive computer network defense. This patent details how to build a data set using detection evasion techniques to train the machine learning sensor to cover the “blind-spot” of the signature-based sensor.
Detection evasion techniques in all fields (not just network security) can include obfuscation, fragmentation, encryption, other ways to change form or add variant forms (sometimes called polymorphous or polymorphic), and other detection evasion techniques, and are all referred to in this specification and claims collectively and singly as “obfuscating”, “obfuscation” or “obfuscated attacks”.
Detection evasion techniques as applied to network security were described in more detail in the Resurgo patent's background section:
The blind spot problem for signature-based sensors is compounded by the fact that use of evasion techniques by hackers has proven very effective at enabling known exploits to escape detection. Evasion techniques allow a hacker to sufficiently modify the pattern of an attack so that the signature will fail to produce a match (during intrusion detection). The most common evasion techniques are obfuscation, fragmentation, and encryption. Obfuscation is hiding intended meaning in communication, making communication confusing, willfully ambiguous, and harder to interpret. In network security, obfuscation refers to methods used to obscure an attack payload from inspection by network protection systems. For instance, an attack payload can be hidden in web protocol traffic. Fragmentation is breaking a data stream into segments and sending those segments out of order through the computer network. The segments are reassembled in correct order at the receiving side. The shuffling of the order of data stream segments can change the known attack signature due to the reordering of communication bits. Encryption is the process of encoding messages (or information) in such a way that eavesdroppers or hackers cannot read it, but that authorized parties can. Both the authorized sender and receiver must have an encryption key and a decryption key in order to encode and decode communication. In network attacks, the attack payload can often be encoded/encrypted such that the signature is no longer readable by detection systems. While each evasion technique changes the attack pattern differently, it is important to note that the goal is the same: change the attack pattern enough to no longer match published attack signatures and hence to avoid intrusion detection.
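By way of illustration only, the following minimal Python sketch shows how fragmentation can defeat a signature match performed on individual segments, even though the reassembled stream still contains the signature. The signature bytes, the traffic string, the segment size, and all function names are hypothetical assumptions chosen for the example and are not taken from the Resurgo patent.

```python
import random

# Hypothetical signature and attack stream, chosen only for illustration.
SIGNATURE = b"/bin/sh"
ATTACK_STREAM = b"GET /index?q=AAAA/bin/sh HTTP/1.1"


def signature_match(data, signature=SIGNATURE):
    """Return True if the signature bytes appear contiguously in data."""
    return signature in data


def fragment(stream, size):
    """Split a stream into (sequence_number, segment) pairs."""
    return [(i, stream[off:off + size])
            for i, off in enumerate(range(0, len(stream), size))]


def reassemble(segments):
    """Restore the original stream by ordering segments by sequence number."""
    return b"".join(seg for _, seg in sorted(segments))


# Segments shorter than the signature are sent out of order.
segments = fragment(ATTACK_STREAM, 4)
random.Random(0).shuffle(segments)

# A sensor inspecting each segment in isolation cannot see the signature.
per_segment_hit = any(signature_match(seg) for _, seg in segments)

# The receiving side reassembles the stream, which contains the attack intact.
reassembled_hit = signature_match(reassemble(segments))
```

In this sketch no single segment can match because each segment is shorter than the signature, while the reassembled stream matches. This is consistent with the quoted passage: the attack is unchanged, only its pattern on the wire differs.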
Hackers use the above methods, among others, to vary the propagation method, exploit, and payload components to create obfuscated attacks, even though the name of the cyber attack, and its result, may be the same. The process of U.S. Pat. No. 8,887,285 focuses on the training of machine learning models with obfuscated attacks, but does not consider any methods for evaluating the resulting models beyond standard statistical techniques.
U.S. Pat. No. 9,497,204 B2 to UT Battelle, LLC, incorporated herein by reference, and provisional patent application No. 61/872,047 (from which U.S. Pat. No. 9,497,204 B2 claims priority), incorporated herein by reference, disclose a semi-supervised learning module connected to a network node. The learning module uses labeled and unlabeled data to train a semi-supervised machine learning sensor. However, semi-supervised learning is a model training technique and is not appropriate for model performance evaluation or for identifying obfuscated attacks.
B. Background of Machine Learning
Learning models, statistical models, analytical models and essentially all mathematical models can be used to explain, predict, automate, and analyze information about the nature of things. Fields of study such as machine learning, statistical analysis, and pattern recognition are actively researching new ways to process and handle datasets. To develop these models, a sufficient supply of samples which have been (to the best of ability) properly labeled (i.e. classified) must be supplied. The correctly labeled (i.e. classified) samples are known as the “ground truths”. These samples do not change, so they are static. For machine learning purposes, these samples are not sorted or segregated by their classifications; a set of samples includes both normal network traffic and cyber attacks. The samples of normal network traffic preferably are obtained from the network that is to be protected. The samples of cyber attacks are preferably also obtained from the network that is to be protected, but if there is an insufficient number of samples of cyber attacks on that network, then some or all of the samples of cyber attacks can be provided from an existing repository of cyber attacks.
At a basic level, samples of ground truths can be split into two different datasets: training set and test set. The training set is a sufficiently large number of samples, which contains a sufficiently large number of ground truths, used to train (i.e. generate models using some model generating algorithm, and then select from those models). The training set is used to tune the parameters of a model generation algorithm (Bishop, C. (n.d.), “Pattern Recognition and Machine Learning”, New York: Springer, incorporated herein by reference). In machine learning nomenclature, this is known as the training phase, or learning phase. This training phase involves providing a model generating algorithm with the training data; generating multiple models (that is, generating multiple models using multiple different sets of tuned parameters) that each segregate the training data by label or classification; performing a statistical analysis on the performance of each model (or set of parameters) to determine whether the training data has been accurately segregated; and then selecting the model (or set of parameters) that provides the most accurate segregation, which becomes the trained model.
The multiple different sets of tuned parameters used to generate multiple models are preferably generated by methods such as exhaustive cross-validation, grid search k-fold cross-validation, leave-p-out cross-validation, leave-one-out cross-validation, k-fold cross-validation, or repeated random sub-sampling validation.
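As a non-limiting illustration of the training phase described above, the following Python sketch performs a grid search with k-fold cross-validation over a single tuned parameter. The model generating algorithm used here (a one-dimensional nearest-neighbor classifier), the synthetic labeled samples, the parameter grid, and all function names are assumptions made for the example; the specification does not prescribe them.

```python
# Hypothetical labeled samples: (feature value, label), 0 = normal, 1 = attack.
SAMPLES = [(0.10, 0), (0.20, 0), (0.30, 0), (0.35, 1), (0.40, 0), (0.60, 1),
           (0.65, 0), (0.70, 1), (0.80, 1), (0.90, 1), (0.15, 0), (0.85, 1)]


def knn_predict(train, x, neighbors):
    """Majority label among the `neighbors` training samples closest to x."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def k_fold_error(samples, neighbors, folds):
    """Mean classification error over `folds` held-out folds."""
    errors = []
    for f in range(folds):
        held_out = samples[f::folds]
        train = [s for i, s in enumerate(samples) if i % folds != f]
        wrong = sum(knn_predict(train, x, neighbors) != y for x, y in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / folds


def grid_search(samples, grid, folds=4):
    """Score every parameter value in the grid and select the lowest error."""
    scores = {value: k_fold_error(samples, value, folds) for value in grid}
    best = min(scores, key=scores.get)
    return best, scores


best_neighbors, cv_scores = grid_search(SAMPLES, grid=[1, 3, 5], folds=4)
```

Each parameter value in the grid yields one candidate model per fold; the value with the lowest mean held-out error is selected, which corresponds to selecting the set of parameters that provides the most accurate segregation.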
The training data must have a sufficient number of samples with ground truths (having correct labels or classifications) to train a model that performs well in supervised learning. This is true for all machine learning; however, some training techniques try to utilize unlabeled data. Unlabeled data is simply data that has not been classified by an expert and is therefore unknown in content. Supplementing ground truth data with unknown data in the training phase can be a very cost-effective approach because creating or classifying ground truths can be an exhaustive process. This approach, called semi-supervised learning, is part of the training phase of machine learning and is also disclosed in U.S. Pat. No. 9,497,204 B2 to UT Battelle, LLC, incorporated herein by reference, and U.S. provisional patent application No. 61/872,047 (from which U.S. Pat. No. 9,497,204 B2 claims priority), incorporated herein by reference. Machine learning utilizing only unlabeled data (i.e. no ground truths), also known as unsupervised learning, is closer to anomaly detection but still can be referred to as machine learning. No matter what percentage of ground truth is used in training, the algorithms work to classify the data based on optimizing parameters.
Different model generating algorithms generate models in their own unique ways, such as support vector machine learning algorithms. A support vector machine learning algorithm labels or classifies training data by finding the best separating hyperplane (where hyperplane is a point, line, plane, or volume, depending on the number of dimensions) in a multidimensional space into which the training data has been mapped, separating the space into two or more parts, with each part corresponding to a label or class of the ground truths of the training data. Optionally, support vector machine learning algorithms have tunable parameters which can be adjusted to alter how ambiguous data points are labelled or classified. A support vector machine learning algorithm generates a mathematical model which can assign or determine classifications of new data based on the classifications determined during training (Burges, C., “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, 121-167 (1998), incorporated herein by reference). Other model generating algorithms, such as Logistic Regression, Decision Trees, Naive Bayes, Random Forests, Adaboost, Neural Networks, and K-Means Clustering, similarly generate models that segregate training data into patterns, which can allow labeling or classification of data.
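For illustration, the following Python sketch shows how a model defined by a separating hyperplane assigns a classification to new data. The weight vector, the offset, and the sample points are hypothetical values assumed for the example; they are not learned here, and the sketch does not implement the training procedure of a support vector machine.

```python
# Hypothetical hyperplane w . x + b = 0 in a two-dimensional feature space.
# In practice these values would be produced by a trained model.
WEIGHTS = (1.0, -1.0)
OFFSET = 0.0


def decision_value(x, weights=WEIGHTS, offset=OFFSET):
    """Signed position of x relative to the hyperplane; 0 lies on it."""
    return sum(w * xi for w, xi in zip(weights, x)) + offset


def classify(x):
    """Label a point according to the side of the hyperplane it falls on."""
    return "attack" if decision_value(x) >= 0 else "normal"


label_a = classify((3.0, 1.0))   # decision value  2.0
label_b = classify((1.0, 3.0))   # decision value -2.0
```

The sign of the decision value determines the part of the space, and hence the label, to which a new data point is assigned.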
In machine learning, a trained algorithm will output a model that is useful to make better decisions or predictions, discover patterns in data, or describe the nature of something. Generally, the model generation algorithms are designed to minimize an error equation or optimize some other model parameters. The error rate of a model evaluated on a training set is known as the training error.
The second set of samples containing ground truths is referred to as the test set. The test set may include the remaining ground truths that were not used in the training set, or only a random portion of the remaining ground truths. The model's performance is evaluated on the test set by measuring the error of the test set. This is known as test error (Friedman, J., Tibshirani, R., & Hastie, T., “The Elements of Statistical Learning”, New York: Springer (2001), incorporated herein by reference). The test error of a model is determined by the same methods as the training error. A test set is important because it determines the model's ability to perform on new data (data that did not train the model). In most applications, comparing test error of different generated models allows selecting an optimal model for real-world application.
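A minimal Python sketch of the split into training set and test set, and of the measurement of training error and test error by the same method, is given below. The synthetic samples, the 14/6 split, the fixed random seed, and the simple nearest-centroid model are assumptions made only for the example.

```python
import random

# Hypothetical ground truths: (feature value, label), 0 = normal, 1 = attack.
SAMPLES = ([(i / 20, 0) for i in range(10)] +
           [(0.4 + i / 20, 1) for i in range(10)])


def split(samples, train_count, seed):
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_count], shuffled[train_count:]


def fit_centroids(train):
    """Train the model: mean feature value for each label."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def error_rate(centroids, data):
    """Fraction of samples whose predicted label differs from the ground truth."""
    return sum(predict(centroids, x) != y for x, y in data) / len(data)


train_set, test_set = split(SAMPLES, train_count=14, seed=0)
model = fit_centroids(train_set)
training_error = error_rate(model, train_set)  # error on data that trained the model
test_error = error_rate(model, test_set)       # error on data the model has not seen
```

The same `error_rate` function is applied to both sets; only the data differs, which is what distinguishes test error from training error.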
C. Background of Machine Learning in Cyber Defense
Once cyber defense machine learning models/sensors graduate from the previously described static data testing, they are inserted onto actual computer networks that need cyber defense. At this point, the statistical predictive or determinative calculations that were used previously no longer apply. The predictive or determinative calculations of training errors and test errors only work with a given data set containing ground truths. Actual computer network traffic has no ground truths (because it has not been previously classified) or answer keys, and therefore no means by which to assess the fit of the machine learning model. Model fit testing (i.e. goodness of fit) is still required though, because cyber networks can change overnight or over a prolonged timeframe, thus requiring periodic model retraining. In machine learning, model retraining is effectively starting from scratch, and requires building a brand new data set with ground truths (from which new training data and new test data can be obtained) for training and testing, which can be so burdensome as to prohibit using machine learning sensors at all.
Model fit, referred to as goodness of fit, is how well a trained model does its designed task (e.g. predicts, determines, decides, classifies, or any other model performing function). Model fit traditionally has been calculated with different techniques such as chi-squared test, Kolmogorov-Smirnov test, coefficient of determination, lack of fit sum of squares, etc. The essence of these techniques is to obtain a value that describes the model fit and to compare that to some desired confidence value (i.e. a model fit value that is better than the desired confidence value is determined to be a good fit for the given data). Model fit can also be determined by the test set error relative to some desired confidence value (i.e. if the model produces a low test error when given a test set, then the model may be a good fit). Of course, model fit has to be recalculated for any new data with ground truths because model fit is relative to a static data set.
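As one hedged illustration of obtaining a model fit value and comparing it to a desired threshold, the following Python sketch computes a Pearson chi-squared statistic and also checks a test error against a maximum acceptable error. The observed and expected counts, the error values, and the function names are hypothetical; the critical value 3.841 is the standard tabulated chi-squared value for one degree of freedom at a 0.05 significance level.

```python
# Standard tabulated critical value: chi-squared, 1 degree of freedom, alpha = 0.05.
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841


def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_by_test_error(test_error, max_error):
    """Alternative check: test error must not exceed the desired maximum."""
    return test_error <= max_error


# Hypothetical counts of [normal, attack] samples in a labeled data set.
observed = [90, 10]   # counts according to the ground truths
expected = [85, 15]   # counts according to the model

statistic = chi_squared_statistic(observed, expected)
# Two categories give one degree of freedom; a smaller statistic is a better fit.
fit_is_acceptable = statistic <= CRITICAL_VALUE_DF1_ALPHA_05

error_check = fits_by_test_error(test_error=0.04, max_error=0.05)
```

Both checks require ground truths, which is why model fit must be recalculated whenever a new labeled data set is used.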
In current practice, model fit analysis on live networks does not take place, and model retraining is triggered by human subjective assessment based on observables, including the following:
Machine learning model output (i.e. attack alert log) quantity becomes too great for human analysis. In other words, the model alerts on attacks (or false attacks) too often for human analysts to be able to research or react to every alert. Model retraining is implemented to reduce the quantity of alerts.
Forensic analysis of the network traffic (sometimes initiated by the machine learning model's alert logs) determines that the machine learning model had false-positives. In sensing, false-positive alerts are those in which the sensor (or model) indicates an event when none really occurred. Model retraining is enacted as a reactionary step to reduce the quantity of false-positives.
Forensic analysis of the network traffic (sometimes initiated by a signature-based sensor alert or by a network disruption) determines that the machine learning model had false-negatives. In sensing, false-negatives are a lack of alert when an event actually occurred. Model retraining is enacted as a reactionary step to reduce the quantity of false-negatives.
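The false-positive and false-negative observables described above can be tallied as in the following Python sketch, once forensic analysis has established what actually occurred. The alert and forensic records shown are hypothetical, as are the function and variable names.

```python
def confusion_counts(alerts, truths):
    """Count alert outcomes given forensic ground truth for each event."""
    counts = {"true_positive": 0, "false_positive": 0,
              "false_negative": 0, "true_negative": 0}
    for alerted, was_attack in zip(alerts, truths):
        if alerted and was_attack:
            counts["true_positive"] += 1
        elif alerted and not was_attack:
            counts["false_positive"] += 1   # alert, but no event occurred
        elif not alerted and was_attack:
            counts["false_negative"] += 1   # event occurred, but no alert
        else:
            counts["true_negative"] += 1
    return counts


# Hypothetical records: model alerts and the forensic finding for each event.
alerts = [True, True, False, False, True, False]
truths = [True, False, True, False, True, False]

counts = confusion_counts(alerts, truths)
false_positive_rate = counts["false_positive"] / (
    counts["false_positive"] + counts["true_negative"])
false_negative_rate = counts["false_negative"] / (
    counts["false_negative"] + counts["true_positive"])
```

These counts are only available after the fact, because the forensic finding for each event serves as the ground truth that live network traffic lacks.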
These three current methods for determining the need for model retraining are based on human subjective assessment and not on actual model fit to the data (actual network traffic). As a result, decisions are being made in machine learning cyber defense without appropriate context and without knowledge of the cyber defense implications. For example, if hackers know a network is employing machine learning sensors, they could create attacks that generate a large quantity of alerts. The sudden increase and sheer quantity of alerts would cause the analysts to request model retraining while they either ignore the machine learning alerts or set them aside for retroactive, forensic analysis. With the machine learning sensing being ignored, the hackers could then be free to go after their primary goal, until the model has been retrained and reinstalled in the network.
D. Need for a New Validation Test for Pre-Deployment (i.e. Laboratory) Model Fit Testing
Because most cyber network attacks from sophisticated actors (i.e. the really dangerous hackers) are actually the obfuscated type (Du, H., “Probabilistic Modeling and Inference for Obfuscated Network Attack Sequences”, Thesis, Rochester Institute of Technology (2014), incorporated herein by reference), it was realized that conventional methods of model evaluation, namely training and test errors, were inadequate for intrusion detection. Conventional use of training errors and test errors only considers trained attacks (attacks based on training data) and untrained attacks (attacks not based on training data), while cyber defense model fit testing must include trained, untrained, and obfuscated attacks (attacks based on training data that has been obfuscated), for a more accurate prediction of model performance on actual cyber networks.
Conventional model testing considers obfuscated attacks to be part of untrained test data. However, this skews test error results because these attacks were not truly untrained data points. Effectively, the attacks were indeed used in model training, just not those obfuscated instances of the attacks. An example: a buffer overflow of a website input field is used in training, but an obfuscated version uses fragmentation of the communication stream to reorder the packets of the very same attack. Conventional model testing would consider the obfuscated attack as part of untrained (or new) data. However, this categorization is not valid for cyber network defense because hackers have countless methods for obfuscating known attacks: the known attacks can take many different forms, that is, they can be polymorphous or polymorphic. In cyber intrusion detection, machine learning model fit analysis needs to consider training errors and test errors for the following types of attacks:
Known attacks contained in training data;
Zero-day attacks or unknown attacks, unrelated to training data; and
Obfuscated (including polymorphous or polymorphic) attacks that are similar in propagation method, exploit, or payload components to training data. Example: a trained attack contains a website user input field (propagation method), buffer overflow (exploit), and a trojan (payload); an obfuscated attack could contain the same website user input field and buffer overflow exploit, but contain a different payload such as a virus.
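The effect of separating the three types of attacks listed above can be sketched in Python as follows. The evaluation records, the category names, and the resulting error values are hypothetical and serve only to show how pooling obfuscated attacks with zero-day attacks hides the per-category errors.

```python
from collections import defaultdict

# Hypothetical evaluation records: (attack category, true label, predicted label),
# where label 1 = attack and 0 = normal.
RESULTS = [
    ("known", 1, 1), ("known", 1, 1), ("known", 1, 1), ("known", 1, 1),
    ("zero_day", 1, 1), ("zero_day", 1, 0),
    ("obfuscated", 1, 0), ("obfuscated", 1, 0),
    ("obfuscated", 1, 1), ("obfuscated", 1, 0),
]


def error_by_category(results):
    """Error rate computed separately for each attack category."""
    totals, wrong = defaultdict(int), defaultdict(int)
    for category, truth, predicted in results:
        totals[category] += 1
        if truth != predicted:
            wrong[category] += 1
    return {category: wrong[category] / totals[category] for category in totals}


def pooled_error(results, categories):
    """Single error rate over several categories treated as one group."""
    subset = [r for r in results if r[0] in categories]
    return sum(truth != predicted for _, truth, predicted in subset) / len(subset)


per_category = error_by_category(RESULTS)
# Conventional testing would report only this pooled "untrained" figure.
conventional_untrained = pooled_error(RESULTS, {"zero_day", "obfuscated"})
```

With these assumed records, the pooled figure lies between the zero-day error and the obfuscated error, so neither is visible unless the categories are reported separately.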