Machine learning is a technique that uses the high-speed processing power of modern computers to execute algorithms that learn predictors of the behavior or characteristics of data. Machine learning techniques may execute algorithms on a set of training samples (a training set) with a known class or label, such as a set of files known to exhibit malicious or benign behaviors, to learn characteristics that will predict the behavior or characteristics of unknown samples, such as whether unknown files are malicious or benign.
Many current approaches to machine learning use algorithms that require a static training set. Such approaches (for example, those based on decision trees) assume that all training samples are available at training time. There exists a class of supervised machine learning algorithms, known as on-line or continuous learning algorithms, that update the model on each new sample. However, these algorithms assume that each new sample will be classified by an expert user.
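The contrast between a static training set and on-line learning can be illustrated with a minimal sketch. It is not taken from any of the cited references: a simple perceptron stands in for the learner, and the names `train_static`, `online_update`, and `predict`, the feature values, and the labels are all hypothetical. Note that the on-line step still needs a label for every new sample, which is the expert-adjudication assumption noted above.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample format: (feature vector, label in {-1, +1}).
Sample = Tuple[Sequence[float], int]


def predict(weights: List[float], bias: float, x: Sequence[float]) -> int:
    """Classify one feature vector with a linear decision rule."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


def online_update(weights: List[float], bias: float, x: Sequence[float],
                  label: int, lr: float = 1.0) -> Tuple[List[float], float]:
    """On-line step: adjust the model from a single new labeled sample."""
    if predict(weights, bias, x) != label:
        weights = [w + lr * label * xi for w, xi in zip(weights, x)]
        bias = bias + lr * label
    return weights, bias


def train_static(samples: List[Sample], epochs: int = 10) -> Tuple[List[float], float]:
    """Static training: the complete training set must be available up front."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in samples:
            weights, bias = online_update(weights, bias, x, label)
    return weights, bias


# Illustrative data only: +1 and -1 are placeholder class labels.
training_set: List[Sample] = [
    ([2.0, 1.0], 1), ([1.5, 2.0], 1),
    ([-1.0, -1.5], -1), ([-2.0, -0.5], -1),
]
weights, bias = train_static(training_set)

# A new sample arrives later; the on-line step requires its label.
weights, bias = online_update(weights, bias, [-0.5, -2.0], -1)
```

In this sketch the static trainer cannot incorporate a sample that was not present when `train_static` was called, whereas `online_update` can, but only when someone supplies the label.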
A relevant machine learning method is batch mode active learning (BMAL). BMAL constructs a new classifier that is retrained based on a batch of new samples in an optionally repeatable process. BMAL, however, focuses on the selection of unlabeled samples to present to the user for adjudication. BMAL conducts repeated learning until some objective performance criterion is met. Additionally, BMAL does not cover the case in which the training data is split between multiple locations, such that the original training and test data must be sent to the user's location, where new samples are added.
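The BMAL process described above (select a batch of unlabeled samples, have the user adjudicate them, retrain, and repeat until a performance criterion is met) may be sketched as follows. This is an illustrative sketch only, not the method of any cited reference: the one-dimensional features, the toy nearest-centroid classifier, the margin-based uncertainty measure, and the names `bmal`, `oracle`, `batch_size`, and `target` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def fit_centroids(labeled: List[Tuple[float, int]]) -> Dict[int, float]:
    """Train a toy nearest-centroid classifier on (feature, label) pairs."""
    sums: Dict[int, List[float]] = {}
    for x, y in labeled:
        total, count = sums.get(y, [0.0, 0])
        sums[y] = [total + x, count + 1]
    return {y: total / count for y, (total, count) in sums.items()}


def classify(centroids: Dict[int, float], x: float) -> int:
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def uncertainty(centroids: Dict[int, float], x: float) -> float:
    """Higher when x is nearly equidistant from the two nearest centroids."""
    dists = sorted(abs(x - c) for c in centroids.values())
    return -(dists[1] - dists[0])


def accuracy(centroids: Dict[int, float], test: List[Tuple[float, int]]) -> float:
    return sum(classify(centroids, x) == y for x, y in test) / len(test)


def bmal(labeled: List[Tuple[float, int]], pool: List[float],
         oracle: Callable[[float], int], test: List[Tuple[float, int]],
         batch_size: int = 2, target: float = 0.95,
         max_rounds: int = 10) -> Tuple[Dict[int, float], int]:
    """Repeat batch selection, adjudication, and retraining until the target is met."""
    labeled, pool = list(labeled), list(pool)
    centroids = fit_centroids(labeled)
    rounds = 0
    while pool and rounds < max_rounds and accuracy(centroids, test) < target:
        # Select the batch of unlabeled samples the classifier is least sure about.
        pool.sort(key=lambda x: uncertainty(centroids, x), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        # The oracle stands in for the user who adjudicates each selected sample.
        labeled += [(x, oracle(x)) for x in batch]
        centroids = fit_centroids(labeled)
        rounds += 1
    return centroids, rounds


# Illustrative data: the (hypothetical) true class boundary is at 5.0.
seed = [(0.0, 0), (6.0, 1)]
unlabeled = [2.0, 3.5, 4.8, 7.0]
held_out = [(1.0, 0), (4.0, 0), (4.5, 0), (6.0, 1), (9.0, 1)]
model, rounds_used = bmal(seed, unlabeled, lambda x: 1 if x > 5.0 else 0, held_out)
```

The sketch assumes that the labeled data, the unlabeled pool, and the test data all reside in one place, which is the limitation noted above for training data that is split between locations.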
Other relevant prior art methods are described in the following patents and published applications. For example, U.S. Pat. No. 6,513,025 (“the '025 patent”), entitled “Multistage Machine Learning Process,” involves partitioning training sets by time interval and generating multiple classifiers (one for each interval). In the preferred embodiment, the time intervals are cyclic or periodic (fixed frequency). The '025 patent leverages dependability models (a method for selecting which classifier model to use based on the system input) to determine which classifier to use. In addition, the classifier update and training sample addition methods in the '025 patent are continuous. The '025 patent is also limited to telecommunications network lines.
U.S. Pre-Grant Publication No. 20150067857 (“the '857 publication”) is directed towards an “In-situ Trainable Intrusion Detection System.” The system described in the '857 publication is based on semi-supervised learning (that is, it uses some unlabeled samples). Learning is based on network traffic patterns (netflow or other flow metadata), not files. The '857 publication uses a Laplacian Regularized Least Squares learner and does not include a method for allowing users to select between classifiers or to view an analysis of the performance of multiple classifiers. The '857 publication also uses only in-situ samples (samples from the client enterprise).
U.S. Pre-Grant Publication No. 20100293117 (“the '117 publication”), entitled “Method and System for Facilitating Batch Mode Active Learning,” discloses a method for selecting documents to include in a training set based on an estimate of the “reward” gained by including each sample in the training set (an estimate of the performance increase). The reward can be based on an uncertainty associated with an unlabeled document or on document length. The '117 publication does not disclose detecting malicious software or files. U.S. Pre-Grant Publication No. 20120310864 (“the '864 publication”), entitled “Adaptive Batch Mode Active Learning for Evolving a Classifier,” focuses on applying batch mode active learning to image, audio, and text data (not binary files and not for the purpose of malware detection). Moreover, the '864 publication requires the definition of a stop criterion, which is typically based on a predetermined desired level of performance. Importantly, the method of the '864 publication lacks accommodations for in-situ learning, such as the potential need to provide a partial training corpus or to express that corpus as feature vectors instead of maintaining the full samples.
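The notion of expressing a training corpus as feature vectors instead of maintaining the full samples can be illustrated with a short sketch. The choice of a normalized byte histogram as the feature representation is an assumption made purely for illustration and is not drawn from any of the cited references; the function name `byte_histogram`, the sample bytes, and the labels are likewise hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def byte_histogram(data: bytes) -> List[float]:
    """Return a 256-bin normalized byte-frequency vector for one sample."""
    counts = Counter(data)      # iterating over bytes yields integers 0..255
    total = len(data) or 1      # avoid division by zero for an empty sample
    return [counts.get(b, 0) / total for b in range(256)]


# Hypothetical raw samples paired with placeholder labels (1 or 0).
raw_samples: List[Tuple[bytes, int]] = [
    (b"\x00\x00\x01\xff", 1),
    (b"MZ\x90\x00", 0),
]

# Only the feature vectors and labels are retained; the raw bytes can be discarded.
corpus: List[Tuple[List[float], int]] = [
    (byte_histogram(raw), label) for raw, label in raw_samples
]
```

A corpus held in this form could be used for training without disclosing the original file contents, although how much of a sample can be inferred from its features depends on the representation chosen.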
Existing machine learning techniques disclose learning algorithms and processes but do not cover a method for augmenting or retraining a classifier based on data not accessible to the original trainer. Nor do existing techniques enable training on data samples that an end user does not wish to disclose to the third party that was originally responsible for conducting the machine learning.
Additionally, prior art malware sensors have an inherent problem: each instance of a malware sensor (anti-virus, IDS, etc.) is identical, provided that its signatures or rule sets are kept up to date. Because each deployment of such a cyber-defense sensor is identical, a bad actor or malware author may acquire the sensor and then test and alter the malware until it is not detected. Malware that evades one instance of the sensor then evades every instance, which makes all such sensors vulnerable.