The present specification generally relates to improvements in relation to machine learning based malware detection.
Nowadays, malware (“malicious software”) scanning is a vital issue in any kind of network, and is generally directed to identifying (and potentially also disinfecting) any kind of malware on computer and/or communication systems, such as viruses, Trojans, worms, or the like. Malware scanning techniques include, for example, signature based scanning and heuristic based scanning.
For signature based techniques, once a piece of malware is identified, it is analyzed, and a proper distinctive signature of the file is extracted and added to a signature database of a malware detection/protection system.
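By way of illustration only, the signature based approach described above may be sketched as follows. The sketch assumes that the signature is simply a SHA-256 digest of the whole file content; practical systems may use other kinds of signatures. The names `SignatureDatabase`, `add` and `scan` are hypothetical and are not taken from any particular product.

```python
import hashlib
from typing import Dict, Optional


def file_signature(data: bytes) -> str:
    """Assumed signature: the SHA-256 digest of the complete file content."""
    return hashlib.sha256(data).hexdigest()


class SignatureDatabase:
    """Hypothetical in-memory signature database (illustration only)."""

    def __init__(self) -> None:
        # Maps a signature to the name given to the identified malware.
        self._signatures: Dict[str, str] = {}

    def add(self, data: bytes, name: str) -> str:
        """Extract the signature of an identified sample and store it."""
        signature = file_signature(data)
        self._signatures[signature] = name
        return signature

    def scan(self, data: bytes) -> Optional[str]:
        """Return the malware name if the signature is known, else None."""
        return self._signatures.get(file_signature(data))


if __name__ == "__main__":
    database = SignatureDatabase()
    database.add(b"known malicious payload", "Example.Sample.A")
    print(database.scan(b"known malicious payload"))
    print(database.scan(b"known malicious payload, slightly modified"))
```

In this sketch, a sample that differs from a known sample by even a single byte produces a different digest and is therefore not matched, which illustrates why exact signatures alone do not cover yet unknown malware variants.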
For heuristic based techniques, a generic signature or any other suitable feature combination is determined that is common to a group of malware variants and distinguishes them from non-malicious software. Such feature combinations are expected to be generic and flexible enough that detection of as yet unknown malware is also enabled to a certain extent.
The present specification relates to scenarios in which a machine learning model is used to detect maliciousness of incoming, previously unseen objects, and where the performance of the model needs to be monitored.
For such an approach, a model is trained over some pre-existing data (“training data” or “training set”). This trained model is deployed to produce predictions about new relevant objects. The performance of the model needs to be monitored continuously over time, because, for example, threats evolve and training sets are not perfect (e.g., might be biased). To maintain the required level of performance, the training set needs continuous maintenance and the model needs periodic re-training.
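Purely as an illustrative sketch, continuous monitoring of a deployed model may be pictured as tracking its accuracy over the most recent predictions and signalling when re-training appears to be needed. The sketch assumes that the true label of a checked object becomes available at some later point, which is not always the case in practice. The class `PerformanceMonitor`, its parameters `window_size` and `min_accuracy`, and the accuracy criterion itself are hypothetical choices made for this example.

```python
from collections import deque
from typing import Deque, Optional


class PerformanceMonitor:
    """Hypothetical sliding-window accuracy monitor (illustration only)."""

    def __init__(self, window_size: int, min_accuracy: float) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # Only the most recent `window_size` outcomes are kept.
        self._outcomes: Deque[bool] = deque(maxlen=window_size)
        self._window_size = window_size
        self._min_accuracy = min_accuracy

    def record(self, predicted_malicious: bool, actually_malicious: bool) -> None:
        """Store whether a prediction matched the later-confirmed label."""
        self._outcomes.append(predicted_malicious == actually_malicious)

    def accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_retraining(self) -> bool:
        """Signal re-training once a full window falls below the threshold."""
        if len(self._outcomes) < self._window_size:
            return False
        return sum(self._outcomes) / len(self._outcomes) < self._min_accuracy


if __name__ == "__main__":
    monitor = PerformanceMonitor(window_size=4, min_accuracy=0.75)
    for predicted, actual in [(True, True), (False, False), (True, False), (False, True)]:
        monitor.record(predicted, actual)
    print(monitor.accuracy(), monitor.needs_retraining())
```

Waiting for a full window before signalling is a design choice of this sketch that avoids reacting to the first few predictions alone; other criteria, such as false positive rate, could be tracked in the same manner.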
When designing machine learning based systems for security, particularly those that aim to detect unknown, previously unseen malicious objects (“malware”), it becomes evident that maintaining such learned models and guaranteeing the quality of their decisions is not a trivial issue.
Prior art which relates to this field can be found in document EP 08 97 566 B1, disclosing monitoring and retraining a neural network.
This document addresses the processing of mobile operators' data for detecting “anomalous” instances (events, states and so forth) that indicate potential fraud involving phones and their identifiers, bank cards etc. In particular, a thorough review of the main problems in this area is provided. The document particularly focuses on training (or updating) a new model of the same topology while the old one is still functioning, serialization of existing models, and using a persistence mechanism for keeping their state. That is, a way to make retraining of neural network based models fast and seamless is proposed.
Further prior art which relates to this field can be found in document US 2015 03 55 901 A1, disclosing a method and a system to automate the maintenance of data-driven analytic models.
According to this document, it is identified that a data-driven analytic model tends to misbehave, an estimate of the useful lifetime of the model is forecast, and the model is modified to accommodate the noticed misbehavior on the basis of anomalies caught in the output characteristics of the monitored models.
Further prior art which relates to this field can be found in document US 2015 00 74 023 A1, disclosing an unsupervised behavior learning system and a method for predicting performance anomalies in distributed computing infrastructures.
According to this document, anomalies in environments that provide infrastructure as a service (IaaS) are predicted. To this end, unsupervised learning based models are utilized that are trained to identify pre-fault states of monitored virtual and physical machines and then to notify system administrators about potential faults and their reasons. In particular, states of naturally different instances (computation nodes) are monitored.
Further prior art which relates to this field can be found in document U.S. Pat. No. 9,336,494 B1, disclosing re-training a machine learning model.
According to this document, machine learning models operating in the financial knowledge domain (e.g. price predictions, card fraud detection, financial product transactions and so forth) are addressed. In detail, it is disclosed how to detect whether a model misbehaves (based on checking model predictions for a set of time-ordered instances within a sliding window of predefined size) and how to fix the misbehaving model.
If malware is detected on the basis of a detection model which is learned beforehand on the basis of pre-existing data (“training data” or “training set”), the reliability of the decision on maliciousness of checked objects strongly depends on how representative the pre-existing data is of the checked objects. If the pre-existing data (on the basis of which the detection model is learned) is not (or no longer) representative of the objects expected to be checked, the reliability of the results of the respective checks deteriorates.
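As a rough illustration of what a lack of representativeness may look like in measurable terms, the following sketch compares how often each value of a single categorical feature occurs in the training data and in the incoming objects, using the total variation distance between the two frequency distributions. Both the choice of this distance and the example feature (a file type label) are assumptions made for illustration; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def category_frequencies(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each distinct value in a non-empty collection."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one value is required")
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(
    training_values: Iterable[Hashable],
    incoming_values: Iterable[Hashable],
) -> float:
    """Distance in [0, 1]: 0 for identical distributions, 1 for disjoint ones."""
    p = category_frequencies(training_values)
    q = category_frequencies(incoming_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


if __name__ == "__main__":
    # Hypothetical file type labels of training objects and incoming objects.
    training = ["exe", "exe", "dll", "doc"]
    incoming = ["exe", "js", "js", "js"]
    print(total_variation_distance(training, incoming))
```

A value close to 0 suggests that, with respect to the chosen feature, the incoming objects resemble the training data, whereas a value close to 1 suggests that the training data may no longer be representative.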
Accordingly, it is evident that available systems for responding to security threats suffer from various drawbacks, and it is thus desirable to improve machine learning based malware detection systems so as to overcome such drawbacks.