The term “malware” is short for malicious software and is used to refer to any software designed to infiltrate or damage a computer system without the owner's informed consent. Malware can include viruses, worms, trojan horses, rootkits, adware, spyware and any other malicious and unwanted software. Many computer devices, such as desktop personal computers (PCs), laptops, personal digital assistants (PDAs) and mobile phones, can be at risk from malware.
Detecting malware is often challenging, as malware may be designed to be difficult to detect, often employing technologies that deliberately hide its presence on a system. For example, a malware application may not show up in the operating system tables that list the processes currently running on a computer.
An anti-virus application for detecting viruses and other malware may make use of various methods to detect malware, including file scanning, integrity checking and heuristic analysis. During file scanning, the anti-virus application examines files for the presence of virus fingerprints or “signatures” (i.e. code sequences) that are characteristic of known malware. Typically, this requires that the anti-virus application make use of a database of signatures that is pushed to it, for example from an Internet-based server.
An example of a heuristic analysis detection approach involves collecting features arising during execution of a code sample. Examples of features that may be collected during execution are stacks, heaps, strings, API calls and their parameters. Due to the malicious nature of the software being searched for, execution cannot safely be performed live on a computer device, so instead the execution takes place in a sandbox environment. A sandbox is a virtual environment that has a very tightly-controlled set of resources. This allows unknown or untrusted software to be executed in such a way that any malicious activity does not affect the computer device on which it is being executed. The software is executed within the sandbox, and the features of its execution, as described above, are collected and analysed to detect the existence of malware. Analysis may involve comparing the detected features against features previously identified by analysing known malware (that analysis being done at a back end server of the anti-virus application provider).
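By way of illustration only, the file-scanning approach described above might be sketched as follows. This is a minimal sketch that assumes signatures are plain byte sequences matched by substring search; the signature names, the byte patterns and the function name `scan_bytes` are hypothetical and do not correspond to any real anti-virus product or real malware.

```python
from typing import Dict, List

# Hypothetical signature database (name -> byte sequence). In practice such a
# database would be pushed to the client from a server; these entries are
# invented purely for illustration.
EXAMPLE_SIGNATURES: Dict[str, bytes] = {
    "Example.Sig.A": bytes.fromhex("deadbeef"),
    "Example.Sig.B": b"EVIL_MARKER",
}


def scan_bytes(data: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the sorted names of all signatures found anywhere in `data`."""
    return sorted(
        name for name, pattern in signatures.items() if pattern in data
    )


if __name__ == "__main__":
    sample = b"header" + bytes.fromhex("deadbeef") + b"trailer"
    print(scan_bytes(sample, EXAMPLE_SIGNATURES))
```

A real scanner would typically use more efficient multi-pattern matching and would read file contents from disk; the substring test is used here only to keep the sketch short.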
When trying to detect malware, it is important to avoid false positives as much as possible. A false positive is returned when the anti-virus application identifies software as being suspected malware, when in fact it is not. False positives create inconvenience and product dissatisfaction for users, who only want the anti-virus application to detect genuine malware, and are also undesirable from the point of view of the anti-virus application providers as they result in increased workload arising from customer queries and complaints.
A method for reducing false positives might be as follows:
i) scan a sample set of clean files and collect all features, counting each unique feature only once and ignoring duplicates;
ii) scan a set of malware files and collect all features, counting each unique feature only once and ignoring duplicates;
iii) remove the features found in the set of clean files from those found in the malware files;
iv) determine the most common feature that is found in the set of malware files;
v) if the most common feature within the set of malware files is present in more than a certain pre-defined number of files, then that feature can be saved to a database as being suitable for generic detection of malware, and the files in which that feature is found are removed from the inspected set of malware files;
vi) repeat from step iv) looking at the most common feature found in the remaining files in the set of malware files until a minimum feature count is reached, and ignore all further features.
The features recorded in the database are identified as being characteristic of malware, and are therefore distributed to clients running the anti-virus application.
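Steps i) to vi) above might be sketched as follows. This is an illustrative sketch only, under stated assumptions: each file is represented as a set of feature strings (so duplicates within a file are already ignored), and the "minimum feature count" of step vi) is interpreted as the same pre-defined threshold used in step v). The function name `select_generic_features` and the parameter `min_file_count` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def select_generic_features(
    clean_files: Dict[str, Set[str]],
    malware_files: Dict[str, Set[str]],
    min_file_count: int,
) -> List[str]:
    """Greedily pick features that are common in malware and absent from clean files."""
    # Steps i) to iii): collect clean features and remove them from every malware file.
    clean_features: Set[str] = set().union(*clean_files.values())
    remaining = {
        name: feats - clean_features for name, feats in malware_files.items()
    }

    selected: List[str] = []
    while remaining:
        # Step iv): count in how many remaining files each feature occurs.
        counts = Counter(f for feats in remaining.values() for f in feats)
        if not counts:
            break
        # Most common feature; ties are broken alphabetically (an assumption).
        feature, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        # Step vi): stop once the best feature no longer exceeds the threshold.
        if count <= min_file_count:
            break
        # Step v): save the feature and drop the files that contain it.
        selected.append(feature)
        remaining = {n: f for n, f in remaining.items() if feature not in f}
    return selected
```

For example, given malware files whose features (after removal of clean features) are {x, y}, {x, z}, {x, y}, {z, w} and {w}, a threshold of 1 selects x first (three files) and then w (two of the remaining files).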
Despite such efforts to eliminate features likely to give rise to false positives, there remains a high risk of features being selected that are unsuitable for malware detection. The empirical nature of the feature rejection process also gives rise to a lack of confidence in the chosen features. It is, of course, possible to use the created database of features to scan a further selection of known clean files, and then remove from the database any features that identify any of the known clean files as malware. However, this process can take a considerable amount of time, and still does not provide any certainty that all the features are suitable.
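The follow-up check described above, in which the feature database is tested against a further selection of known clean files, might be sketched as follows. Again this is only a sketch: features are assumed to be strings, each clean file is assumed to be represented by the set of features collected from it, and the function name `prune_false_positive_features` is hypothetical.

```python
from typing import Dict, Set


def prune_false_positive_features(
    database: Set[str], clean_files: Dict[str, Set[str]]
) -> Set[str]:
    """Return the database without any feature that also occurs in a known clean file."""
    flagged: Set[str] = set()
    for features in clean_files.values():
        # Any database feature seen in a clean file would cause a false positive.
        flagged |= database & features
    return database - flagged
```

The cost of this check grows with the number of clean files scanned, and a feature that survives it may still occur in clean files outside the tested selection, which is consistent with the limitation noted above.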