Traditional malware scanning methods depend on knowing malware signatures beforehand. After collecting known malware samples, a backend system generates a malware pattern from the instance-based signatures of those samples and distributes the pattern to customers. This is called the “virus dictionary approach.” This approach is considered reliable and causes only a marginal number of false positives. Because of its accuracy in detecting known malware, this approach is used extensively in the industry. The open source virus scanner “ClamAV” is one example.
The use of a virus dictionary, however, has some disadvantages. Such a scanner will not identify unknown malware. Systems protected by this approach are therefore exposed to new threats between the time the malware is released to the field and the time the backend system delivers a new pattern to the customer site. Another disadvantage occurs when new variants of existing malware are released. If the virus dictionary relies on exact-match techniques such as an SHA-1 hash, then a new variant, which may differ from the original by only a single byte, produces a different hash and will not be found in the virus dictionary. Also, the number of malware programs has grown dramatically in the past couple of years, and hash-based malware patterns bloat the size of the dictionary accordingly. Identifying malware using large malware dictionaries can consume too much memory and too many CPU cycles.
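The variant problem described above can be illustrated with a minimal sketch. It assumes a dictionary that stores only SHA-1 digests of whole files; the sample bytes and the names virus_dictionary, is_known_malware, and variant are hypothetical and are not taken from ClamAV or any other scanner.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical stand-in for a collected malware sample (not real malware).
known_sample = b"MZ\x90\x00 hypothetical malicious payload"

# Assumed dictionary format: a set of SHA-1 digests of known samples.
virus_dictionary = {sha1_hex(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Exact-match lookup: flags a file only if its digest is in the dictionary."""
    return sha1_hex(data) in virus_dictionary


# A "variant" that differs from the known sample by one appended byte.
variant = known_sample + b"\x00"

original_detected = is_known_malware(known_sample)
variant_detected = is_known_malware(variant)

print("original detected:", original_detected)
print("variant detected:", variant_detected)
```

Under these assumptions the original sample is flagged and the one-byte variant is not, because the two inputs hash to different digests. Each added sample also contributes one more digest (20 bytes in raw form for SHA-1), so a dictionary of this kind grows linearly with the number of known samples.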
Due to the inadequacy of current techniques in detecting unknown malware and preventing zero-day attacks, some systems are based on behavior monitoring. For example, in the paper titled “Learning and Classification of Malware Behavior,” the tool CWSandbox is used for extracting features and a support vector machine (SVM) is used for learning and classification. Run-time behavior monitoring, however, has a number of disadvantages: it requires more computational power from the defending machines, which in turn drags down the performance of all other programs on the same platform; and some malware does not exhibit its malicious behavior if it can determine that it is being monitored (for example, while it is in a sandbox).
The following issues also need to be addressed: it may be necessary to identify previously unknown malware variants in an organization; an organization may not want to report malware to anti-virus companies due to privacy concerns; and it is important to minimize the computational burden on the client machines within an organization, in terms of both memory usage and CPU cycles. Regarding privacy, the organization may not want to divulge the raw file to a virus researcher, which makes virus detection and signature generation difficult.
Thus, it is desirable to speed up virus scanning and to reduce the memory footprint without relying on instance-based malware patterns or behavior monitoring.