Virus detection is a problem with surprisingly wide impact, affecting any computer user who is routinely asked to take preventative measures against viruses, such as buying and running antivirus software. A virus is data, in the form of text, executable code, etc., that is added to or overwrites data in a user's file without the user's authorization, and generally without the user's knowledge. Research in the area of virus detection includes various heuristic approaches targeting specific classes of viruses. Some of the most successful modern techniques attempting to solve this problem fall into the general paradigms of signature detection and integrity checking; see, e.g., E. Skoudis, “MALWARE: Fighting Malicious Code”, Prentice Hall (2004), and P. Szor, “The Art of Computer Virus Research and Defense”, Addison Wesley (2005). The former paradigm requires discovering pieces of infected code, called signatures, for known viruses, storing the signatures, and developing software that scans the computer memory to search for such signatures. The latter paradigm, on which this invention focuses, requires using cryptographic hash functions that detect unauthorized changes to a file, and potentially reveal the presence of unknown viruses. An important example of the success of the latter paradigm is Tripwire, a widely available integrity checking program for the UNIX environment.
Intrusion detection principles of signature and anomaly detection, as discussed, for example, in G. Di Crescenzo, A. Ghosh, and R. Talpade, “Towards a Theory of Intrusion Detection”, also supply insight into virus detection methodology. The signature virus detection paradigm is similar to the signature detection principle in the intrusion detection area; the integrity checking paradigm, by contrast, is more similar to the anomaly detection principle in the intrusion detection area.
Available antivirus software typically uses three main techniques for detecting viruses: signatures, heuristics, and integrity verification. The signature technique is similar to the signature detection approach in intrusion detection systems. First, known viruses are studied and their signatures are stored; then candidate executable files are searched for occurrences of these signatures. Although this is the most popular approach for virus detection, it relies on quick updates of the signature database by vendors and of the signature files by users, and it is easily defeated by polymorphic and metamorphic virus techniques.
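The signature technique described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory signature database, `SIGNATURES`, whose byte patterns are invented for illustration and correspond to no real virus; the function name `scan_for_signatures` is likewise an assumption and not part of any existing antivirus product.

```python
from typing import Dict, List, Tuple

# Hypothetical signature database: name -> byte pattern. These patterns are
# invented for illustration and do not correspond to any real virus.
SIGNATURES: Dict[str, bytes] = {
    "example-a": b"\xde\xad\xbe\xef",
    "example-b": b"EVIL_PAYLOAD",
}


def scan_for_signatures(
    data: bytes, signatures: Dict[str, bytes] = SIGNATURES
) -> List[Tuple[str, int]]:
    """Return (signature name, byte offset) for every occurrence found in data."""
    hits: List[Tuple[str, int]] = []
    for name, pattern in signatures.items():
        start = data.find(pattern)
        while start != -1:
            hits.append((name, start))
            # Continue searching one byte further to catch later occurrences.
            start = data.find(pattern, start + 1)
    # Report hits in the order they appear in the scanned data.
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    infected = b"abc" + b"\xde\xad\xbe\xef" + b"xyz" + b"EVIL_PAYLOAD"
    print(scan_for_signatures(infected))
    # A trivially re-encoded copy no longer contains the stored patterns.
    mutated = bytes(b ^ 0x01 for b in infected)
    print(scan_for_signatures(mutated))
```

Because the scan matches exact byte patterns, any re-encoding of the payload, as in the `mutated` example, escapes detection. This is consistent with the weakness against polymorphic and metamorphic techniques noted above.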
The other two techniques, heuristics and integrity verification, are more similar to the anomaly detection approach in intrusion detection systems. Heuristic techniques may be somewhat sophisticated in that they attempt to identify viruses based on behaviors that they are likely to exhibit, such as attempts to write into executable files, to access boot sectors, to delete hard drive contents, etc. Integrity verification techniques try to detect unexpected modifications to files after the infection has happened, but potentially before the infected file is executed, thus still rendering the infection harmless.
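Integrity verification of this kind may be sketched as follows. The sketch assumes SHA-256 as the cryptographic hash function and represents files as an in-memory mapping from name to contents; the helper names `build_baseline` and `changed_files` are hypothetical, and the sketch is not intended to reproduce the operation of Tripwire or of any other particular program.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Cryptographic hash of a file's contents (SHA-256 is assumed here)."""
    return hashlib.sha256(data).hexdigest()


def build_baseline(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record one digest per file while the files are known to be clean."""
    return {name: digest(content) for name, content in files.items()}


def changed_files(files: Dict[str, bytes], baseline: Dict[str, str]) -> List[str]:
    """Return names of files whose current digest differs from the baseline.

    Files absent from the baseline are also reported, since no trusted
    digest exists for them.
    """
    return sorted(
        name
        for name, content in files.items()
        if baseline.get(name) != digest(content)
    )


if __name__ == "__main__":
    files = {"a.bin": b"AAAA", "b.bin": b"BBBB"}
    baseline = build_baseline(files)
    files["b.bin"] += b"call evil"  # simulated unauthorized modification
    print(changed_files(files, baseline))
```

The check reports which file changed, but a single digest per file carries no information about where within the file the change occurred.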
While both heuristics and integrity verification techniques have the potential of catching more intelligent viruses, such as those equipped with polymorphism and metamorphism capabilities, the techniques are at most able to raise an alert on a particular file. That file must later be carefully emulated and analyzed in the virus diagnosis phase, under a controlled environment, where a conclusion about the location, nature and consequences of the potential virus needs to be derived. Due to the difficulty of realizing an accurate controlled environment for emulation, the derived consequences may not be trustworthy. Moreover, in many cases, the modification made by the virus to the original file is minimal, e.g., a subroutine call to a program located somewhere else in memory, and therefore it would be very helpful to have additional information about the virus itself.
Further, the integrity verification technique or integrity checking principle only detects changes to the file, but does not localize or indicate where, within the file, the changes occur. Absent localization information about the virus, its detection is very resource-expensive and failure-prone. This implicitly defines a new problem in the area of software security, “virus localization”.
The problem of virus localization has never been rigorously investigated or even posed before, as far as the inventors know. Applying cryptographic hashing to the data is a well-known paradigm for data integrity verification, and is fundamental for programs that verify the integrity of file systems, like Tripwire. Cryptographic hashing of all atomic blocks of a file is also a known paradigm, and has been used in programs that remotely update files over a high-latency, low-bandwidth link, or that address write-once archival data storage. However, none of these programs solves the virus localization problem.
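For illustration only, the known paradigm of hashing all atomic blocks of a file may be sketched as below. The block size `BLOCK_SIZE`, the use of SHA-256, and the function names `block_digests` and `differing_blocks` are assumptions made for this sketch; it represents neither the approach of the present invention nor the internals of any program referred to above.

```python
import hashlib
from typing import List

BLOCK_SIZE = 16  # hypothetical atomic block size, in bytes


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each fixed-size block of the data separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def differing_blocks(old: List[str], new: List[str]) -> List[int]:
    """Indices of blocks whose digests differ, including blocks present in only one list."""
    count = max(len(old), len(new))
    return [
        i
        for i in range(count)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]


if __name__ == "__main__":
    original = bytes(range(64))
    baseline = block_digests(original)

    overwritten = bytearray(original)
    overwritten[20] ^= 0xFF  # overwrite one byte inside block 1
    print(differing_blocks(baseline, block_digests(bytes(overwritten))))

    inserted = original[:20] + b"X" + original[20:]  # insert one byte in block 1
    print(differing_blocks(baseline, block_digests(inserted)))
```

In this sketch one digest is stored per block, so the stored data grows linearly with the number of blocks. An overwrite changes only the digest of the affected block, whereas an insertion shifts all subsequent data, so every later block also appears modified.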