The present invention relates to a method of, and system for, heuristically detecting malware in macros and executable scripts, by checking for the presence of encoded strings.
Viruses and other forms of malware (malicious software) pose an ever-increasing problem for computer users and users of the internet. Early viruses tended to be spread as binary executables which would execute on computers of a known type. A traditional measure against virus infections is anti-virus file scanning in which a file, treated as a succession of bytes, is scanned looking for byte patterns that have been identified as characteristic signatures of known malware. This virus scanning may take place on files stored on a local or network disk drive or files which are in transit on a network, with the scanning taking place as they pass by a particular node on the network or pass through a network gateway such as an email gateway. Since the use of the internet for web access, email and other purposes has become widespread, and the software to be found on a typical user's machine has become more sophisticated, the opportunity has arisen for virus writers to create viruses which are distributed in essentially source code forms. These so-called script- or macro viruses rely on the user's computer having software on it which will act as an execution environment for a program, that is the virus, which arrives at the computer e.g. as an email attachment.
Features added to an operating system to increase the ease of use by non-technical users offer opportunities for virus writers to exploit. For example, Microsoft Windows has a facility whereby files having a certain file extension (that is, the character(s) following the final “.” in the file name) are associated with a particular application program such that the act of the user “opening” the file in Windows' graphical user interface causes Windows to activate the associated program and load the file in question. This has provided the basis for recent script viruses which work by attaching a file containing the source code to an email which is then distributed to users; the file is given the extension necessary to activate the script host program on a recipient's computer. When the recipient opens the attachment, the host is activated and the file is executed. There have been a number of virus outbreaks in which the virus is spread as a script for the Visual Basic Scripting host on Windows machines.
Similarly, a number of end-user applications such as word processors and spreadsheets incorporate a “macro” facility for enabling the user to automate repetitive or difficult tasks. The macro “language” involved may be of differing degrees of sophistication, with some, such as those found in Microsoft Office products, being very similar to parallel scripting languages. A minor difference is that scripts tend to be stored, on disk and elsewhere, purely in source code form, whereas macro files may include binary data as well as or instead of textual source code. For example, this binary data may include a “tokenised” version of the macro's source code, or actual executable machine code.
In addition to Visual Basic Scripting and Microsoft Office macros, there are several computer languages where the source code is available in the executable file. This might be because the language is interpreted without being compiled, for example, Perl. Virus writers who use such media often try to make their creations hard to detect. They may do this by writing code that is hard to understand. They may also write self-modifying code, so that with each generation of the code, the code subtly changes. This is done in order to make it hard for anti-virus vendors to create signatures that detect the virus in all its variations. They may also hide virus code in comments or strings. These are then read in, decrypted, and acted upon.
Signature-based scanning is ineffective against script viruses and source-code-only macro viruses, because their contents do not correspond one-for-one with machine instructions. The same programmatic action may be expressed in source code in many different ways, and source code may be transformed in various ways, for example by changing the usage of “whitespace” (i.e., space, tab and newline characters) or by substituting variable names, each of which alters the contents of the source code without altering its effect.
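By way of illustration only, the following minimal Python sketch shows how a fixed byte signature can fail against a trivially transformed script. The signature, the two sample scripts and the function name signature_match are hypothetical and chosen purely for this example; they are not taken from any real scanner or virus.

```python
# Hypothetical byte signature extracted from one known sample of a script.
SIGNATURE = b'Set fso = CreateObject("Scripting.FileSystemObject")'

# The sample from which the signature was taken.
original = (
    b'Set fso = CreateObject("Scripting.FileSystemObject")\n'
    b'fso.CopyFile src, dst\n'
)

# Same effect, different bytes: the variable is renamed and the
# whitespace around "=" is changed.
variant = (
    b'Set   q7 =\tCreateObject("Scripting.FileSystemObject")\n'
    b'q7.CopyFile src, dst\n'
)


def signature_match(data: bytes, signature: bytes) -> bool:
    """Return True if the byte signature occurs anywhere in the data."""
    return signature in data


print(signature_match(original, SIGNATURE))  # True
print(signature_match(variant, SIGNATURE))   # False
```

The two scripts would behave identically when run, yet only the first contains the byte pattern, which is the weakness described above.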
The present invention is based on an appreciation of the fact that, as regards script- and macro-viruses, in some of these cases, it is possible to heuristically detect the virus by frequency analysis of character counts in various parts of the program.
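As an informal illustration of why character counts can be revealing, the following Python sketch compares an ordinary sentence with a hex-encoded copy of the same sentence. The function name char_class_fractions, the four character classes and the sample text are assumptions made for this example only.

```python
from collections import Counter


def char_class_fractions(text: str) -> dict:
    """Return the fraction of letters, digits, whitespace and other characters."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts["letter"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch.isspace():
            counts["space"] += 1
        else:
            counts["other"] += 1
    total = len(text) or 1  # avoid division by zero for empty input
    return {k: counts[k] / total for k in ("letter", "digit", "space", "other")}


ordinary = "Copy the report to the shared folder and notify the owner"
encoded = ordinary.encode("ascii").hex()  # the same text, hex-encoded

print(char_class_fractions(ordinary))
print(char_class_fractions(encoded))
```

In this example the ordinary sentence contains no digits and a noticeable share of spaces, whereas the encoded form contains no spaces and consists of at least half digits, so the two distributions are easy to tell apart.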
According to the present invention there is provided a system for scanning for malware a computer file containing source code of a computer program in a given computer language comprising:
means for separating the source code into groups of constituent parts corresponding to different structural parts of the program;
means for processing each part to count the number of occurrences in that part of characters of a character set to obtain a frequency distribution of characters in that part;
means for comparing the character frequency distribution of each part with an expected range of frequency distributions; and
means for flagging the file as suspect or not depending on the result of one or more comparisons by the comparing means.
The invention also provides a method for scanning for malware a computer file containing source code of a computer program in a given computer language comprising:
separating the source code into groups of constituent parts corresponding to different structural parts of the program;
processing each part to count the number of occurrences in that part of characters of a character set to obtain a frequency distribution of characters in that part;
comparing the character frequency distribution of each part with an expected range of frequency distributions; and
flagging the file as suspect or not depending on the result of one or more of the comparisons.
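The four steps of the method can be sketched, under stated assumptions, in Python as follows. This is not the implementation of the invention. It assumes a VBScript-like syntax in which comments start with an apostrophe and strings are delimited by double quotes, it treats strings, comments and remaining code as the structural parts, and it uses the share of digit characters as the only comparison. The names separate, distribution, out_of_range and scan, the threshold values in EXPECTED_DIGIT_SHARE and the value of MIN_LENGTH are all hypothetical.

```python
import re
from collections import Counter

# Assumed syntax: "..." strings (with "" as an escaped quote) and ' comments.
TOKEN = re.compile(r'''(?P<string>"(?:[^"\n]|"")*")|(?P<comment>'[^\n]*)''')

DIGITS = set("0123456789")
# Hypothetical expected range (low, high) for the share of digits in each part.
EXPECTED_DIGIT_SHARE = {
    "string": (0.0, 0.3),
    "comment": (0.0, 0.3),
    "code": (0.0, 0.3),
}
MIN_LENGTH = 20  # parts shorter than this are too small to judge


def separate(source: str) -> dict:
    """Step 1: split the source into strings, comments and remaining code."""
    parts = {"string": [], "comment": [], "code": []}
    pos = 0
    for m in TOKEN.finditer(source):
        parts["code"].append(source[pos:m.start()])
        parts[m.lastgroup].append(m.group())
        pos = m.end()
    parts["code"].append(source[pos:])
    return {name: "".join(chunks) for name, chunks in parts.items()}


def distribution(part: str) -> dict:
    """Step 2: relative frequency of each character in one part."""
    counts = Counter(part)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def out_of_range(name: str, dist: dict) -> bool:
    """Step 3: compare the digit share of a part with its expected range."""
    low, high = EXPECTED_DIGIT_SHARE[name]
    share = sum(freq for ch, freq in dist.items() if ch in DIGITS)
    return not (low <= share <= high)


def scan(source: str) -> list:
    """Step 4: return the names of suspect parts; non-empty means flagged."""
    parts = separate(source)
    return [
        name
        for name, text in parts.items()
        if len(text) >= MIN_LENGTH and out_of_range(name, distribution(text))
    ]


benign = "' Greets the user when the script starts\nMsgBox \"Hello and welcome back\"\n"
hex_text = "Hello and welcome back".encode("ascii").hex()
suspicious = "' Updates the status line\npayload = \"" + hex_text + "\"\nx = Decode(payload)\n"

print(scan(benign))      # []
print(scan(suspicious))  # ['string']
```

A practical scanner would presumably use ranges derived from a corpus of known-clean files and compare more than one character group; the single digit-share test here is only meant to make the sequence of steps concrete.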