Widespread usage of the Internet has led to more frequent occurrences of destructive computer “viruses” that often cause extensive computer network damage or downtime. A “virus” is a piece of programming code, usually disguised as something else, that causes some unexpected and usually undesirable event (for the victim). Viruses are often designed so that they automatically spread to other computer users across network connections. For instance, viruses can be transmitted by sending them as attachments to an e-mail note, by downloading infected programming from other sites, and by inserting into a computer a diskette or CD-ROM containing a virus. The source application that deals with the e-mail note, downloaded file, or diskette is often unaware of the virus. Some viruses take effect as soon as their code is executed; other viruses lie dormant until circumstances cause their code to be executed by the computer. Some viruses can be quite harmful, causing a hard disk to require reformatting or clogging networks with unnecessary traffic.
With the rapid development of distributed computing technology, more and more interpreted programming languages, such as scripting languages, are designed to satisfy the requirements of heterogeneous computing and development environments. Some examples of scripting languages are JavaScript, VBScript, Perl and the UNIX shell. Although run-time performance of scripting languages is typically poor, these languages offer good cross-platform support and are widely used in applications. Some scripting languages, such as JavaScript and VBScript, are specially designed for Internet applications, are used in web publishing, and are supported by most web browsers. Some web browsers not only support these scripting languages, but also extend them with additional features. A significant enhancement, or feature, is the ability to access local resources, including local applications and local files. These enhancements also introduce the potential for security breaches in which malicious code could obtain unauthorized access to local resources. Currently more than one hundred viruses utilize this feature to propagate and to damage or destroy the host system or related resources. Often it takes only a matter of days or even hours for a virus to spread worldwide. Thus, efficient virus detection and identification stems the spread of a virus by enabling the viral pattern to be identified and communicated to others to aid its detection and removal prior to infection of further computer systems.
Anti-virus (or “anti-viral”) software is a class of programs that search computer code, such as that found on a computer's hard drive and floppy disks, for any known or potential viruses. The market for this kind of program has expanded because of Internet growth and the increasing use of the Internet by businesses concerned about protecting their computer assets. However, as anti-virus technology has improved, virus technology has improved as well.
FIG. 1 is a table illustrating a general overview of the evolution of virus forms. Earlier, first-generation viral forms were generally written in assembly language and distributed in binary code form. The host platform on which the virus was run was usually a particular central processing unit (CPU) or a particular operating system (OS). Typically, the host object in which the virus resided was executable code, and the virus was propagated through physical media, for example, floppy disks.
More recent, second-generation viral forms are written in an interpreted language, such as a scripting language, and are distributed in source code form. The host platform is usually a particular computer application that can run on many CPUs or operating systems. The host object is typically a document, such as an application document or an e-mail, and the virus is propagated through networks, such as the Internet or intranets.
FIG. 2 is a representative block diagram in the prior art showing an overview of a conventional system 100 for virus detection and identification. Typically, scripting source code 104 is extracted from a text file 102 as input to a virus scan engine 106. There, the code 104 is compared against identified virus patterns in a pattern file 108, often from a pattern file database, until a matching virus pattern is found. An output message 110 is then generated that presents the results of the virus scan, such as the identity of any virus found. In cases where it is known that the source code 104 contains a virus but no virus pattern can be matched, the source code 104 is typically used to form a new virus pattern for input to the pattern file 108.
Current virus scanning technology in the scan engine 106 is based on byte code matching algorithms in which code pattern strings in the code 104 are matched exactly against the patterns in an identified virus pattern file. The pattern file typically contains the most important code pattern strings of a virus code pattern. This pattern matching process enables the detection of exact matches of the virus code, e.g., unmodified scripting virus code, but is ineffective with polymorphous scripting viruses.
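By way of illustration only, the exact-match scanning described above might be sketched as follows. This is a minimal sketch, not the implementation of any particular scan engine: the names `PATTERN_FILE`, `scan_exact`, and `output_message`, the pattern name, and the pattern strings themselves are all hypothetical, and matching is assumed to be a simple verbatim substring test.

```python
from typing import Dict, List

# Hypothetical pattern file: each pattern name maps to the code pattern
# strings that must ALL appear verbatim in the scanned source code.
PATTERN_FILE: Dict[str, List[str]] = {
    "Example.Script.A": ["fso.CopyFile(self, target)", "mail.Send(target)"],
}


def scan_exact(source_code: str, patterns: Dict[str, List[str]]) -> List[str]:
    """Return the names of patterns whose strings all occur verbatim."""
    matched = []
    for name, strings in patterns.items():
        # Assumption: a pattern matches only if every string is an exact substring.
        if all(s in source_code for s in strings):
            matched.append(name)
    return matched


def output_message(matched: List[str]) -> str:
    """Format the scan result as a short message."""
    if matched:
        return "Virus pattern matched: " + ", ".join(matched)
    return "No known virus pattern matched"


# Hypothetical scripting source code extracted from a text file.
SAMPLE = "fso.CopyFile(self, target)\nmail.Send(target)\n"

print(output_message(scan_exact(SAMPLE, PATTERN_FILE)))
# Renaming a single identifier is enough to defeat the exact match.
print(output_message(scan_exact(SAMPLE.replace("self", "me"), PATTERN_FILE)))
```

In this sketch, a single renamed identifier causes the scan to report no match, which is the weakness of exact pattern matching noted above.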
Polymorphs of scripting viruses are versions of an original scripting virus created by making what are often small changes to the original virus form; that is, the code is not exactly the same as that of the original virus but still has the same effects. Polymorphs are easy to create and can be developed by individuals and/or by using polymorph engines.
Polymorph engines are generally computer programs that are capable of making lexical and grammatical transformations of the code, for example, manipulation of white space, renaming of identifiers, and/or changing the program layout. Typically, polymorph engines, although prolific, cannot reliably change the execution order of statements.
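The lexical transformations mentioned above can be illustrated with a toy example. This is a sketch under stated assumptions only: the helper names `rename_identifiers` and `pad_whitespace`, the sample code, and the identifier mapping are hypothetical, and the transformations operate on raw text without parsing string literals or comments, which an actual engine would need to handle.

```python
import re
from typing import Dict


def rename_identifiers(code: str, mapping: Dict[str, str]) -> str:
    """Replace whole-word identifiers according to the given mapping."""
    if not mapping:
        return code
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)


def pad_whitespace(code: str) -> str:
    """Double every space character (toy white-space manipulation)."""
    return code.replace(" ", "  ")


# Hypothetical, harmless sample code standing in for an original script.
ORIGINAL = "var counter = 0;\ncounter = counter + step;\n"
# Hypothetical pattern string taken from the original code.
SIGNATURE = "counter = counter + step"

variant = pad_whitespace(rename_identifiers(ORIGINAL, {"counter": "q7", "step": "z2"}))

print(variant)
print("signature in original:", SIGNATURE in ORIGINAL)
print("signature in variant:", SIGNATURE in variant)
# The statement order is untouched: both versions have the same number of lines.
print("same line count:", len(variant.splitlines()) == len(ORIGINAL.splitlines()))
```

The variant keeps the same statements in the same order, yet no longer contains the original pattern string, so a verbatim comparison against that string fails.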
A polymorph of a virus that involves rearrangement of the execution order is more typically created by an individual. Currently, lexical transformations are the more typical polymorph form. Unfortunately, because polymorphs can be produced continually by making small modifications to known viral code, their proliferation can quickly outpace the detection and identification efforts of current virus scanning methods that utilize exact pattern matching.
Accordingly, what is needed in the field is a method and/or apparatus for detecting and identifying a scripting virus pattern so that both the virus and its polymorphs may be detected and identified.