1. Field of the Invention
This invention relates generally to computer virus detection, and, more specifically, to a system and method for detecting computer viruses that span multiple data streams.
2. Description of the Prior Art
Computer viruses attach themselves to data streams. Examples of data streams include executable program files, data files, and boot sectors, such as those on floppy and hard disks. A virus replicates when the data stream to which it is attached is accessed, allowing the virus to infect additional data streams. Such infections can severely damage a computer system. Consequently, many virus detection programs have been developed to detect viruses.
To detect viruses, known virus detection technologies develop a signature for each known virus. Each data stream to be scanned for viruses is then examined to determine whether the data stream contains a known virus signature. If a virus signature is found in a data stream, the virus detection technology will conclude that the virus corresponding to the found signature exists in that data stream.
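The signature-matching approach described above can be sketched roughly as follows. This is only an illustration, assuming that a signature is a fixed byte sequence and that a data stream is available as a byte string. The table `SIGNATURES`, the function `scan_stream`, the virus names, and the byte values are all hypothetical and are not taken from any actual detection product.

```python
from typing import Dict, List

# Hypothetical signature table: virus name -> byte pattern (illustrative values only).
SIGNATURES: Dict[str, bytes] = {
    "VirusX": b"\xde\xad\xbe\xef",
    "VirusY": b"\x90\x90\xcd\x21",
}


def scan_stream(stream: bytes) -> List[str]:
    """Return the names of all known viruses whose signature occurs in the stream."""
    return sorted(name for name, sig in SIGNATURES.items() if sig in stream)


clean = b"ordinary file contents"
infected = b"header" + b"\xde\xad\xbe\xef" + b"trailer"
print(scan_stream(clean))     # []
print(scan_stream(infected))  # ['VirusX']
```

Note that the scanner is given exactly one data stream and reports a virus only if the whole signature lies within that stream.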
Traditionally, the entire body of a virus has been wholly contained within one data stream. As a result, current virus detection technologies scan each data stream for an entire virus signature and examine each data stream for viruses independently of other data streams.
Recently developed operating systems and file formats include data entities that are a collection of data streams. Examples of these entities are Apple Macintosh files that consist of data and resource forks, Microsoft Windows executable files that consist of multiple code and data sections, and Microsoft Word or Excel documents that are stored as a collection of data streams in an OLE 2 compound storage file. Although physically these examples are single files, logically they comprise two or more data streams.
Since these various entities can comprise two or more data streams, viruses have been developed that have components spread out over several data streams. Consequently, known virus detection techniques, which examine each data stream independently of the other data streams, will often fail to correctly identify a virus whose components span multiple data streams.
For instance, known virus detection technologies will often fail to correctly identify Microsoft Word macro viruses. Macro viruses are composed of one or more macros, with each macro residing in a separate data stream.
Consider a macro virus consisting of two macros. One of the macros, call it A, includes the code that performs the replication. The other macro, call it P1, is the payload of the virus. If the virus A/P1 is the only virus that includes either macro A or macro P1, then the virus A/P1 can be detected and identified by detecting the signature for either macro A or macro P1 alone. In this situation, where A/P1 can be uniquely identified by one of its components, the signature for A/P1 can simply be the signature for that component. Current virus detection methods may then be sufficient, assuming the whole of the component is found in a single data stream.
Next, consider a new macro virus that also comprises macro A but has a different payload macro, call it P2. This situation may arise when a virus writer copies an existing virus to save effort and then modifies its payload. Now, simply detecting the signature for macro A is insufficient because that signature does not differentiate the two viruses. However, it would be sufficient to have signatures just for the payloads, P1 and P2, since each virus has a different payload. In other words, if one signature is developed for macro P1 and another for macro P2, the two viruses can be detected and differentiated from one another. In this case, known virus detection methods may also be sufficient.
If a virus writer then creates virus B/P1, which combines a different macro, call it B, with the payload P1, viruses A/P1, A/P2, and B/P1 will all exist. Now, neither the signature for macro A nor the signature for macro P1 alone uniquely identifies virus A/P1, because macro A also appears in A/P2 and macro P1 also appears in B/P1. The virus can be uniquely identified only by the combination of the signatures for macros A and P1. However, since macro A may exist in a different data stream than macro P1, known virus detection systems that operate only in the context of a single data stream are insufficient to uniquely identify virus A/P1.
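The ambiguity among A/P1, A/P2, and B/P1 can be made concrete with a small sketch. It assumes each virus is modeled simply as the set of macro names it contains. The names `VIRUSES`, `candidates`, and `identify` are hypothetical helpers introduced here for illustration only.

```python
from typing import Dict, List, Set

# Component macros of each virus, as named in the text.
VIRUSES: Dict[str, Set[str]] = {
    "A/P1": {"A", "P1"},
    "A/P2": {"A", "P2"},
    "B/P1": {"B", "P1"},
}


def candidates(component: str) -> List[str]:
    """Viruses containing the given macro; more than one result means the macro alone is ambiguous."""
    return sorted(name for name, parts in VIRUSES.items() if component in parts)


def identify(found: Set[str]) -> List[str]:
    """Viruses all of whose component macros are among those found."""
    return sorted(name for name, parts in VIRUSES.items() if parts <= found)


print(candidates("A"))        # ['A/P1', 'A/P2']
print(candidates("P1"))       # ['A/P1', 'B/P1']
print(identify({"A", "P1"}))  # ['A/P1']
```

Each single macro matches two viruses, while the pair of macros A and P1 matches only A/P1.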
To further illustrate the point, assume the virus writer decides to combine viruses A/P1 and A/P2 so that this newly created virus, call it A/P1/P2, comprises macro A and the two payloads P1 and P2. None of A/P1, A/P2, and A/P1/P2 can be uniquely identified by only one of their components. They can be identified only by a combination of components, and therefore, if the components of each virus are spread out over multiple data streams, it is not possible to detect these viruses by known detection means that scan data streams independently of each other. Thus, it is desirable to have a virus detection technology that can detect and correctly identify viruses whose components span multiple data streams.
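The limitation of scanning data streams independently can be illustrated with the sketch below. It assumes that each macro resides in its own data stream and is recognized by a fixed byte marker, and it contrasts a per-stream scan with a naive scan that pools the macros found across all streams of one file. The markers, the tables `MACRO_SIGNATURES` and `VIRUSES`, and the function names are hypothetical; the pooled scan is shown only to make the contrast visible and is not presented as any particular detection method.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical macro signatures (byte markers are illustrative only).
MACRO_SIGNATURES: Dict[str, bytes] = {"A": b"<A>", "P1": b"<P1>", "P2": b"<P2>"}

# Each virus is keyed by the exact set of macros it comprises.
VIRUSES: Dict[FrozenSet[str], str] = {
    frozenset({"A", "P1"}): "A/P1",
    frozenset({"A", "P2"}): "A/P2",
    frozenset({"A", "P1", "P2"}): "A/P1/P2",
}


def macros_in(stream: bytes) -> FrozenSet[str]:
    """Names of the macros whose signature occurs in one data stream."""
    return frozenset(n for n, sig in MACRO_SIGNATURES.items() if sig in stream)


def scan_independently(streams: List[bytes]) -> List[Optional[str]]:
    """Judge every stream on its own, with no knowledge of the other streams."""
    return [VIRUSES.get(macros_in(s)) for s in streams]


def scan_together(streams: List[bytes]) -> Optional[str]:
    """Pool the macros found in all streams, then look up the exact combination."""
    found = frozenset().union(*(macros_in(s) for s in streams))
    return VIRUSES.get(found)


# One document whose three macros sit in three separate data streams.
document = [b"..<A>..", b"..<P1>..", b"..<P2>.."]
print(scan_independently(document))  # [None, None, None]
print(scan_together(document))       # A/P1/P2
print(scan_together(document[:2]))   # A/P1
```

The lookup uses the exact set of macros rather than a subset test, since the components of A/P1 and A/P2 are both subsets of those of A/P1/P2.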