Malicious content such as viruses infects files and exploits vulnerabilities in the programs that interpret the infected files in order to propagate. For example, a virus program may be sent to a user as an attachment to an e-mail message. When the user opens the attachment with an e-mail program, the virus is triggered and uses the e-mail system to propagate to other computer systems within the network. A virus may also erase data or otherwise interfere with the desired operation of a computer system or network.
Malicious content such as viruses is typically detected by means of signature files. A detection program (e.g., an antivirus program) is an application program that uses one or more signature files to determine whether malicious content is present in any specified files. Signature files contain instructions and/or information that the detection program can use when analyzing a file for the presence of malicious content. Detection programs can employ various detection techniques, including scanning files for a pattern, decompressing code, executing the file in a virtual machine, and the like. For example, one technique can involve scanning a file for a pattern that includes a string of characters, binary computer code, data embedded within a virus, or the like. If infected files are identified by the detection program, a repair application program may be used to repair the infected files.
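The pattern-scanning technique described above can be illustrated with a minimal sketch. The signature names and byte patterns below are hypothetical placeholders invented for illustration, not real signatures, and the function `scan_bytes` is an assumed name rather than part of any actual detection program.

```python
from typing import Dict, List, Tuple

# Hypothetical signature set: signature name -> byte pattern.
# These values are illustrative placeholders, not real virus signatures.
SIGNATURES: Dict[str, bytes] = {
    "demo-sig-a": b"\xde\xad\xbe\xef",
    "demo-sig-b": b"EVIL_MARKER",
}


def scan_bytes(data: bytes,
               signatures: Dict[str, bytes] = SIGNATURES) -> List[Tuple[str, int]]:
    """Return (signature name, offset) for every pattern occurrence in data."""
    hits: List[Tuple[str, int]] = []
    for name, pattern in signatures.items():
        offset = data.find(pattern)
        while offset != -1:
            hits.append((name, offset))
            # Continue searching just past the start of the previous match.
            offset = data.find(pattern, offset + 1)
    # Report hits in the order they appear in the data.
    return sorted(hits, key=lambda hit: hit[1])


sample = b"abcEVIL_MARKERxyz\xde\xad\xbe\xef"
print(scan_bytes(sample))
```

A simple substring search is used here only for clarity; an actual detection program may combine pattern matching with the other techniques mentioned above.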
Many modern detection techniques do not require reading all the bytes of a file. Not needing to read files in their entirety can result in significant time savings when very large files are being processed by a detection program. Today's detection algorithms may examine only portions of a given file for evidence of malicious content. If an initial investigation indicates that further examination is warranted, the entire file may be evaluated.
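As a rough sketch of this two-stage approach, the example below reads only a small leading portion of a file and escalates to a full read when that portion looks suspicious. The function name `two_stage_scan`, the probe size, and the example byte patterns are all assumptions made for illustration; real detection algorithms choose which portions to examine in their own ways.

```python
import io
from typing import BinaryIO, Tuple


def two_stage_scan(stream: BinaryIO,
                   suspicious_prefix: bytes,
                   full_pattern: bytes,
                   probe_size: int = 64) -> Tuple[bool, int]:
    """Return (infected, bytes_read) for a seekable binary stream."""
    stream.seek(0)
    head = stream.read(probe_size)
    bytes_read = len(head)
    # Stage 1: examine only the first probe_size bytes.
    if suspicious_prefix not in head:
        return False, bytes_read
    # Stage 2: the probe warrants further examination, so read the rest.
    rest = stream.read()
    bytes_read += len(rest)
    return full_pattern in head + rest, bytes_read


# Hypothetical data: a clean file, and a file whose head triggers a full read.
clean = io.BytesIO(b"A" * 1000)
suspect = io.BytesIO(b"MZ" + b"A" * 100 + b"PAYLOAD")

print(two_stage_scan(clean, b"MZ", b"PAYLOAD"))
print(two_stage_scan(suspect, b"MZ", b"PAYLOAD"))
```

For the clean file only the probe is read, which is where the time savings for very large files would come from.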
Most computer systems in which detection programs operate use an underlying file system. The file system provides a “layer” of software in the computer system to manage storage space for the files. This layer sits between the operating system (which communicates directly with devices) on the computer system hosting the file system and an application program that uses the data in the files. Typically, a detection program calls a read interface provided by the file system to read the files in preparation for performing a search. The detection program provides the name of the file(s) to read, and the file system determines the physical locations on the device(s) storing the files, reads the data from those physical locations, and presents the assembled files to the detection program. The detection program typically then searches the files provided by the file system on a file-by-file basis.
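The read path described above can be modeled with a toy example. The class `ToyFileSystem`, its fields, and the block numbers are hypothetical and greatly simplified; they stand in for a real file system only to show the sequence of resolving a name to physical locations, reading those locations, and handing the assembled file to the detection program.

```python
from typing import Dict, List


class ToyFileSystem:
    """Hypothetical stand-in for a file system's read interface."""

    def __init__(self, device: Dict[int, bytes],
                 file_table: Dict[str, List[int]]) -> None:
        self.device = device          # physical block number -> block contents
        self.file_table = file_table  # file name -> ordered block numbers

    def read(self, name: str) -> bytes:
        # Resolve the name to physical locations, read each location,
        # and assemble a copy of the file for the caller.
        return b"".join(self.device[block] for block in self.file_table[name])


def scan_file_by_file(fs: ToyFileSystem, names: List[str],
                      pattern: bytes) -> List[str]:
    """Return the names of files whose assembled contents contain pattern."""
    return [name for name in names if pattern in fs.read(name)]


# The blocks of "a.txt" are stored non-contiguously (7, 2, 9).
device = {7: b"hello ", 2: b"EVIL", 9: b" world", 4: b"clean data"}
fs = ToyFileSystem(device, {"a.txt": [7, 2, 9], "b.txt": [4]})

# The pattern spans two blocks, so it is only visible after assembly.
print(scan_file_by_file(fs, ["a.txt", "b.txt"], b"o EVIL"))
```

In this sketch the detection program never sees block numbers; it supplies names and receives whole files, which is the division of labor described above.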
Using a file system to scan files provides some advantages. For example, data for a given file may be stored in several non-contiguous storage locations on a storage device. The file system handles assembly of the files from the data in these non-contiguous storage locations and provides a copy of the files to the detection program. However, this service comes at a cost, as overhead introduced by the file system in constructing files can significantly affect the time and resources required to perform a search. In addition, the file system may itself be compromised by malicious content and thus may be unavailable or unreliable for use in the detection of malicious content.
Additional overhead is incurred when a file system is used to read files and those files share data blocks. Many file systems make some use of a technique called “single-instancing” whereby data blocks (or even entire files) having identical contents are stored only once. Although the file system may provide the appearance that many different files or data blocks just happen to have identical contents, only one copy is actually stored. Single-instance storage can be used, for example, when backup copies of primary production data are made periodically to “freeze” images of the data at given points in time. These backup copies can be used to recover from failure of a computer system, storage device, or network. To save storage space, data that is the same in the primary production data and in the backup copy is often stored only once, along with information that will enable the primary data and/or the backup copy to be reconstructed in the event of failure or corruption of the data. Single-instance storage is also used when several users share the same storage volume. Many of the users may maintain personal copies of the same file. Whenever this situation arises, the file system can use single-instancing to store only a single copy of the file, which is shared among the users.
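A minimal sketch of single-instancing at the block level follows. It assumes that identical blocks are recognized by hashing their contents with SHA-256, which is only one possible approach; the class `SingleInstanceStore`, the tiny block size, and the file names are hypothetical and chosen to keep the example short.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # unrealistically small, chosen so the example is easy to follow


class SingleInstanceStore:
    """Hypothetical store that keeps each distinct block only once."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}     # digest -> stored block
        self.files: Dict[str, List[str]] = {}  # file name -> ordered digests

    def write(self, name: str, data: bytes) -> None:
        digests: List[str] = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if an identical one is not already present.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name: str) -> bytes:
        # Each file still appears complete to the reader.
        return b"".join(self.blocks[digest] for digest in self.files[name])


store = SingleInstanceStore()
store.write("alice/report", b"AAAABBBBCCCC")
store.write("bob/report", b"AAAABBBBCCCC")  # personal copy of the same file
store.write("backup/report", b"AAAABBBBDDDD")  # differs in its last block only
print(len(store.blocks), "blocks stored for", len(store.files), "files")
```

The three files together reference nine blocks, yet only the distinct blocks occupy storage, while each file still reads back in full.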
Unfortunately, when an application such as a detection application needs to process a set of files that make use of single-instancing, the file system typically treats the information in the shared storage locations as part of each file that includes the data stored in those shared storage locations. As a result, resources are needlessly expended to read the shared storage locations once for each individual file that includes the data stored therein. For file systems managing very large files, this duplicative effort can be very time-consuming and adds overhead to searching the files.
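The duplicated effort can be made concrete by counting reads in a toy model. The names `CountingDevice`, `read_all_files`, and `distinct_blocks`, and the block layout, are hypothetical; the sketch only compares how many block reads a file-by-file pass issues against how many distinct blocks actually exist.

```python
from typing import Dict, List


class CountingDevice:
    """Hypothetical storage device that counts every block read."""

    def __init__(self, blocks: Dict[int, bytes]) -> None:
        self.blocks = blocks
        self.reads = 0

    def read_block(self, number: int) -> bytes:
        self.reads += 1
        return self.blocks[number]


def read_all_files(device: CountingDevice,
                   file_table: Dict[str, List[int]]) -> Dict[str, bytes]:
    """Assemble every file through the device, one file at a time."""
    return {
        name: b"".join(device.read_block(number) for number in numbers)
        for name, numbers in file_table.items()
    }


def distinct_blocks(file_table: Dict[str, List[int]]) -> int:
    """Count the blocks that are physically stored for these files."""
    return len({number for numbers in file_table.values() for number in numbers})


device = CountingDevice({0: b"AAAA", 1: b"BBBB", 2: b"CCCC", 3: b"DDDD"})
# Three single-instanced files sharing blocks 0 and 1 (and 2 for two of them).
file_table = {"user1": [0, 1, 2], "user2": [0, 1, 2], "backup": [0, 1, 3]}

read_all_files(device, file_table)
print("block reads issued:", device.reads)
print("distinct blocks stored:", distinct_blocks(file_table))
```

The gap between the two numbers is the redundant work; with very large files and many sharers, that gap grows accordingly.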
A solution is needed to enable malicious content detection to be performed quickly and efficiently, with a minimum amount of duplicate effort. Preferably, such a solution can take advantage of existing storage management tools but avoid unnecessary overhead to analyze whether malicious content is present.