Global networking of computers has greatly affected business. As the number of computers linked to networks grows, businesses increasingly rely on networks to interact. More and more people use electronic mail, websites, various file transfer methods, and remote office applications, among other types of software, to facilitate business transactions and perform job-related tasks. Networks such as the Internet transmit data packets using long-standing addressing technologies and flow control protocols. Historically, these protocols were designed for use on trusted networks and as such include few security features. To address this problem, newer protocols are designed to include some security measures. However, at present, the global Internet and many local area networks predominantly use older protocols with various vulnerabilities.
Hackers and other malicious actors take advantage of the weaknesses in these protocols to disrupt, infiltrate, or destroy networked devices. In some cases, attackers exploit the trusting relationships between computers to infiltrate a network and spread computer instructions referred to as a virus. Viruses infect files and exploit vulnerabilities in the programs that interpret those files in order to propagate. For example, a virus program may be sent to a user as an attachment to an e-mail message. When the user opens the attachment with an e-mail program, the virus is triggered and uses the e-mail system to propagate to other computer systems within the network. A virus may also function to erase data.
Once a virus infiltrates a network, it typically spreads rapidly, infecting a large number of files and disrupting business operations. In such situations, the time required to discover and repair infected files is often critical. Viruses are typically detected by searching for a virus “signature,” which is a pattern of data indicating that a file has been infected. A virus detection program is an application program that searches data on the network to determine whether a virus signature appears in the data. Once infected files are identified, a virus repair application program may be used to repair them.
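The signature search described above can be sketched as follows. This is a minimal illustration, not a production scanner: the signature name and byte pattern are hypothetical, and the chunked read with an overlap window simply ensures a signature spanning a chunk boundary is still matched.

```python
# Hypothetical signature database: signature name -> byte pattern.
SIGNATURES = {
    "example-virus": b"\xde\xad\xbe\xef\x56\x49\x52",
}

def scan_file(path, signatures=SIGNATURES, chunk_size=1 << 20):
    """Return the names of any signatures found in the file at `path`."""
    found = set()
    # Carry the last few bytes of each chunk forward so a signature that
    # straddles a chunk boundary is not missed.
    overlap = max(len(sig) for sig in signatures.values()) - 1
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            for name, sig in signatures.items():
                if sig in window:
                    found.add(name)
            tail = window[-overlap:] if overlap else b""
    return found
```

A real virus detection program would match many signatures simultaneously (for example, with a multi-pattern algorithm such as Aho–Corasick), but the structure, reading file data and testing it against known patterns, is the same.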
Most computer systems in which virus detection programs operate use an underlying file system. The file system provides a “layer” of software in the computer system that manages storage space for the files. This layer sits between the operating system (which communicates directly with devices) on the computer system hosting the file system and the application programs that use the data in the files. Typically, a searching application, such as the virus detection program described above, calls a read interface provided by the file system to read the files to be searched. The searching application provides the name of each file to read; the file system determines the physical locations on the device(s) storing that file, reads the data from those locations, and presents the assembled file to the searching application. The searching application then performs its search on a file-by-file basis.
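The file-by-file approach can be illustrated with a short sketch. Here the searching application asks the file system (via the ordinary directory-walk and file-read interfaces) for each file in turn; the file system resolves names to physical blocks behind the scenes, and the application sees only assembled file contents. The signature byte string is an arbitrary placeholder.

```python
import os

def scan_tree(root, signature):
    """File-by-file search: open each file through the file system's read
    interface and test its assembled contents for `signature`."""
    infected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # The file system maps `path` to physical storage locations
            # and returns the assembled file data.
            with open(path, "rb") as f:
                if signature in f.read():
                    infected.append(path)
    return infected
```

Note that the application never sees where the data physically resides; that opacity is precisely what produces the overhead discussed next.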
Searching data on a file-by-file basis provides some advantages. For example, data for a given file may be stored in several non-contiguous storage locations on a storage device. The file system handles assembly of the files from the data in these non-contiguous storage locations and provides a copy of the files to the searching application. However, this service comes at a cost, as overhead introduced by the file system in constructing files can significantly affect the time and resources required to perform a search.
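The assembly work the file system performs on every read can be modeled with a toy example. Here `device` stands in for a raw storage device, and the extent list (a hypothetical simplification of real file-system metadata) maps the file's logical order to (physical offset, length) runs scattered across the device.

```python
def assemble_file(device: bytes, extents):
    """Concatenate a file's extents, in logical order, into one byte string,
    as a file system does when serving a read of a fragmented file."""
    return b"".join(device[offset:offset + length] for offset, length in extents)

# The file's two extents are stored non-contiguously, and out of logical order.
device = b"....WORLD...HELLO ..."
extents = [(12, 6), (4, 5)]       # "HELLO " at offset 12, then "WORLD" at offset 4
assemble_file(device, extents)    # b"HELLO WORLD"
```

Every search through the file interface pays for this gathering and copying, which is the cost referred to above.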
Additional overhead is incurred when files share data blocks. For example, backup copies of primary production data may be made periodically to “freeze” images of the data at given points in time. These backup copies can be used to recover from failure of a computer system, storage device, or network. To save storage space, often data that is the same in the primary production data and in the backup copy is stored only once, along with information that will enable the primary data and/or the backup copy to be reconstructed in the event of failure or corruption of the data. Unfortunately, when constructing files as described above, file systems typically do not recognize shared storage locations. Instead, the file system treats the shared storage locations as part of each file, thereby requiring resources to read the shared storage locations once for each file. For file systems managing very large files, this duplicate effort can be very time-consuming and adds overhead to searching of the files.
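The duplicate effort caused by shared storage locations can be quantified with a toy block map. In this hypothetical model each file is a list of physical block IDs, and a backup copy shares its unmodified blocks with the primary data; a file-by-file scan reads a shared block once per file, while a block-aware scan would read each physical block only once.

```python
def blocks_read_file_by_file(block_maps):
    """Total block reads when each file is scanned independently:
    shared blocks are read once per file that references them."""
    return sum(len(blocks) for blocks in block_maps.values())

def blocks_read_block_aware(block_maps):
    """Total block reads when each physical block is read only once,
    regardless of how many files share it."""
    unique = set()
    for blocks in block_maps.values():
        unique.update(blocks)
    return len(unique)

block_maps = {
    "primary.db":  [1, 2, 3, 4],
    "snapshot.db": [1, 2, 3, 9],   # shares blocks 1-3 with the primary data
}
```

In this small example the file-by-file scan performs 8 block reads where 5 would suffice; for large files that are mostly shared with their backup copies, the wasted reads dominate the cost of the search.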
A solution is needed to enable file content searches to be performed quickly and efficiently, with a minimum amount of duplicate effort. Preferably, the solution can take advantage of existing storage management tools but avoid unnecessary overhead to perform the search.