The present invention relates in general to scanning for viruses and other malware, and, in particular, to a system and method for performing partitioned scanning of a distributed dataset for viruses and other malware in a distributed computing environment.
Information networks interconnecting a wide range of computational resources have become a mainstay of corporate enterprise computing environments. In general, most such environments consist of several host computer systems interconnected internally over an intranetwork to which individual workstations and network resources are connected. These intranetworks, also known as local area networks (LANs), make legacy databases and information resources widely available for access and utilization throughout the corporation and provide a means for retrieving, reading, and posting news messages. These same corporate resources can also be interconnected to wide area networks (WANs), including public information internetworks, such as the Internet, to enable internal users access to remote computational resources, such as the World Wide Web and Usenet Newsgroups, and to allow outside users to select corporate resources for the purpose of completing limited transactions or data transfer.
Most internetworks and intranetworks are based on a layered network model in which a stack of standardized protocol layers cooperatively exchange information between various systems. In particular, the Transmission Control Protocol/Internet Protocol (TCP/IP) suite, such as described in W. R. Stevens, “TCP/IP Illustrated,” Vol. 1, Ch. 1 et seq., Addison-Wesley (1994), the disclosure of which is incorporated herein by reference, is the most widely adopted network model. Computer systems and network devices employing the TCP/IP suite implement a protocol stack, which includes a hierarchically structured set of protocol layers beginning with the link protocol layer and proceeding upwards to the network, transport, and application protocol layers. Each protocol layer performs a set of pre-defined functions as specified by the official TCP/IP standards set forth in applicable Requests for Comment (RFC).
TCP/IP computing environments in particular make a wide range of content and services available, including electronic mail, network news, and Web pages. Network news within the TCP/IP environment is popularly referred to as “InterNet News” or simply “Usenet,” shorthand for the Usenet news system. The Usenet continues to be an area of sustained growth. Historically, the Usenet began as a set of mailing lists containing textual news messages sent to a group of subscribers. However, the Usenet now consists of over fifty thousand newsgroups, most of which receive a tremendous volume of news messages daily. Moreover, news messages now can contain non-textual content, such as raw or encoded binary data, and are potentially much larger in size than traditional textual news messages. In light of the sheer numbers of newsgroups and subscribers and individual news message sizes, centralized news servers have replaced the original mailing lists as an efficient approach to storing and retrieving messages for an anonymous audience.
The widespread usage of the Usenet has also been matched by an increased, albeit minority, presence of unauthorized content. Like electronic mail, news messages are an efficient and powerful medium for exchanging information that is widely available, easy to use, relatively fast, and flexible. These same advantages make news messages an attractive vehicle with which to introduce unauthorized content that includes computer viruses, Trojan horses, hoaxes, “Spam” mail, and other forms of “malware.” Unauthorized content can be introduced, often surreptitiously, into a news message body, as an attachment, or even as Web content.
Even more than electronic mail, news messages are widely distributed to a multitude of computing environments, some of which may not be equipped with virus scanners. Moreover, news message services present a particularly strong potential for widespread computer virus infection. The most efficient method of combating malware is to scan every news message before it is disseminated to individual users.
One prior art approach to scanning stored Usenet messages on a centralized news server is the Virus Patrol program, used by Network Associates, Inc., Santa Clara, Calif., as further described below with reference to FIG. 2. However, this approach is computationally constrained by a single process executing on a single server system. These inherent limitations prevent the program from scaling to meet the requirements of scanning a continually growing database with significant news message traffic for viruses and other malware. At some point, the sheer number of newsgroups and messages exceeds the capabilities of the program to scan and keep up with the message traffic.
Therefore, there is a need for a solution providing a distributed system for scanning a large dataset, including a news database. Such a solution would scale to provide the processing and bandwidth throughput required of a continually growing dataset. Preferably, the solution would provide concurrent processing with low bandwidth synchronization and high availability through a centralized database.
The present invention provides a system and method for concurrently scanning a large dataset for computer viruses and other forms of malware. The dataset is organized into a set of distributed databases each containing a plurality of groups storing individual data items. The data items are each uniquely identified by an identifier and can be included in a plurality of the groups. A plurality of malware servers cooperatively scan the groups for viruses and malware by using a commonly shared centralized database for tracking and synchronization. Scanned data items are tracked using a message identifier table and a last read table, both maintained within the centralized database. The scanning of multiple part data items is synchronized using a threads table within the centralized database. Thus, the concurrent malware scanners can divide up the groups for concurrent processing in a highly scalable manner.
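The role of the threads table can be illustrated with a minimal Python sketch. This summary does not specify the table's layout, so the class `ThreadsTable`, the method `add_part`, the thread key, and the part numbering from 1 to `total` are all assumptions made for illustration only. The sketch merely shows one plausible way that concurrent scanners could record the parts of a multiple part data item in a shared structure and learn when the item is complete and ready to be scanned as a whole.

```python
import threading


class ThreadsTable:
    """Hypothetical in-memory stand-in for the threads table of the centralized database."""

    def __init__(self):
        self._lock = threading.Lock()
        self._parts = {}  # thread key -> {part number: payload}

    def add_part(self, key, part_no, total, payload):
        """Record one part of a multiple part data item.

        Returns the reassembled item once all `total` parts (assumed to be
        numbered 1..total) have been recorded, otherwise None.
        """
        with self._lock:
            parts = self._parts.setdefault(key, {})
            parts[part_no] = payload
            if len(parts) < total:
                return None
            # All parts are present: remove the entry and hand back the whole item.
            del self._parts[key]
            return b"".join(parts[i] for i in range(1, total + 1))
```

In this sketch, whichever scanner records the final part receives the reassembled item and would be the one to scan it; parts may arrive in any order and from any scanner.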
An embodiment of the present invention is a system and a method for performing partitioned scanning of a dataset for malware in a distributed computing environment. A dataset is maintained in a plurality of structured databases in the distributed computing environment. Each database stores a plurality of data item groups which each include a plurality of individual data items. Each such data item is uniquely identified within the dataset by a data item identifier. A set of indices is stored in a centralized database. The set of indices includes a list of scanned data item identifiers for each data item within the dataset scanned for malware and a list of last entry numbers for each data item group stored in each database. Each last entry number corresponds to one such data item within the data item group last scanned for malware. A plurality of malware scanners are executed in substantial concurrency. For each malware scanner, one such database and each such data item group within the selected database having data items not appearing in the list of last entry numbers are selected. Each such data item having a data item identifier not appearing in the list of scanned data item identifiers is obtained. Each such obtained data item is scanned for malware.
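The selection logic described above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the claimed implementation: the centralized database is replaced by an in-memory object guarded by a lock, each malware scanner is a thread assigned one database, and the names `CentralIndex`, `claim`, `scan_database`, `run_scanners`, and the placeholder detector `looks_infected` are all hypothetical.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class CentralIndex:
    """Hypothetical stand-in for the set of indices in the centralized database."""
    scanned_ids: set = field(default_factory=set)   # scanned data item identifiers
    last_entry: dict = field(default_factory=dict)  # (database, group) -> last entry number
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def claim(self, item_id):
        """Atomically record item_id as scanned; True only for the first caller."""
        with self._lock:
            if item_id in self.scanned_ids:
                return False
            self.scanned_ids.add(item_id)
            return True

    def get_last(self, db, group):
        with self._lock:
            return self.last_entry.get((db, group), 0)

    def advance(self, db, group, entry_no):
        with self._lock:
            key = (db, group)
            self.last_entry[key] = max(self.last_entry.get(key, 0), entry_no)


def looks_infected(payload):
    # Placeholder detector (assumption): a real scanner would use a virus engine.
    return b"EICAR" in payload


def scan_database(db_name, groups, index, infected):
    """One scanner: visit each group, skipping entries at or below the last entry number."""
    for group, entries in groups.items():
        last = index.get_last(db_name, group)
        for entry_no, item_id, payload in entries:
            if entry_no <= last:
                continue
            # Items posted to several groups or databases are scanned only once.
            if index.claim(item_id) and looks_infected(payload):
                infected.append(item_id)
            index.advance(db_name, group, entry_no)


def run_scanners(databases, index):
    """Run one scanner thread per database and return the infected identifiers."""
    infected = []
    threads = [
        threading.Thread(target=scan_database, args=(name, groups, index, infected))
        for name, groups in databases.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(infected)
```

As a simplification, the sketch records an identifier as scanned before the scan itself completes, so a scanner that failed midway would leave that item unscanned; a production design would need to handle that case.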
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments of the invention by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.