The present invention relates to the management of files in a distributed system. More particularly, this invention relates to efficiently managing the transmission of files in a distributed system.
In many distributed systems, units of digital data that require processing such as files, queries, or other requests for service (hereinafter "files") pass through a number of nodes during processing. These nodes are typically processes running on different machines (e.g., servers, workstations, desktop computers, laptops, pervasive systems, etc.) in the network. However, nodes can be logical processes, some of which run in the same machine. Each node may perform some, all or none of the processing required for a given file, and if further processing is required, it may pass the file along to another node. When a file is fully processed, typically the result is routed back toward the originator of the query, and/or some other recipient(s). Each node contains a certain amount of data that is used in processing files.
Congestion occurs in a distributed system when one or more nodes or communication links between nodes become too busy to handle the traffic destined for them. Either a node cannot process files as fast as they arrive, or a communication link cannot transmit files from one node to another fast enough to prevent queues from building up. When part of a system is congested, there will be files which are awaiting processing and/or transmission. Therefore, in a system in which there is a sequence or hierarchy of nodes through which a file is directed to flow, the earlier in the sequence of nodes a file can be fully processed and the results determined, the fewer nodes the file will have to pass through during processing, and the less severe congestion will be.
One particular distributed system in which congestion can be a severe problem is the computer virus "immune system" as described in Kephart et al., "Fighting Computer Viruses," Scientific American, November 1997, hereby incorporated by reference. In an "immune system," personal computers (PCs) are connected by a network to a central computer which analyzes viruses. Each PC incorporates a monitoring program which uses a variety of heuristics to infer that a virus may be present. The PCs, upon discovery of a suspect file, send a copy of the file to the central computer for analysis. After manual analysis by operators, the central computer eventually is instructed to transmit a prescription for verifying and removing the virus (assuming one is found). The prescription may be sent to one or more of the PCs as an update to be applied to databases maintained on those machines.
In a system in which a substantial number of PCs are connected to the network, suspect files may become queued up waiting for transmission to the central computer (or to a next higher level in a multi-level distributed system). For example, in distributed system 100 of FIG. 1, the units of digital data passed from node to node are files which are suspected of containing viruses, Trojan horses, worms or other types of malicious code. The nodes (which can be data processing systems such as disclosed in FIG. 1a of U.S. Pat. No. 5,440,723 (hereinafter "'723 patent"), hereby incorporated by reference, or processes running therewithin) are organized in a hierarchy, such that suspect files found on one or more client machines 110 are first passed to one or more administrator machines 120, at which limited processing takes place, then passed to one or more gateway machines 130, and finally passed to a central analysis center 140, if necessary. Just as more than one gateway machine can be (and are preferably) utilized in this system, several analysis centers can be utilized for this function, if need be. Furthermore, the nodes can be logical so that a file determined to be suspect by a human user or by a process within a client computer 110 can be forwarded to another process (node) within the client computer 110. Indeed, there can be multiple logical nodes within any components of the distributed computer system 100, including the analysis center 140. In such systems, during the active spreading of a new fast-spreading virus, many suspect file copies containing the same virus code are likely to be submitted by the client machines 110 in a small period of time. This would cause serious congestion throughout the system 100.
Simple caching utilized in present systems will not be sufficient to prevent congestion, because any given client machine 110 or administrator machine 120 is unlikely to see a wide enough variety of files. That is, caching only the result of prior analysis of its own submitted files will not prevent enough congestion. Only by being apprised of the results of files submitted by others, and of general results (i.e. new virus definitions that apply to a given virus in any host file), will the analysis center be sufficiently shielded from redundant requests.
As indicated hereinabove, immune systems provide for updates to local databases which are utilized to eliminate the need to forward files or requests up the system hierarchy. However, these updates are initiated only after manual analysis or processing by a human operator at the analysis center. This is insufficient in an environment in which a substantial number of files or requests are generated, sometimes within short periods of time. Furthermore, present systems have no method of managing the inevitable backlog of queued files or requests that must be forwarded up the hierarchy to another node.
Finally, current methods of speeding up components such as Web browsing are not designed to handle the sudden massive congestion that can be caused by a piece of replicating malicious code.
Therefore, there is a need for a system and method to efficiently manage the transmission of files or other units of digital data up the hierarchy from node to node so as to reduce the number of redundant files that are transmitted through the system.
Specifically, there is a need for a system and method for filtering or eliminating the necessity for further transmission of a file to another node by utilizing information which is updated by automatic processing at local or remote nodes.
Furthermore, there is a need for a system and method for prioritizing the files which are not filtered for transmission to other nodes, including identifying the order of transmission of these files and a need for a system and method for updating the data necessary to manage these decisions in a manual or automatic manner, as required.
The present invention is a system and method for increasing the efficiency of distributed systems and reducing congestion, by using the results of processing at a node to update the data used in processing at that and/or other nodes sometime in the future. Specifically, the present invention provides, in a network-connected distributed system including two or more nodes through which digital data flow, one or more of the nodes adapted to process the digital data, a method for efficiently managing the transmission of units of digital data from node to node, the method including the steps of: receiving, at one of the one or more nodes, one or more units of digital data first transmitted by an originating node; filtering out sufficiently processed units of the digital data based on filtering information; transmitting, to the originating node and/or other nodes, filtered results relating to the sufficiently processed units; queuing, for processing at other nodes, unfiltered units of the digital data which are not filtered out; and updating the filtering information according to results of automatic processing performed in and received from the one of the one or more nodes and/or other nodes in the system.
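The steps recited above (receiving, filtering, transmitting results, queuing, and updating) can be illustrated with a minimal Python sketch. The `Node` class, its attribute names, and the use of a plain dictionary as the filtering information are assumptions made for illustration only; the specification does not prescribe any particular data structure.

```python
from collections import deque


class Node:
    """One node in the hierarchy (hypothetical structure, for illustration)."""

    def __init__(self):
        self.filtering_info = {}   # assumed form: unit -> known result
        self.queue = deque()       # unfiltered units awaiting other nodes
        self.replies = []          # (originator, unit, result) to send back

    def receive(self, originator, unit):
        """Filter an incoming unit: reply at once if known, else queue it."""
        result = self.filtering_info.get(unit)
        if result is not None:
            # Sufficiently processed: return the result toward the originator.
            self.replies.append((originator, unit, result))
        else:
            # Not filtered out: hold for processing at another node.
            self.queue.append((originator, unit))

    def update(self, new_results):
        """Merge results of automatic processing reported by any node."""
        self.filtering_info.update(new_results)


node = Node()
node.receive("client-1", "sample-a")   # unknown, so it is queued
node.update({"sample-a": "infected: use definition 17"})
node.receive("client-2", "sample-a")   # now known, so it is filtered out
```

In this sketch the second submission of the same unit never reaches the queue, which is the congestion-reducing effect the method is directed to.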
In one embodiment, the distributed system can include nodes for the reporting and analysis of incorrect or buggy software, the units of digital data can include files, and the transmitting step can include the step of returning updated information on bugs and fixes to the originating node and/or to other nodes.
In another embodiment, it is preferable that the distributed system includes a system for the analysis of complex geographically-based data such as satellite images, the units of digital data include requests for information about a particular geographical area, and the transmitting step includes the step of returning updated information on areas which have already been analyzed in response to prior queries to the originating node and/or to other nodes.
In yet another embodiment, the distributed system includes a system for the computation of integrals, and the units of digital data include queries of formulae to be integrated.
The units of digital data can include queries or files. In one embodiment where this is the case, the distributed system includes a computer protection system, the units of digital data include files and/or checksums of files which are suspected to contain malicious code and the transmitting step includes the step of returning updated protection information to the originating node and/or to other nodes. The malicious code can include computer viruses, worms or Trojan Horses.
Preferably, the filtering step includes the steps of determining whether a file is identical to a known non-malicious file, and identifying the file as sufficiently processed in response to the determining step. The updating step preferably includes the steps of receiving, from other nodes in the system, modification detection codes of files that have been determined to be non-malicious, and adding the modification detection codes to the filtering information.
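As one possible illustration of filtering against known non-malicious files, the following sketch uses SHA-256 from Python's standard `hashlib` module as the modification detection code. The choice of SHA-256 and the `CleanFileFilter` class are assumptions; the specification does not name a particular code or structure.

```python
import hashlib


def mdc(data: bytes) -> str:
    """Modification detection code of a file (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class CleanFileFilter:
    """Hypothetical filter holding codes of files found to be non-malicious."""

    def __init__(self):
        self.known_clean = set()

    def add_codes(self, codes):
        """Updating step: add codes received from other nodes."""
        self.known_clean.update(codes)

    def is_sufficiently_processed(self, data: bytes) -> bool:
        """Filtering step: a file matching a known clean code needs no forwarding."""
        return mdc(data) in self.known_clean
```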
The filtering step can also include the steps of determining whether a file cannot contain malicious code because it does not contain any code at all, and identifying the file as sufficiently processed in response to the determining step.
The filtering step can optionally include the steps of determining whether a file cannot contain malicious code because it does not contain enough code to constitute even the smallest anticipated virus; and identifying the file as sufficiently processed in response to the determining step.
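The two size-based checks described above might be sketched as follows. The threshold `MIN_VIRUS_SIZE` is a hypothetical placeholder value, and `code_size` is assumed to be supplied by some format-specific parser that measures how many bytes of code a file contains; neither is given in the specification.

```python
# Hypothetical threshold in bytes; the smallest anticipated virus size
# would in practice come from the protection system's own data.
MIN_VIRUS_SIZE = 64


def cannot_contain_malicious_code(code_size: int,
                                  min_virus_size: int = MIN_VIRUS_SIZE) -> bool:
    """Return True if a file can be filtered out on size grounds alone.

    code_size is the number of bytes of code found in the file
    (assumed to be measured elsewhere).
    """
    if code_size == 0:
        # No code at all, so the file cannot carry malicious code.
        return True
    # Too little code to hold even the smallest anticipated virus.
    return code_size < min_virus_size
```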
Finally, the filtering step may include the steps of determining whether a file contains known malicious code that is correctly handled by an existing protection definition, and identifying the file as sufficiently processed in response to the determining step. In this case, the updating step preferably includes the steps of receiving, from other nodes, protection definitions for malicious code that has been analyzed, and adding the definitions to the filtering information.
In all cases, it is preferable that the updating step includes the step of re-executing the filtering step to apply the updated filtering information to the queued units of the digital data.
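Re-executing the filtering step over already queued units can be sketched as below. The function name `refilter` and the dictionary form of the filtering information are assumptions for illustration.

```python
from collections import deque


def refilter(queue, filtering_info):
    """Split queued units into those now resolved and those still pending.

    filtering_info is assumed to map a unit to its known result.
    Returns (resolved, pending), where resolved is a list of (unit, result)
    pairs and pending is a deque that preserves the original queue order.
    """
    resolved = []
    pending = deque()
    for unit in queue:
        if unit in filtering_info:
            resolved.append((unit, filtering_info[unit]))
        else:
            pending.append(unit)
    return resolved, pending
```

Applied after each update, this removes from the backlog any unit whose result has become known in the meantime, so that it need not be transmitted further.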
The units of digital data can also include queries including a database version of the originating node and a request for an updated version, if available, wherein the filtering step includes the step of determining whether the one of the one or more nodes has a more recent database version and wherein the updating step includes the step of updating originating filtering information of the originating node and/or other nodes of the system that are likely to have older versions. The units of digital data can include queries including a database version of the originating node and a request for an updated version, if available, and the updating step can include the step of updating the originating filtering information of the originating node and/or other nodes of the system that are likely to have older versions. The database version preferably corresponds to the filtering information.
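A database version query of this kind might be handled as in the following sketch. Integer version numbers, the response dictionary, and the "forward" status for queries the node cannot answer are all assumptions; the specification states only that the node determines whether it holds a more recent version.

```python
def handle_version_query(node_version: int, node_database: dict,
                         client_version: int) -> dict:
    """Answer a query carrying the originating node's database version.

    If this node holds a more recent database, the query is sufficiently
    processed here and the newer database is returned. Otherwise the query
    is marked for forwarding to another node (an assumed convention).
    """
    if node_version > client_version:
        return {"status": "update",
                "version": node_version,
                "database": node_database}
    return {"status": "forward", "version": client_version}
```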
In another embodiment, the distributed system preferably includes a computer protection system, the units of digital data include samples of undesirable textual messages and the transmitting step includes the step of returning updated protection information to the originating node and/or to other nodes.
Another embodiment includes, in a network-connected distributed computer protection system including a plurality of nodes through which digital data flow, one or more of the nodes adapted to process the digital data, a method for efficiently managing the transmission of suspect files from node to node, the method including the steps of receiving, at one of the one or more nodes, a checksum of a suspect file transmitted by an originating node; if a checksum match is found based on filtering information, identifying the suspect file as sufficiently processed; else causing the receiving, at the one or more nodes, of the suspect file; filtering out sufficiently processed units of the digital data based on the filtering information; transmitting, to the originating node and/or other nodes, filtered results relating to the sufficiently processed files; queuing, for processing at other nodes, unfiltered files which are not filtered out; and updating the filtering information according to results of automatic processing performed in and received from the one of the one or more nodes and/or other nodes in the system.
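The checksum-first exchange in this embodiment can be sketched as follows. The class `ReceivingNode`, the function `submit`, and the use of SHA-256 as the checksum are hypothetical; the sketch also omits the later filtering and updating steps to stay short.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of a suspect file (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class ReceivingNode:
    """Hypothetical node that is offered a checksum before the file itself."""

    def __init__(self, known_results):
        self.known_results = dict(known_results)  # checksum -> result
        self.queue = []                           # full files awaiting other nodes

    def offer_checksum(self, digest):
        """Return the known result, or None to request the full file."""
        return self.known_results.get(digest)

    def submit_file(self, data: bytes):
        """Accept the full suspect file and queue it for further processing."""
        self.queue.append(data)


def submit(node: ReceivingNode, data: bytes) -> str:
    """Originating-node side: send the checksum first, the file only if needed."""
    result = node.offer_checksum(checksum(data))
    if result is not None:
        return result          # match found: the file itself is never sent
    node.submit_file(data)     # no match: transmit the whole file
    return "queued"
```

Sending the checksum first means that a file already known to the receiving node costs only a few bytes of traffic rather than a full transfer.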
Another aspect of the invention includes a system for efficiently managing the transmission of units of digital data from node to node in a distributed network comprising a plurality of nodes, at least one of the nodes including a filter adapted to filter out sufficiently processed units of the digital data based on filtering information; the filtering information being updatable according to results of automatic processing performed in and received from one of the plurality of nodes in the system.
Finally, another embodiment of the present invention includes, in a network-connected distributed system including a plurality of nodes through which digital data flow, one or more of the nodes adapted to process the digital data, a method for efficiently managing the transmission of units of digital data from node to node, the method including the steps of receiving, at one of the one or more nodes, units of digital data first transmitted by an originating node; filtering out sufficiently processed units of the digital data based on filtering information; transmitting, to the originating node and/or other nodes, filtered results relating to the sufficiently processed units; queuing, for processing at other nodes, unfiltered units of the digital data which are not filtered out; prioritizing the unfiltered units of digital data for transmission to a next node based on prioritizing information; and updating the filtering information and the prioritizing information according to results of automatic processing performed in and received from the one of the one or more nodes and/or other nodes in the system. The updating step optionally comprises the step of re-executing the filtering step and/or the prioritizing step to apply the updated filtering and prioritizing information to the queued units of the digital data. The units of digital data can comprise queries or files.
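The prioritizing step, together with its re-execution after an update, might be sketched with a heap-based queue as below. The class `PriorityOutbox` and the idea of expressing prioritizing information as a scoring function are assumptions; the specification does not say here what the prioritizing information contains.

```python
import heapq
import itertools


class PriorityOutbox:
    """Hypothetical queue of unfiltered units ordered for transmission."""

    def __init__(self, score):
        self._score = score              # assumed form: unit -> numeric priority
        self._heap = []
        self._seq = itertools.count()    # tie-breaker keeps arrival order

    def put(self, unit):
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-self._score(unit), next(self._seq), unit))

    def reprioritize(self, score):
        """Re-execute prioritizing with updated prioritizing information."""
        self._score = score
        self._heap = [(-score(unit), seq, unit) for _, seq, unit in self._heap]
        heapq.heapify(self._heap)

    def pop(self):
        """Return the unit that should be transmitted next."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

For example, a score could be the number of client machines that have reported a given suspect file, so that the most widely seen sample is transmitted first; that particular choice is only an illustration.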
Also, a system for efficiently managing the transmission of units of digital data from node to node in a distributed network includes a plurality of nodes, at least one of the nodes including a filter adapted to filter out sufficiently processed units of the digital data based on filtering information, the filtering information being updatable according to results of automatic processing performed in and received from one of the plurality of nodes in the system; and a prioritizer adapted to prioritize units of the digital data queued for transmission to another node based on prioritizing information, the prioritizing information being updatable according to results of processing performed in and received from one of the plurality of nodes in the system.