This disclosure relates to content analysis and inspection.
Data inspection engines, such as virus scanners and spyware scanners, operate on content, such as files and file streams. When a security system is required to inspect content items for many users, such as thousands or tens of thousands of users, these data inspection engines are typically used in one of two modes of operation: (i) buffering each entire content item to be examined and submitting the content item to a data inspection engine for security processing; or (ii) submitting part of the content item in a content stream for inspection and maintaining a scanning state for each content stream. The first mode of operation requires significant memory resources and significantly increases response time, particularly when the content item is a large file. The second mode requires that the multiple streams for a content item be associated with scanning states, which often results in the maintenance of multiple states and multiple streams.
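By way of illustration only, the two modes of operation can be sketched as follows. This is a minimal sketch under stated assumptions, not the behavior of any particular data inspection engine: the byte pattern `b"EVIL"`, the class `StreamScanState`, and the functions `scan_buffered` and `scan_chunk` are hypothetical names introduced for this example, and a simple substring search stands in for real security processing.

```python
# Hypothetical pattern standing in for a real malware signature.
PATTERN = b"EVIL"


def scan_buffered(chunks, pattern=PATTERN):
    """Mode (i): buffer the entire content item, then inspect it once."""
    content = b"".join(chunks)  # the whole item is held in memory
    return pattern in content


class StreamScanState:
    """Mode (ii): scanning state maintained for one content stream."""

    def __init__(self, pattern=PATTERN):
        self.pattern = pattern
        self.tail = b""      # trailing bytes that may begin a match
        self.found = False

    def feed(self, chunk):
        window = self.tail + chunk
        if self.pattern in window:
            self.found = True
        # Keep only enough bytes to detect a match spanning two chunks.
        keep = len(self.pattern) - 1
        self.tail = window[-keep:] if keep > 0 else b""


# One state object must be kept per stream (hypothetical registry).
stream_states = {}


def scan_chunk(stream_id, chunk):
    state = stream_states.setdefault(stream_id, StreamScanState())
    state.feed(chunk)
    return state.found
```

In this sketch the buffered mode holds the full item in memory before any result is available, while the streaming mode needs an entry in `stream_states` for every open stream, which reflects the state-maintenance burden described above.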
To avoid executing a data inspection process each time a content item is encountered, a signature (e.g., a checksum) of the content item can be checked to determine whether it matches a signature of a previously scanned content item. If the signatures match, inspection of the content item can be skipped, which improves the response time for processing some content items. However, the content item must be fully buffered before its signature can be computed, which increases both memory usage and response time.
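By way of illustration only, the signature check can be sketched as follows. This is a minimal sketch under stated assumptions: SHA-256 is used as the signature, the dictionary `verdict_cache` and the functions `inspect` and `check_item` are hypothetical names, and `inspect` is a placeholder for a real data inspection engine.

```python
import hashlib

# Hypothetical cache mapping a content signature to a prior scan verdict.
verdict_cache = {}


def inspect(content):
    """Placeholder for a data inspection engine; True means clean."""
    return b"EVIL" not in content


def check_item(content):
    """Return (is_clean, was_scanned) for a fully buffered content item."""
    # The entire item must be in memory before the signature is known.
    signature = hashlib.sha256(content).hexdigest()
    if signature in verdict_cache:
        return verdict_cache[signature], False  # inspection skipped
    verdict = inspect(content)
    verdict_cache[signature] = verdict
    return verdict, True
```

Note that `check_item` takes the complete `content` as its argument, so the sketch exhibits the drawback described above: no signature can be computed until the item has been fully received.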
Winnowing is another process that can be used to avoid executing a data inspection process each time a content item is encountered. Winnowing divides the content item into a number of fixed-length segments. For each fixed-length segment, a signature of the segment is computed and compared to a previously computed signature for the corresponding segment of a previously scanned content item. If all the signatures match, then content inspection can be avoided. Although the winnowing method can be applied to partial content (i.e., before the content item has been completely received), the amount of signature computation required is large enough to significantly increase user-visible response time. In addition, winnowing identifies patterns in the file stream that may be common both to files having malware and to files not having malware. This may produce false positives for a given file stream, resulting in unnecessary content inspection.
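By way of illustration only, the segment-by-segment comparison described above can be sketched as follows. This is a minimal sketch of the fixed-length segmentation as this paragraph describes it, under stated assumptions: SHA-256 is used as the segment signature, and the segment size of 4 bytes and the names `segment_signatures`, `matches_so_far`, and `can_skip_inspection` are hypothetical choices made for this example.

```python
import hashlib
from typing import List

SEGMENT_SIZE = 4  # hypothetical; deliberately tiny for demonstration


def segment_signatures(content: bytes, size: int = SEGMENT_SIZE) -> List[str]:
    """Compute one signature per fixed-length segment of the content."""
    return [
        hashlib.sha256(content[i:i + size]).hexdigest()
        for i in range(0, len(content), size)
    ]


def matches_so_far(received: bytes, known: List[str],
                   size: int = SEGMENT_SIZE) -> bool:
    """Compare the complete segments of partially received content."""
    full = len(received) // size  # ignore a trailing partial segment
    current = segment_signatures(received[:full * size], size)
    return current == known[:full]


def can_skip_inspection(content: bytes, known: List[str],
                        size: int = SEGMENT_SIZE) -> bool:
    """Inspection can be avoided only if every segment signature matches."""
    return segment_signatures(content, size) == known
```

In this sketch one signature is computed for every segment of every content item, which reflects the computation cost noted above. A match reported by `matches_so_far` on partial content is only provisional, since a later segment may still differ.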