1. Field of the Invention
The present invention relates generally to document searching. In particular, the present invention is directed to identifying occurrences of a large number of search patterns in a large number of documents.
2. Description of Background Art
Multi-pattern searching is useful in many different applications, including data mining, editing, and security. One particular application that requires multi-pattern searching is the identification of known code fragments in a set of data files. This is useful, for example, to determine whether and to what extent open-source code has been integrated into an organization's proprietary code.
In “A Fast Algorithm for Multi-Pattern Searching” (S. Wu and U. Manber, Report TR-94-17, Department of Computer Science, University of Arizona, 1994), incorporated by reference herein in its entirety, Wu and Manber present an algorithm for quickly finding matches between online texts and multiple search patterns. Because that algorithm is designed to operate entirely in primary storage, the number of search patterns it can handle is limited by the primary storage available. Moving the data structures presented in that paper to secondary storage without modification results in undesirably poor performance, which becomes impractical for massive numbers of search patterns.
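By way of illustration only, the following is a minimal sketch of a shift-table search in the general style of the Wu and Manber algorithm. It is a simplification and not the method of the cited report: Python dictionaries stand in for hashed arrays, no prefix table is used, and the names `build_tables`, `search`, and `block` are hypothetical. Both tables are held in memory, which illustrates why the number of patterns is bounded by primary storage.

```python
from collections import defaultdict

def build_tables(patterns, block=2):
    """Build in-memory shift and bucket tables (simplified, hypothetical layout)."""
    m = min(len(p) for p in patterns)      # only the first m characters of each pattern are indexed
    if m < block:
        raise ValueError("every pattern must be at least `block` characters long")
    default_shift = m - block + 1          # safe shift for a block found in no pattern prefix
    shift = {}                             # block -> smallest safe shift
    buckets = defaultdict(list)            # final block of the m-prefix -> candidate patterns
    for p in patterns:
        for j in range(block, m + 1):      # j is the end offset of the block within the m-prefix
            blk = p[j - block:j]
            shift[blk] = min(shift.get(blk, default_shift), m - j)
        buckets[p[m - block:m]].append(p)
    return m, default_shift, shift, buckets

def search(text, patterns, block=2):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    m, default_shift, shift, buckets = build_tables(patterns, block)
    hits = []
    i = m                                  # exclusive end index of the current m-character window
    while i <= len(text):
        blk = text[i - block:i]
        s = shift.get(blk, default_shift)
        if s == 0:
            # The window ends with the final block of some pattern's m-prefix:
            # verify each candidate in full against the text.
            start = i - m
            for p in buckets[blk]:
                if text.startswith(p, start):
                    hits.append((start, p))
            i += 1
        else:
            i += s                         # skip positions that cannot end a match
    return hits
```

For example, `search("the quick brown fox", ["quick", "fox"])` reports each pattern together with the offset at which it begins. Because every lookup touches the `shift` and `buckets` tables at effectively random keys, placing these tables unmodified in secondary storage would incur a storage access for nearly every window examined.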
Accordingly, what is needed is a system and method for multi-pattern searching that allows search pattern data to reside in secondary storage while maintaining good performance with hundreds of millions to billions of search patterns.