The current flood of e-mail, newsgroup postings, and other network broadcasts has renewed interest in SDI (selective dissemination of information) systems, in which text samples are run against relatively static subscriber profiles. A text sample that satisfies a profile is selected and subsequently routed to the appropriate subscriber.
Selection is often done by matching keywords specified in the profiles. In advance of processing any sample text, words of interest are compiled in a dictionary. Each text sample is then processed by a keyword scanner, which consults the dictionary, looking for matches between dictionary words and substrings of the text. More generally, the text sample may be a sample stream consisting of arbitrary bytes, and the dictionary may consist of keywords or patterns of an arbitrary nature. In practice, such a keyword scanner must accommodate a large number of arbitrary keywords (say, ten thousand to one hundred thousand) and rapidly produce a vector of alarms, identifying each keyword found and the location of each of its occurrences in the sample stream. Other processes may then use the alarm vector for selection and routing purposes.
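The scanner interface described above can be illustrated by a brute-force baseline. This is only a sketch under stated assumptions: the function name `naive_scan` and the representation of an alarm as a `(keyword_id, offset)` pair are hypothetical choices made for illustration and do not come from the text. It searches for each keyword separately, so its cost grows with the size of the dictionary, which is exactly what the methods discussed next try to avoid.

```python
def naive_scan(sample: bytes, dictionary: list[bytes]) -> list[tuple[int, int]]:
    """Return an alarm vector of (keyword_id, offset) pairs, ordered by offset.

    Brute-force baseline: one pass over the sample per keyword.
    The alarm layout is an assumption made for this illustration.
    """
    alarms = []
    for kw_id, kw in enumerate(dictionary):
        if not kw:
            continue  # ignore empty keywords
        start = sample.find(kw)
        while start != -1:
            alarms.append((kw_id, start))
            # Resume one byte later so overlapping occurrences are found.
            start = sample.find(kw, start + 1)
    alarms.sort(key=lambda alarm: alarm[1])
    return alarms


if __name__ == "__main__":
    for kw_id, offset in naive_scan(b"abracadabra", [b"abra", b"cad", b"ra"]):
        print(kw_id, offset)
```

A downstream selection process would then compare the keyword identifiers in the alarm vector against each subscriber profile.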
The prior art contains several methods for searching a sample stream for all occurrences of dictionary keywords. One of the most successful approaches is the Aho-Corasick algorithm (A. V. Aho and M. J. Corasick, `Efficient String Matching: An Aid to Bibliographic Search,` Communications of the ACM, Vol.18, 1975, reprinted in Computer Algorithms: string pattern matching strategies, Jun-ichi Aoe, ed., IEEE Computer Press, Los Alamitos, Calif., pp. 78-85, 1994). The Aho-Corasick algorithm implements a finite state machine, which undergoes a transition with each byte read from the sample stream. When states representing the ends of keywords are reached, the keyword matches are reported. The Aho-Corasick algorithm must read every byte in the sample stream, limiting the speed unacceptably. It also requires a great deal of memory for fast operation.
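The state-machine behavior described above may be sketched as follows. This is a minimal textbook-style rendering of an Aho-Corasick automaton, not code taken from the cited paper; the names `build_automaton` and `scan`, the dictionary-based transition tables, and the `(keyword, offset)` alarm format are assumptions made for illustration. Note that the scanning loop makes one transition per byte of the sample, which is the property criticized above.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, failure, and output tables for a list of bytes keywords."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # fail[state] -> fallback state when no transition exists
    out = [[]]    # out[state] -> indices of keywords ending at this state
    for idx, kw in enumerate(keywords):
        state = 0
        for b in kw:
            nxt = goto[state].get(b)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = nxt
            state = nxt
        out[state].append(idx)
    # Breadth-first pass: failure links of shallower states are final
    # before deeper states consult them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def scan(sample, keywords):
    """Return (keyword, start_offset) alarms for every occurrence in sample."""
    goto, fail, out = build_automaton(keywords)
    alarms = []
    state = 0
    for pos, b in enumerate(sample):  # every byte is read exactly once
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for idx in out[state]:
            alarms.append((keywords[idx], pos - len(keywords[idx]) + 1))
    return alarms


if __name__ == "__main__":
    print(scan(b"ushers", [b"he", b"she", b"his", b"hers"]))
```

The per-state dictionaries used here keep the sketch small. A table indexed directly by byte value would make each transition faster at the cost of 256 entries per state, which illustrates the memory-for-speed trade-off noted above.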
An alternative method was disclosed by Commentz-Walter (B. Commentz-Walter, `A String Matching Algorithm Fast on the Average,` Proceedings of the International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, Vol. 71, H. A. Maurer, ed., Springer-Verlag, Berlin, 1979) and improved by Baeza-Yates and Regnier (R. Baeza-Yates and M. Regnier, `Fast Algorithms for Two Dimensional and Multiple Pattern Matching,` SWAT 90, 2nd Scandinavian Workshop on Algorithm Theory, Lecture Notes in Computer Science, Vol. 447, J. R. Gilbert and R. Karlsson, eds., Springer-Verlag, Berlin, pp. 332-347, 1990). This method, too, employs a finite state machine, driven by bytes taken backwards from a pivot point in the sample stream. Unlike the Aho-Corasick algorithm, it may "skip" (without examination) some of the sample stream bytes by advancing the pivot by more than one byte, but it is slowed by the need to revisit bytes already examined. For large dictionaries, revisiting dominates skipping, and the method is slower than the Aho-Corasick algorithm. Its memory requirements are commensurate with those of the Aho-Corasick algorithm.
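The backward-scanning and skipping behavior can be illustrated with a simplified relative of this method. The sketch below is not the Commentz-Walter algorithm itself; it is a multi-keyword Horspool-style scanner (a trie of reversed keywords plus a single shift table keyed on the byte at the pivot), chosen because it is short enough to show both the skip and the revisiting. The names `build_reverse_trie`, `build_shift`, and `backward_scan` are hypothetical, and all keywords are assumed to be non-empty.

```python
def build_reverse_trie(keywords):
    """Trie over reversed keywords; ends[node] lists keywords ending there."""
    trie = [{}]
    ends = [[]]
    for idx, kw in enumerate(keywords):
        node = 0
        for b in reversed(kw):
            nxt = trie[node].get(b)
            if nxt is None:
                nxt = len(trie)
                trie.append({})
                ends.append([])
                trie[node][b] = nxt
            node = nxt
        ends[node].append(idx)
    return trie, ends


def build_shift(keywords):
    """Safe pivot advance for each byte value, capped at the shortest keyword."""
    lmin = min(len(kw) for kw in keywords)
    shift = [lmin] * 256
    for kw in keywords:
        for i in range(len(kw) - 1):      # the last byte gives no shift
            dist = len(kw) - 1 - i
            if dist < shift[kw[i]]:
                shift[kw[i]] = dist
    return lmin, shift


def backward_scan(sample, keywords):
    """Return (keyword, start_offset) alarms, reading backwards from a pivot."""
    trie, ends = build_reverse_trie(keywords)
    lmin, shift = build_shift(keywords)
    alarms = []
    pivot = lmin - 1                      # candidate position of a keyword's last byte
    while pivot < len(sample):
        node, i = 0, pivot
        # Walk backwards from the pivot; bytes left of the pivot may be
        # bytes that an earlier pivot already examined (revisiting).
        while i >= 0 and sample[i] in trie[node]:
            node = trie[node][sample[i]]
            for idx in ends[node]:
                alarms.append((keywords[idx], i))
            i -= 1
        # Advance by more than one byte when the pivot byte permits (skipping).
        pivot += shift[sample[pivot]]
    return alarms


if __name__ == "__main__":
    print(backward_scan(b"ushers", [b"he", b"she", b"his", b"hers"]))
```

In this simplified scheme the shift can never exceed the length of the shortest keyword, and with a large dictionary most byte values occur near the end of some keyword, so the typical shift shrinks toward one byte while the backward walks continue to revisit bytes. This is consistent with the observation above that revisiting dominates skipping for large dictionaries.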
Other methods require the use of special-purpose hardware to be competitive.