This invention relates to a method and apparatus for performing multiple string searching.
A string is defined herein as a sequence of symbols from an alphabet. For example, a string may be a text string, formed from a sequence of ASCII characters. As another example, a string may comprise a DNA sequence, based on the four-symbol DNA "alphabet".
The string-search problem is to find the first (or all) occurrences of one or more key strings (or "patterns") within a target string (or "text").
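By way of illustration only, the problem statement above can be sketched as a brute-force search that tests every key string at every position of the target string. The function name `find_all_naive` and the sample data are hypothetical and are not taken from the invention; the sketch merely fixes what counts as an occurrence.

```python
def find_all_naive(keys, target):
    """Return (position, key) for every occurrence of every key in target.

    Illustrative baseline only: every key is compared at every position.
    """
    hits = []
    for pos in range(len(target)):
        for key in keys:
            # Empty keys are skipped, since they would match everywhere.
            if key and target.startswith(key, pos):
                hits.append((pos, key))
    return hits


# Example with a string over the four-symbol DNA alphabet.
print(find_all_naive(["GAT", "TTA"], "GATTACA"))
```

The cost of this baseline grows with the product of the target length and the number of key strings, which is the burden that multiple string search methods aim to reduce.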
The simplest case is where there is just a single key string. The single string search problem has been widely studied in the Computer Science community. A good survey of this problem is available on the World Wide Web at: http://www.dir.univ-rouen.fr/~charras/string/string.html.
The present invention is concerned with the problem of multiple string searching, that is, of finding occurrences of a plurality of key strings within a target string. The following references relate to the multiple string search problem:
Aho, A. V., and M. Corasick, "Efficient String Matching: An Aid to Bibliographic Search," CACM June 1975, Vol. 18, No. 6
Commentz-Walter, Beate, "A String Matching Algorithm Fast on the Average," Technical Report, IBM-Germany, Scientific Center Heidelberg, Tiergartenstrasse 15, D-6900 Heidelberg, Germany
Haertel, Mike, "kwset.c" (part of the GNU grep command), http://www.gsi.de/gnu/grep-2.0/
The object of the present invention is to provide a novel solution to the multiple string search problem.
According to the invention a data processing system comprises searching means for finding occurrences of a plurality of key strings within a target string, wherein the searching means comprises:
(a) means for forming a hash value from each of the key strings, and for adding each key string to a collection of key strings having the same hash value;
(b) means for selecting a plurality of symbol positions in the target string;
(c) means for forming a hash value at each selected symbol position in the target string and for using this hash value to select one of the collections of key strings; and
(d) means for comparing each key string in the selected collection of key strings with the target string.
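Elements (a) to (d) above can be illustrated by the following minimal sketch. It is not the claimed implementation and rests on several assumptions that the text above does not state: the hash is taken over the first `n` symbols of each string, `n` is chosen as the length of the shortest key string, Python's built-in `hash` stands in for the hash function, and every symbol position of the target string is selected. The names `build_table` and `search` are hypothetical.

```python
from collections import defaultdict


def build_table(keys, n):
    """(a) Hash each key and add it to the collection sharing that hash value.

    Assumption: the hash covers only the first n symbols of the key.
    """
    table = defaultdict(list)
    for key in keys:
        table[hash(key[:n])].append(key)
    return table


def search(keys, target):
    """Return (position, key) for every occurrence of every key in target."""
    if not keys:
        return []
    n = min(len(key) for key in keys)  # assumed window length
    if n == 0:
        raise ValueError("empty key strings are not supported in this sketch")
    table = build_table(keys, n)
    matches = []
    # (b) Select symbol positions; this sketch simply selects all of them.
    for pos in range(len(target) - n + 1):
        # (c) Hash the window at this position and select one collection.
        candidates = table.get(hash(target[pos:pos + n]), ())
        # (d) Compare each key in the selected collection with the target.
        for key in candidates:
            if target.startswith(key, pos):
                matches.append((pos, key))
    return matches


print(search(["he", "she", "hers"], "ushers"))
```

In this sketch the full comparison in step (d) is still required, because distinct strings may share a hash value, and keys sharing their first `n` symbols always fall in the same collection. The benefit is that only the keys in one collection are compared at each position rather than all of them.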