1. Field of the Invention
This invention relates to the field of data processing systems. More particularly, this invention relates to searching within a sequence of character data for occurrence of one or more words each formed of a predetermined sequence of characters.
2. Description of the Prior Art
It is known to provide content scanning systems that search within files of character data to identify predetermined strings. A typical such system would be an E-mail filtering system in which the strings being search for were members of a predefined collection of banned words, such as swear words.
A problem with such systems is that the volumes of data needing to be checked, such as the volume of E-mail traffic, is tending to increase at the same time as the number of strings that need to be checked for is also increasing. These factors combine to disadvantageously increase the processing load associated with content scanning of this type. Measures that can address this problem are strongly advantageous.
Viewed from one aspect the present invention provides a computer program product comprising a computer program operable to control a computer to search within a sequence of character data for occurrence of any of a plurality strings each formed of a predetermined sequence of characters, said computer program comprising:
(i) character identifying logic operable to identify a target character at a target search position within said sequence of character data;
(ii) string identifying logic operable to compare character data preceding said target character with data indicative of any of said plurality of strings that terminate with said target character to identify an occurrence of any of said plurality of strings that terminate with said target character; and
(iii) target search position advancing logic operable to advance said target search position within said sequence of character data by a predetermined number of character positions equal to a smallest number of character positions separating said target character from a last character within any of said plurality of strings.
The invention provides a modified form of the Boyer-Moore Fast String Searching Algorithm as is described in the publication xe2x80x9cA Fast String Searching Algorithmxe2x80x9d, R. S. Boyer with J. S. Moore, Communications of the Association for Computing Machinery, Volume 20, No. 10, page 762-772, 1977. In particular, the invention recognises that this approach may be adapted to search for a plurality of strings each formed of a predetermined sequence of characters in a single pass through the document in a manner that increases the overall efficiency of the search and allows it to cope with increasing data volumes and increasing numbers of strings for which a search is to be made.
The advancing of the search position is preferably controlled by a single jump table established for all of the plurality of strings for which a search is being made. The jump table holds for each character the closest proximity of that character to the end of any of the plurality of strings for which a search is being made. If a character does not appear in any of the strings for which a search is being made, then a jump equal to the shortest of the strings for which a search is being made.
A tree of search portions is kept. There are two distinct portion types in the structure.
An implementor may wish to describe explicitly within the portion its type.
1. A leaf portion. This is represented by:
The maximum length of all strings it contains.
The Depth of this portion in the tree.
A table of the strings matched by this portion.
(The type of the portion).
2. A Branch portion. This is represented by:
The depth of this portion.
The strings which are impicitly matched by this portion.
A table indexed by character describing which portion continues the search.
The portion it describes could be a leaf or branch portion.
(The type of the portion).
The preferred implementation has a different structure for the first portion in the tree. This is because it has no implicit matches (0 length strings) and an opportunity to optimize the first characterxe2x86x92portion lookup.
Other aspects of the invention also provide a method for searching within a sequence of character data for predetermined strings and an apparatus for searching within a sequence of character data for predetermined strings.
The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.