The invention relates generally to storage and retrieval of text-containing documents, and more specifically, to matching a search string to words extracted from documents.
The invention has particular but not exclusive application to document retrieval systems used on the World-Wide Web ("Web"). There is currently a wide-spread need for compact search engines and databases that quickly identify and retrieve documents, such as Web pages, in response to search queries. Such queries are usually directed to finding documents that contain specific words.
Various aspects of such document retrieval systems are well known. It is common practice, for example, to parse documents and to create a lexicon containing words extracted from the documents. To reduce storage and simplify operation, words in the lexicon are assigned unique identifying numbers, and a document look-up table uses such numbers, rather than character strings, to identify documents that contain particular words. Various types of searches are known, including exact match searches, prefix searches, and wildcard searches. Also of interest are searches referred to as "fuzzy" searches, which identify terms loosely matching a search string.
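The conventional arrangement described above, a lexicon of numbered words plus a document look-up table keyed by word number, can be sketched as follows. This is a minimal illustration only: the function name `index_documents`, the dictionary-based tables, and the lower-case alphabetic tokenizer are assumptions made for the example and are not taken from the specification.

```python
import re


def index_documents(docs):
    """Build a lexicon and a document look-up table (hypothetical helper).

    docs maps a document id to its text.
    """
    lexicon = {}    # word -> unique identifying number
    doc_table = {}  # word number -> set of ids of documents containing the word
    for doc_id, text in docs.items():
        # Assumed tokenizer: runs of ASCII letters, case-folded.
        for word in re.findall(r"[a-z]+", text.lower()):
            # Assign the next free number the first time a word is seen.
            number = lexicon.setdefault(word, len(lexicon))
            doc_table.setdefault(number, set()).add(doc_id)
    return lexicon, doc_table


docs = {"d1": "The cat sat", "d2": "A cat ran"}
lexicon, doc_table = index_documents(docs)
# Documents are identified through the word number, not the character string.
matching_docs = doc_table[lexicon["cat"]]
```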
The invention is concerned primarily with the word matching process underlying such systems. Various techniques are known for matching words with search strings. A string can be compared sequentially with each word in a lexicon to identify a matching word set but such a process is very time consuming. A complete indexing of characters in each word permits very fast exact match and prefix searches but places considerable demand on disk space. Numerous techniques are known for partial indexing of word lists on prefix values (starting characters) or word length with a view to reducing the number of words that must actually be compared with a search string. A very well known search technique involves use of a binary tree. The search algorithm associated with a binary tree very quickly reduces the number of lexical nodes that must be compared with a search string. However, the search algorithm repeatedly accesses a disk drive storing the tree structure as nodes are traversed, which severely impairs retrieval time. Another problem is that prior art methods do not necessarily lend themselves to performing various searches, including wildcard and fuzzy searches, quickly and effectively.
In one aspect, the invention provides a document retrieval system that retrieves documents in response to a search string identifying one or more words expected to be found in documents of interest. The system includes a lexicon that stores a collection of words extracted from the documents and associates each word with an identifying number. A document look-up table relates the word numbers to documents containing the associated words to permit identification and retrieval of appropriate documents. A word look-up table groups the words of the lexicon into sets with common characteristics (preferably prefix values and length), and a character look-up table identifies whether any word in the lexicon contains a specific character. In response to a search string, set generating means access the word look-up table to identify a set of target words whose characteristics correspond to characteristics of the search string. Set refining means then reduce the target set by selecting a set of characters from the search string, accessing the character look-up table to identify whether each target word uses the selected character set, and excluding from the target set those words that do not contain either the entire character set or a predetermined number of the selected characters. String comparison means then access the lexicon to perform a direct comparison of the words remaining in the target set with the search string.
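The three stages just described (set generation, set refining, and direct string comparison) can be sketched for an exact match search as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the names `build_index` and `exact_search` are hypothetical, the word characteristics are taken to be a one-character prefix and the word length, and the character look-up table is modelled with Python sets for readability.

```python
from collections import defaultdict


def build_index(words, prefix_len=1):
    """Number the words so each (prefix, length) group is a consecutive range."""
    lexicon = sorted(set(words), key=lambda w: (w[:prefix_len], len(w), w))
    word_table = {}                # (prefix, length) -> (first, last), inclusive
    char_table = defaultdict(set)  # character -> numbers of words containing it
    for number, word in enumerate(lexicon):
        key = (word[:prefix_len], len(word))
        first, _ = word_table.get(key, (number, number))
        word_table[key] = (first, number)
        for ch in set(word):
            char_table[ch].add(number)
    return lexicon, word_table, char_table


def exact_search(query, lexicon, word_table, char_table, prefix_len=1):
    """Return the word number matching query, or None if no word matches."""
    # Set generation: words sharing the query's prefix and length.
    key = (query[:prefix_len], len(query))
    if key not in word_table:
        return None
    first, last = word_table[key]
    target = set(range(first, last + 1))
    # Set refining: exclude words lacking any character of the query.
    for ch in set(query):
        target &= char_table.get(ch, set())
        if not target:
            return None  # reported without any direct string comparison
    # String comparison: only the surviving words are compared directly.
    for number in sorted(target):
        if lexicon[number] == query:
            return number
    return None


lexicon, word_table, char_table = build_index(
    ["cat", "cot", "car", "dog", "cart", "act"]
)
found = exact_search("cat", lexicon, word_table, char_table)
```

In this sketch a query such as "cab" is rejected during refining, because no word in the lexicon contains the character "b", so the lexicon itself is never consulted.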
The search process associated with the system has several advantages. The preliminary target set is normally a small subset of the lexicon, which reduces relatively time-consuming direct comparison of words with the search string. The set refining process further reduces the target set, culling words that do not use the same character set as the search string or a subset of those characters. Although character sequencing and frequency are important factors in predicting a word match, the requirement for a common character set and equal or similar word lengths results in a high probability that any word remaining in the refined target set is a close match for the search string. In instances where no matching word exists, the result is often reported before any direct string comparisons are performed. Moreover, the search process lends itself to implementation of various searches, including fuzzy and wildcard searches, as will be apparent from the description of preferred embodiments.
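The looser form of refining, in which a word survives if it contains at least a predetermined number of the selected characters, might be sketched as below. The function name `fuzzy_refine`, the `min_hits` threshold parameter, and the use of plain word lists in place of word numbers are all assumptions made for illustration; the specification does not prescribe them here.

```python
def fuzzy_refine(candidates, query, min_hits):
    """Keep candidate words sharing at least min_hits distinct characters
    with the query (hypothetical helper; character order and frequency
    are deliberately ignored at this stage)."""
    query_chars = set(query)
    return [w for w in candidates if len(query_chars & set(w)) >= min_hits]


# With a threshold of 3, "cord" (shares only c and r) and "dogs" are culled.
survivors = fuzzy_refine(["cart", "card", "cord", "dogs"], "cart", 3)
```

Words that pass this filter would still be subject to a direct comparison against the search string, since a shared character set does not guarantee a match.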
The character look-up table can be conveniently implemented as a compact boolean array whose dimensions correspond to character value and word number and whose entries consist of a single bit. Word numbers are preferably assigned in such a manner that the word look-up table returns a target set consisting of consecutive word numbers for each set of words in the lexicon with common characteristics. This permits the set refining process to take advantage of the maximum bit processing count available from a digital processor when accessing the character look-up table, effectively culling groups of words simultaneously from the target set. Using a conventional 32-bit processor, words can potentially be eliminated in 32-member sets. Since boolean operations are inherently fast and since word numbers can be culled simultaneously according to a processor's maximum bit processing count, a very significant speed advantage is obtained.
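A rough sketch of such a bit-packed character look-up table follows. It assumes a 32-bit chunk size (`WORD_BITS`), one row of chunks per character, and a target set given as an inclusive range of consecutive word numbers; the names `build_char_rows` and `refine` are hypothetical. Python integers are unbounded, so the 32-bit chunking here only imitates what a 32-bit processor would do in a single AND operation.

```python
WORD_BITS = 32  # assumed processor width; each chunk covers 32 word numbers


def build_char_rows(lexicon):
    """Return {character: [chunk, ...]}; bit n of a row is set when word
    number n contains that character."""
    n_chunks = (len(lexicon) + WORD_BITS - 1) // WORD_BITS
    rows = {}
    for number, word in enumerate(lexicon):
        for ch in set(word):
            row = rows.setdefault(ch, [0] * n_chunks)
            row[number // WORD_BITS] |= 1 << (number % WORD_BITS)
    return rows


def refine(rows, first, last, chars):
    """Return the word numbers in [first, last] containing every character."""
    survivors = []
    for chunk in range(first // WORD_BITS, last // WORD_BITS + 1):
        base = chunk * WORD_BITS
        lo = max(first, base) - base
        hi = min(last, base + WORD_BITS - 1) - base
        # Start with a 1 bit for every word number of the range in this chunk.
        mask = ((1 << (hi - lo + 1)) - 1) << lo
        for ch in chars:
            row = rows.get(ch)
            # One AND culls up to 32 word numbers at once.
            mask &= row[chunk] if row else 0
            if mask == 0:
                break
        # Convert the remaining 1 bits back to word numbers.
        while mask:
            low = mask & -mask
            survivors.append(base + low.bit_length() - 1)
            mask ^= low
    return survivors


lexicon = ["act", "car", "cat", "cot", "cart", "dog"]
rows = build_char_rows(lexicon)
remaining = refine(rows, 1, 3, "cat")  # word numbers 1..3 are car, cat, cot
```

Because the target set is a run of consecutive word numbers, each chunk of a character row lines up with the same chunk of the target mask, which is what allows a whole chunk to be processed per operation.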
The term "set" as used in this specification in respect of search criteria, word lengths, matching criteria and values, word characteristics, and search string characteristics should be understood as identifying a set consisting of one or more members. Word sets and word number sets should be understood as potentially being null or empty. The term "target" as applied to a set of words or a set of word numbers identifies a set expected to contain, but not necessarily containing, a word or a word number associated with a word that will match a search string. The word "common" as used in this specification in respect of a set of characteristics, prefix values, word lengths and the like, refers to a specific value shared by a set of items.
The specification refers to the "excluding" of word numbers from target sets. Such exclusion can take different forms depending in large measure on how the target set is represented. For example, when forming a preliminary target set, selecting upper and lower range numbers to define the target set excludes other words and word numbers identified in the lexicon. During set refining, the target set may be represented with a string of bits, each bit corresponding to a different word number in the target set, and a word number may be excluded by setting the corresponding bit to 0. When converting the bit string representation of the target set to a list of word numbers, word numbers are effectively excluded by recording only those numbers associated with words likely to match a search string. Accordingly, the term "excluding" and comparable terms as used in respect of word numbers associated with a target set should be understood as encompassing any manner of identifying that a word number is not, or no longer remains, a member of the target set.
Other aspects of the invention and associated advantages will be described with reference to preferred embodiments, and various aspects of the invention will be more specifically defined in the appended claims.