Information retrieval refers to the process of identifying occurrences in a target document of words in a query or query document. Information retrieval can be gainfully applied in several situations, including processing explicit user search queries, identifying documents relating to a particular document, judging the similarities of two documents, extracting the features of a document and summarizing a document.
Information retrieval typically involves a two-stage process: (1) In an indexing stage, a document is initially indexed by (a) converting each word in the document into a series of characters intelligible to and differentiable by an information retrieval engine, called a "token" (known as "tokenizing" the document) and (b) creating an index mapping from each token to the location in the document where the token occurs. (2) In a query phase, a query (or query document) is similarly tokenized and compared to the index to identify locations in the document at which tokens in the tokenized query occur.
FIG. 1 is an overview data flow diagram depicting the information retrieval process. In the indexing stage, a target document 111 is submitted to a tokenizer 120. The target document is comprised of a number of strings, such as sentences, each occurring at a particular location in the target document. The strings in the target document and their word locations are passed to the tokenizer 120, which converts the words in each string into a series of tokens that are intelligible to and distinguishable by an information retrieval engine 130. An index construction portion 131 of the information retrieval engine 130 adds the tokens and their locations to an index 140. The index maps each unique token to the locations at which it occurs in the target document. This process may be repeated to add a number of different target documents to the index, if desired. If the index 140 thus represents the text in a number of target documents, the location information preferably includes, for each location, an indication of the document to which the location corresponds.
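The indexing stage described above can be sketched as follows. This is a minimal illustration only, not the implementation of the tokenizer 120 or the index construction portion 131: the function names, the document identifier "doc1", and the choice of a (document, string number, word number) tuple as the location are all assumptions made for the example, and the tokenizer here merely lower-cases and splits words.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return (token, word_number) pairs for one string.

    Assumption: a token is simply the lower-cased word; a real tokenizer
    would apply further transformations.
    """
    matches = re.finditer(r"[a-z0-9]+", text.lower())
    return [(m.group(0), word_no) for word_no, m in enumerate(matches)]


def add_to_index(index, doc_id, strings):
    """Add every token of every string in one target document to the index.

    Each location records the document it belongs to, so a single index
    can represent several target documents.
    """
    for string_no, string in enumerate(strings):
        for token, word_no in tokenize(string):
            index[token].append((doc_id, string_no, word_no))


# The index maps each unique token to the list of locations where it occurs.
index = defaultdict(list)
add_to_index(index, "doc1", ["The father is holding the baby."])
```

Calling add_to_index again with a different document identifier adds a further target document to the same index.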
In the query phase, a textual query 112 is submitted to the tokenizer 120. The query may be a single string, or sentence, or may be an entire document comprised of a number of strings. The tokenizer 120 converts the words in the text of the query 112 into tokens in the same manner that it converted the words in the target document into tokens. The tokenizer 120 passes these tokens to an index retrieval portion 132 of the information retrieval engine 130. The index retrieval portion of the information retrieval engine searches the index 140 for occurrences of the tokens in the target document. For each of the tokens, the index retrieval portion of the information retrieval engine identifies the locations at which the token occurs in the target document. This list of locations is returned as the query result 113.
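The query phase can be illustrated in the same hedged way. The sketch below assumes a deliberately simple tokenizer and a (document, word position) location format; the names build_index and run_query are hypothetical and do not correspond to any component named in FIG. 1. The essential point it shows is that the query is tokenized by the same function used for the target document, and that the result is the list of locations found for each query token.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Assumed tokenizer: lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the (doc_id, word_position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(tokenize(text)):
            index[token].append((doc_id, position))
    return index


def run_query(index, query_text):
    """Tokenize the query like the target document and look up each token."""
    result = {}
    for token in tokenize(query_text):
        # index.get avoids creating empty entries for unseen tokens.
        result[token] = list(index.get(token, []))
    return result


index = build_index({"doc1": "The father is holding the baby."})
result = run_query(index, "holding the baby")
```

A query token that does not occur in any target document simply yields an empty list of locations.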
Conventional tokenizers typically involve superficial transformations of the input text, such as changing each upper-case character to lower-case, identifying the individual words in the input text, and removing suffixes from the words. For example, a conventional tokenizer might convert the input text string
The father is holding the baby.
into the following tokens:
the
father
is
hold
the
baby
This approach to tokenization tends to make searches based on it overinclusive of occurrences in which the senses of words differ from the sense intended in the query text. For example, the sample input text string uses the verb "hold" in the sense that means "to support or grasp." However, the token "hold" could also match uses of the word "hold" that mean "the cargo area of a ship." This approach to tokenization also tends to be overinclusive of occurrences in which the words relate to each other differently than the words in the query text. For example, the sample input text string above, in which "father" is the subject of the verb "hold" and "baby" is the object, might match the sentence "The father and the baby held the toy," in which "baby" is a subject, not an object. This approach is further underinclusive of occurrences that use a different, but semantically related, word in place of a word of the query text. For example, the input text string above would not match the text string "The parent is holding the baby." Given these disadvantages of conventional tokenization, a tokenizer that represents the semantic relationships implicit in the tokenized text would have significant utility.
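The behavior of a conventional tokenizer, and the over- and underinclusiveness described above, can be reproduced with a short sketch. The suffix list, the minimum stem length, and the one-entry table of irregular forms are assumptions chosen only so that the example sentences behave as described; they are not taken from any particular tokenizer.

```python
import re

# Assumed suffix list; real suffix-removal rules are more elaborate.
SUFFIXES = ("ing", "ed", "es", "s")
# Hypothetical table for irregular forms, which suffix removal cannot handle.
IRREGULAR = {"held": "hold"}


def strip_suffix(word):
    """Remove one known suffix, keeping at least three characters of stem."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word


def conventional_tokenize(text):
    """Lower-case the text, identify the words, and strip suffixes."""
    words = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(word) for word in words]


query = conventional_tokenize("The father is holding the baby.")
other = conventional_tokenize("The father and the baby held the toy.")
parent = conventional_tokenize("The parent is holding the baby.")

# Overinclusive: "father", "hold", and "baby" all occur in the second
# sentence, although "baby" is a subject there rather than an object.
shared_with_other = {"father", "hold", "baby"} & set(other)

# Underinclusive: "parent" shares no token with "father", so the
# semantically related sentence is missed.
father_found_in_parent = "father" in parent
```

Because the tokens carry no information about word sense or about which word is the subject and which the object, nothing in this representation distinguishes the matching sentence from the sample input text string.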