In today's world of widely available information, answering a query of the kind “which documents contain word X?” by exhaustively scanning all available documents (possibly several million of them) may take unacceptable time (e.g., several hours) and computing power. State-of-the-art information retrieval solutions allow the building of indexes which map words to documents. Such an index, generally called an inverted index, offers a representation suited to answering queries of the kind “which documents contain word X?”. Storing an index thus optimizes speed and performance in finding relevant documents for a search query.
A query is often expressed in a form like “car AND (boat OR ship)”. The Boolean operators are handled at a high level, while at the low level the query is processed as a set of simple queries like “which documents contain word X”. Such a single word query is generally executed following the steps of:
(a) looking into a term dictionary table to identify a Term_id associated with a searched word; and
(b) looking into a term occurrence table to identify documents (Doc_id) associated with the Term_id previously identified.
Both steps are performed on tables, and specifically on the first column of each table, which is generally ordered alphabetically or numerically. Accessing tables that may be stored on a disk device may be a very slow operation, as random access to rows implies physical disk head movement. To compensate for this slow access, an efficient bisection algorithm, requiring log2(N) steps, where N is the number of rows of the table, is often required.
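By way of illustration only, the two lookup steps and the bisection on the first column may be sketched as follows. The table contents, the function names and the use of in-memory Python lists are assumptions made for this sketch; they are not taken from the figures.

```python
from bisect import bisect_left

# Hypothetical tables, each sorted on its first column.
TERM_DICTIONARY = [("boat", 0), ("car", 1), ("ship", 2), ("train", 3)]  # (term, term_id)
TERM_OCCURRENCES = [(0, 1), (1, 0), (1, 1), (2, 2), (3, 0)]  # (term_id, doc_id)


def lookup_term_id(word):
    """Step (a): bisect the term dictionary on its first column."""
    i = bisect_left(TERM_DICTIONARY, (word,))
    if i < len(TERM_DICTIONARY) and TERM_DICTIONARY[i][0] == word:
        return TERM_DICTIONARY[i][1]
    return None


def lookup_doc_ids(term_id):
    """Step (b): bisect to the first row for term_id, then read adjacent rows."""
    doc_ids = set()
    i = bisect_left(TERM_OCCURRENCES, (term_id,))
    while i < len(TERM_OCCURRENCES) and TERM_OCCURRENCES[i][0] == term_id:
        doc_ids.add(TERM_OCCURRENCES[i][1])
        i += 1
    return doc_ids


def documents_containing(word):
    """Answer the simple query 'which documents contain this word'."""
    term_id = lookup_term_id(word)
    return set() if term_id is None else lookup_doc_ids(term_id)


# High-level Boolean handling of "car AND (boat OR ship)" as set operations
# over three simple single word queries.
result = documents_containing("car") & (
    documents_containing("boat") | documents_containing("ship")
)
```

With these hypothetical tables, `result` is the set containing only document 1. All rows for one term-id are adjacent in the term occurrence table, so a single bisection followed by a sequential read suffices for step (b).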
A single word query is illustrated with reference to FIG. 1, where a simple inverted index is used. For the sake of illustration, only three documents are shown (0, 1, 2), each including a short sentence (102). However, the person skilled in the art would extend the description to longer text such as entire paragraphs or chapters of a book. A term extraction operation is applied to each sentence, which isolates each word by removing the punctuation and converting all words to lower case (as shown in table 104). A term dictionary table (106) is built from the term extraction table, where each item of a row (TERM) is one word of the term extraction table identified by a term-id (TERM-ID). The value of the identifier term-id may be the position of the term in the list, for example its position in alphabetical order. A term occurrence table (108) is built from the term dictionary table, mapping each term-id listed in the term dictionary table to the corresponding set of documents (DOC-ID) in which the original word appears (note that each term-id can be associated with more than one doc-id, indicating that the word is present in several documents).
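The construction of the three tables may be sketched as follows. The three sentences are hypothetical and are not the sentences of FIG. 1; the function names and the in-memory data structures are likewise assumptions of this sketch.

```python
import string

# Hypothetical three-document collection (not the sentences of FIG. 1).
DOCUMENTS = {
    0: "The car is red.",
    1: "Boats are slow, cars are fast.",
    2: "The ship is big.",
}


def extract_terms(text):
    """Term extraction: remove punctuation and lower-case every word."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()


def build_index(documents):
    """Build the term dictionary and the term occurrence table."""
    extraction = {doc_id: extract_terms(text) for doc_id, text in documents.items()}
    # Term dictionary: the term-id is the position in alphabetical order.
    terms = sorted({term for words in extraction.values() for term in words})
    term_dictionary = {term: term_id for term_id, term in enumerate(terms)}
    # Term occurrence table: sorted, de-duplicated (term_id, doc_id) pairs.
    occurrences = sorted(
        {
            (term_dictionary[term], doc_id)
            for doc_id, words in extraction.items()
            for term in words
        }
    )
    return term_dictionary, occurrences


term_dictionary, term_occurrences = build_index(DOCUMENTS)
```

For this hypothetical collection the dictionary holds 11 terms and the occurrence table holds 13 pairs; the term “the”, for instance, is mapped to documents 0 and 2.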
In the example of FIG. 1, 14 term-ids (0 to 13) have been mapped to 18 doc-ids in the term occurrence table 108. It is to be appreciated that in a real context of millions of indexed documents, the term dictionary table can grow to around the size of the dictionary for the language in use (on the order of 10^5 terms), while the size of the term occurrence table will always be equal to the number of documents times the average number of words per document (10^9 entries or more).
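As a rough, illustrative calculation, the log2(N) bisection cost mentioned above can be evaluated for these two orders of magnitude. The helper name is hypothetical, and the row counts are simply the figures quoted in the text.

```python
import math


def bisection_steps(rows):
    """Upper bound on bisection steps for a sorted table of `rows` rows."""
    return math.ceil(math.log2(rows))


dictionary_steps = bisection_steps(10**5)  # term dictionary table
occurrence_steps = bisection_steps(10**9)  # term occurrence table
```

Here `dictionary_steps` evaluates to 17 and `occurrence_steps` to 30. If the table resides on disk, each step may imply a random access.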
Moreover, in many if not all applications, one would benefit from retrieving documents containing small variations of a word specified in the query. For example, when searching for the word “election”, it could be highly desirable that a document not including the word “election” but including the word “elected” be identified as well. Some linguistic analysis can be done algorithmically on each word to identify the “stem” form of a word. A stem form is the form from which all the variations of a word are generated. For example, the singular form of a noun is the origin of the plural form, and the infinitive form of a verb is the origin of the past form and the progressive form. In some languages (e.g., French, Italian) the number of variations for a word can be very high. Stemming also includes derived adjectives, adverbs or nouns. For example, the form “base” can be considered the stem for the words “basic”, “basically”, “based”, “basing”, “bases”. An alternative to algorithmic stemming is the thesaurus approach. FIG. 2 shows a table (200) illustrating stemmed forms (204) obtained from original words (202).
Several approaches are known for executing stemmed searches using an inverted index, assuming a stemming algorithm is available for the language of the indexed documents. The best-known and most widely used methods are now listed.
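The thesaurus approach may be sketched as a simple lookup from each variation to its stem. The table below contains only the “base” example given above; the table name, the function name and the rule of returning an unknown word unchanged are assumptions of this sketch, not a description of any particular stemming algorithm.

```python
# Hypothetical thesaurus: maps each known variation to its stem form.
THESAURUS = {
    "basic": "base",
    "basically": "base",
    "based": "base",
    "basing": "base",
    "bases": "base",
}


def stem(word):
    """Return the stem of a word; unknown words are returned unchanged."""
    word = word.lower()
    return THESAURUS.get(word, word)
```

For example, `stem("Basically")` and `stem("base")` both return `"base"`, so the two words would be indexed and searched under the same term.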
A first approach is to build an inverted index of stemmed words, as illustrated in FIG. 3. To allow better understanding of the description, the same set of documents (302) with the same sentences is used as in FIG. 1. A term extraction operation using a stemming algorithm is applied to each sentence, which leads to the creation of a term extraction table (304). A term dictionary table (306) is built from the term extraction table (304), where each item (TERM) is associated with its position number in the list (TERM-ID), for example here its position in alphabetical order. A term occurrence table (308) is built from the term dictionary table and contains pairs of integers (TERM-ID/DOC-ID), wherein each term-id listed in the term dictionary table is mapped to the corresponding set of documents (doc-id) in which it appears. In this example, the size of the term dictionary table (306) is equal to 10 items, which is smaller than for a simple extraction as in FIG. 1 because many terms collapse into a single term. The size of the term occurrence table (308) remains identical to the case of a simple extraction. However, the main drawback of such an approach is that the inverted index of stemmed words does not permit searches for original words, because the queried terms have to be preprocessed with the same stemming algorithm used for indexing. The original words are then ‘lost’ by the stemming operation and no longer appear in the dictionary. This loss of functionality is rarely acceptable. A common remedy to this drawback is to build two indexes, one for original words and one for stemmed forms, and to use the appropriate index when searching for a word. However, gaining complete search functionality results in a doubling of the index size, thus highly impacting the search speed, particularly because of a worse cache memory hit ratio.
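The loss of original words, and the size cost of keeping two indexes, can be seen in a small sketch. The stem table, the pre-extracted documents and all names below are hypothetical and do not reproduce FIG. 3.

```python
# Hypothetical stem table and pre-extracted documents (not those of FIG. 3).
STEMS = {"are": "be", "is": "be", "cars": "car", "boats": "boat"}
DOCUMENTS = {
    0: ["the", "car", "is", "red"],
    1: ["boats", "are", "slow"],
    2: ["cars", "are", "fast"],
}


def stem(word):
    """Return the stem of a word; unknown words are returned unchanged."""
    return STEMS.get(word, word)


def build_postings(documents, normalize):
    """Map each normalized term to the set of documents containing it."""
    postings = {}
    for doc_id, words in documents.items():
        for word in words:
            postings.setdefault(normalize(word), set()).add(doc_id)
    return postings


def posting_count(postings):
    """Number of (term, document) pairs, i.e. rows of the occurrence table."""
    return sum(len(doc_ids) for doc_ids in postings.values())


original_index = build_postings(DOCUMENTS, lambda word: word)
stemmed_index = build_postings(DOCUMENTS, stem)
two_index_rows = posting_count(original_index) + posting_count(stemmed_index)
```

In this sketch the stemmed dictionary shrinks from 9 terms to 7 and the term “are” can no longer be looked up in it, while the number of occurrence rows stays at 10 in each index. Keeping both indexes therefore requires 20 rows, twice the size of either one.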
Another approach to achieve both stemmed and original word search ability is to expand the query to be executed into a logical OR of several queries, each query aiming at a different variation of the stemmed form of the word that is specified in the original query. For example, if the query is on the word “are” and a stemming operation is active, all words which have a stemmed form of “be” will be searched. A reverse-stemming algorithm or a thesaurus will enumerate all the derived forms of the stemmed word (that is, for the word “be”, the expansion would target the variations “be”, “being”, “been”, “am”, “is”, “are”, “was”, “were”), and the query will be executed as the logical operation “be” OR “being” OR “been” OR “am” OR “is” OR “are” OR “was” OR “were”. With the expanded query, complete functionality is achieved and the index size is not increased. However, search performance is degraded, as one search is expanded into N ORed logical searches. Additionally, with such a method the child searches can be scattered anywhere in the index. For example, taking the eight variations of the word “be”, a search in the term dictionary table (106) would provide two candidates, TERM_ID0 and TERM_ID5, for the variants “are” and “is”. As the two TERM_IDs are not adjacent, they would have to be read from different parts of the term occurrence table. As this table is generally a very large database, searching speed is highly impacted by such a query expansion searching method.
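Query expansion may be sketched as follows. The reverse-stemming table holds only the “be” example from the text. The index of original words is hypothetical; it is chosen so that “are” and “is” happen to receive term-ids 0 and 5, but it does not reproduce FIG. 1. All names are assumptions of this sketch.

```python
# Reverse-stemming table limited to the "be" example (hypothetical structure).
VARIATIONS = {"be": ["be", "being", "been", "am", "is", "are", "was", "were"]}
STEM_OF = {variation: s for s, forms in VARIATIONS.items() for variation in forms}

# Hypothetical index of original words; term-ids follow alphabetical order.
TERM_DICTIONARY = {
    "are": 0, "boats": 1, "car": 2, "cars": 3, "fast": 4,
    "is": 5, "red": 6, "slow": 7, "the": 8,
}
TERM_OCCURRENCES = {
    0: {1, 2}, 1: {1}, 2: {0}, 3: {2}, 4: {2},
    5: {0}, 6: {0}, 7: {1}, 8: {0},
}


def expand(word):
    """Enumerate all variations sharing the stem of the queried word."""
    stem = STEM_OF.get(word, word)
    return VARIATIONS.get(stem, [word])


def stemmed_search(word):
    """Run one child search per variation and OR the results together."""
    term_ids, doc_ids = [], set()
    for variation in expand(word):
        term_id = TERM_DICTIONARY.get(variation)
        if term_id is None:
            continue  # this variation does not occur in any indexed document
        term_ids.append(term_id)
        doc_ids |= TERM_OCCURRENCES[term_id]
    return sorted(term_ids), doc_ids


term_ids, doc_ids = stemmed_search("are")
```

In this sketch the single query on “are” triggers eight dictionary lookups, of which two succeed, and `term_ids` is `[0, 5]`: the matching rows are not adjacent, so the occurrence table is read in two separate places.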
U.S. 2002/0059161 to Li discloses a method and apparatus for query expansion using reduced size indices and for progressive query processing. Queries are expanded conceptually, using words semantically similar and syntactically related to those specified by the user in the query, to reduce the chances of missing relevant documents. The notion of a multi-granularity information and processing structure is used to support efficient query expansion, which involves an indexing phase, a query processing phase and a ranking phase.
Accordingly, performing a conceptual similarity search over a very large collection of documents by the known methods requires either a search in an inflated index or a sequence of sub-searches. An alternative remains to use only the original form of words and to ignore variations, but this lowers the quality of the search results.