Full text search (FTS) systems search for relevant documents based on key words entered by a system user. The user enters a set of terms, referred to as tokens, and the FTS system finds documents containing all of the terms in the set. In order to support queries efficiently, the FTS system typically uses inverted indexes. For example, Lucene (described at http://lucene.apache.org/) and SQLite's FTS module (described at http://www.sqlite.org/cvstrac/wiki?p=FtsUsage) are both FTS systems that use inverted indexes.
An inverted index assigns a set of document identifiers to each token. The document identifiers are associated with documents that include the token at least once. Upon receiving a search request, the FTS system selects the set of document identifiers for each token in the request and then compares the document sets to each other. If a document identifier is contained in all document sets, the FTS system provides the document identifier in a result set of all identifiers contained in all document sets.
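The lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tokens are obtained by lowercasing and splitting on whitespace, the index is held in memory, and the names build_inverted_index and search are hypothetical rather than taken from any particular FTS system.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: whitespace tokenization with case folding.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, tokens):
    """Return the identifiers of documents that contain every token."""
    if not tokens:
        return set()
    # Select the document set for each token, then keep only the
    # identifiers that appear in all of the sets.
    doc_sets = [index.get(token.lower(), set()) for token in tokens]
    return set.intersection(*doc_sets)


documents = {
    1: "Neuschwanstein castle in Bavaria",
    2: "Munich is the capital of Bavaria",
    3: "A castle near Paris",
}
index = build_inverted_index(documents)
result = search(index, ["Bavaria", "Neuschwanstein"])
```

In this sketch, only document 1 contains both tokens, so it is the only identifier in the result set.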
From a logical point of view, the inverted index can be regarded as a relation InvertedIndex(Term, DocID) with a combined index on Term and DocID. The inverted index allows the FTS system to efficiently execute queries such as Query 1:
SELECT DocID FROM InvertedIndex WHERE Term='Neuschwanstein'
If only a small number of documents belong to the result set, the FTS system's performance is generally good. If a user searches for documents that contain two terms ‘Bavaria’ and ‘Neuschwanstein,’ the FTS system executes a query such as Query 2:
SELECT DocID FROM InvertedIndex WHERE Term='Bavaria'
INTERSECT
SELECT DocID FROM InvertedIndex WHERE Term='Neuschwanstein'
Assume a database has one million documents containing the term ‘Bavaria’ and ten documents containing the term ‘Neuschwanstein.’ Although the result set for Query 2 is no larger than the result set for Query 1, Query 2 takes much longer because the FTS system has to iterate over all one million document identifiers belonging to the term ‘Bavaria.’
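The cost difference can be made concrete with a small Python sketch. It assumes that each term's document identifiers are stored as a sorted list and that the intersection is computed by a linear merge. This is one plausible evaluation strategy, not necessarily the one used by any given FTS system. The function name intersect_sorted and the identifier values are hypothetical.

```python
def intersect_sorted(a, b):
    """Merge-intersect two sorted lists of document identifiers.

    Returns (matches, steps), where steps counts loop iterations
    and serves as a rough measure of the work performed.
    """
    i = j = steps = 0
    matches = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches, steps


# Assumed data: one million identifiers for 'Bavaria', ten for 'Neuschwanstein'.
bavaria = list(range(1, 1_000_001))
neuschwanstein = [100_000 * k for k in range(1, 11)]

matches, steps = intersect_sorted(bavaria, neuschwanstein)
```

With this data the result contains only ten identifiers, yet the merge loop advances through the entire ‘Bavaria’ list before it terminates, so the work is governed by the larger list rather than by the size of the result.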
Another problem with FTS systems occurs when a search request includes a non-selective token. A non-selective token is a token found in most documents, such as the so-called stop words “and,” “or,” and “the.” For example, if a search request includes the two tokens “NAVTEQ” and “the,” there may be one hundred documents in the database containing the token “NAVTEQ” and one million documents containing “the.” To generate the result set, the one hundred document identifiers associated with the documents that include the term “NAVTEQ” are compared to the one million document identifiers associated with the documents that include the term “the,” which may take a long time. Therefore, some FTS systems do not include stop words like “the” in the inverted index.
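A minimal Python sketch of this stop-word approach follows. The stop-word list, the whitespace tokenization, and the function names are assumptions made for illustration only; real FTS systems differ in how they define and apply stop words.

```python
from collections import defaultdict

# Hypothetical stop-word list; real systems use longer, language-specific lists.
STOP_WORDS = {"and", "or", "the"}


def build_index_without_stop_words(documents, stop_words=STOP_WORDS):
    """Build an inverted index that omits stop words entirely."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return index


def search(index, tokens, stop_words=STOP_WORDS):
    """Intersect the document sets of the non-stop-word tokens only."""
    terms = [t.lower() for t in tokens if t.lower() not in stop_words]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


documents = {
    1: "the NAVTEQ map",
    2: "the river and the bridge",
}
index = build_index_without_stop_words(documents)
result = search(index, ["NAVTEQ", "the"])
```

Because “the” is never indexed, the query is answered from the small “NAVTEQ” set alone. The trade-off is that a stop word can no longer influence the result at all, even in requests where it carries meaning.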
However, some non-selective tokens are more meaningful than others. Thus, it would be beneficial to have a more efficient way to handle those non-selective but nevertheless meaningful terms when performing a full text search.