Full text search (FTS) systems search for relevant documents based on key words entered by a system user. The user enters a set of terms, referred to as tokens, and the FTS system retrieves documents that contain all of the terms in the set. In order to support efficient queries, the FTS system typically uses inverted indexes. For example, Lucene (described at http://lucene.apache.org/) and SQLite's FTS module (described at http://www.sqlite.org/cvstrac/wiki?p=FtsUsage) are both FTS systems that use inverted indexes.
An inverted index assigns a set of document identifiers to each token. Each document identifier in the set refers to a document that includes the token at least once. Upon receiving a search request, the FTS system selects the set of document identifiers for each token in the request and then intersects those sets. A document identifier that is contained in every set is included in the result set. From a logical point of view, the inverted index can be regarded as a relation InvertedIndex(Token, DocID) with a combined index on Token and DocID. The inverted index allows the FTS system to efficiently execute queries such as Query 1:
SELECT DocID FROM InvertedIndex WHERE Token='Neuschwanstein'
If only a small number of documents belong to the result set, the FTS system's performance is generally good. If a user searches for documents that contain two terms ‘Bavaria’ and ‘Neuschwanstein,’ the FTS system executes a query such as Query 2:
SELECT DocID FROM InvertedIndex WHERE Token='Bavaria'
INTERSECT
SELECT DocID FROM InvertedIndex WHERE Token='Neuschwanstein'
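Query 1 and Query 2 can be tried against a small in-memory table. The following is a minimal sketch using Python's standard sqlite3 module; the table layout follows the relation InvertedIndex(Token, DocID) described above, while the sample rows and the use of a composite primary key as the combined index are assumptions made for illustration only.

```python
import sqlite3

# In-memory database; the composite primary key stands in for the
# combined index on (Token, DocID). This layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE InvertedIndex ("
    "Token TEXT, DocID INTEGER, PRIMARY KEY (Token, DocID))"
)

# Hypothetical sample data: three documents mention 'Bavaria',
# one of them also mentions 'Neuschwanstein'.
rows = [
    ("Bavaria", 1),
    ("Bavaria", 2),
    ("Bavaria", 3),
    ("Neuschwanstein", 2),
    ("Munich", 3),
]
conn.executemany("INSERT INTO InvertedIndex VALUES (?, ?)", rows)

# Query 1: documents containing a single token.
query1 = "SELECT DocID FROM InvertedIndex WHERE Token = ?"

# Query 2: documents containing both tokens.
query2 = (
    "SELECT DocID FROM InvertedIndex WHERE Token = ? "
    "INTERSECT "
    "SELECT DocID FROM InvertedIndex WHERE Token = ?"
)

result1 = sorted(r[0] for r in conn.execute(query1, ("Neuschwanstein",)))
result2 = sorted(r[0] for r in conn.execute(query2, ("Bavaria", "Neuschwanstein")))
print(result1, result2)
```

With this sample data both queries return the same single document identifier, which mirrors the situation discussed next.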
Assume a database has one million documents containing the term ‘Bavaria’ and ten documents containing the term ‘Neuschwanstein.’ Although the size of the result set for Query 2 is equal to the size of the result set for Query 1, Query 2 takes much longer because the FTS system has to iterate over the one million document identifiers belonging to the term ‘Bavaria.’
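The cost difference can be illustrated with a scaled-down sketch. The function names build_inverted_index and search_all are hypothetical, and the counter models only the naive evaluation described above, in which every identifier of every requested token is read once; it is not a claim about how Lucene, SQLite, or any other particular engine evaluates the query.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def search_all(index, tokens):
    """Return (ids of documents containing all tokens, identifiers scanned)."""
    scanned = 0
    result = None
    for token in tokens:
        postings = index.get(token, set())
        # Assumed cost model: each identifier in the set is read once.
        scanned += len(postings)
        result = set(postings) if result is None else result & postings
    return (result if result is not None else set()), scanned


# Scaled-down version of the example: 1,000 documents contain 'Bavaria',
# and 10 of them also contain 'Neuschwanstein'.
documents = {i: "Bavaria" for i in range(1000)}
for i in range(10):
    documents[i] = "Bavaria Neuschwanstein"

index = build_inverted_index(documents)
hits1, cost1 = search_all(index, ["Neuschwanstein"])
hits2, cost2 = search_all(index, ["Bavaria", "Neuschwanstein"])
print(len(hits1), cost1)  # 10 results, 10 identifiers scanned
print(len(hits2), cost2)  # 10 results, 1010 identifiers scanned
```

Both searches return the same ten documents, but under this cost model the two-term search reads roughly a hundred times as many identifiers, and the gap grows with the size of the larger set.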
While the inverted index works well in some applications, there is still room for improvement. For example, when the choice of search terms is limited, other full text search system designs may provide faster query response times.