A full-text indexing system typically extracts content from unstructured text data (usually drawn from a relational database) and constructs one or more full-text indexes, or catalogs containing such indexes, to facilitate efficient and rapid searching. Indexing refers to the overall process of reading text data and creating index entries derived from that text data.
A full-text search engine of a relational database returns results of queries against the indexes built on the unstructured text data. A full-text indexing and search engine thus may gather and filter data and then index the resulting words and properties from the documents into an index or catalog. It may also process queries for specified words and properties and then return references to the documents in the index or catalog that contain the specified items. One common use of a full-text indexing and search engine is a search engine for web sites.
A full-text indexing and search engine typically builds, maintains and queries full-text indexes. Indexing text is typically more complex than indexing values. For example, text being indexed is usually extracted from the database via a protocol component, and filtered by a filtering component to extract the text and values from the source. Text extracted by filters may be passed through wordbreakers to identify lexical constructs and tokenize on word boundaries. These word boundaries, in the English language, are typically whitespace or some form of punctuation. In other languages, such as Chinese, words or characters may be run together or have other semantics that determine word boundaries, so other means of tokenizing must be employed.
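The wordbreaking step described above can be illustrated with a minimal sketch. The function names and the regular expression below are hypothetical and are not taken from any particular engine. The character-bigram fallback is only a simple stand-in for the language-specific segmentation a real Chinese wordbreaker would perform.

```python
import re

# Hypothetical English wordbreaker: tokens are runs of letters or digits,
# optionally followed by an apostrophe suffix (as in "don't").
_WORD_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def break_words_english(text):
    """Return lowercase tokens split on whitespace and punctuation."""
    return [m.group(0).lower() for m in _WORD_RE.finditer(text)]

def break_words_bigrams(text):
    """Naive fallback for scripts without spaces: overlapping character bigrams.

    This is an assumption made for illustration; it does not find true
    word boundaries and does not strip punctuation.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

For instance, break_words_english("Daffodils bloom; don't they?") yields the tokens daffodils, bloom, don't and they, while break_words_bigrams splits a four-character unsegmented string into three overlapping two-character tokens.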
Querying full-text indexes differs somewhat from executing standard relational queries, for much the same reasons that indexing text is more complex than indexing values. To cite just one example, a user who runs a query on “daffodils” probably also would like to see documents that contain the word “daffodil”. Hence, a stemmer is another common component of a full-text search engine. A stemmer is a component that determines the morphological root of a given inflected (or, sometimes, derived) word form. For example, in English, a search for the word “swim” is likely to also return documents containing words like “swimming”, “swam”, “swum”, and so on.
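A toy stemmer makes the idea concrete. The sketch below is hypothetical: the irregular-form table and the two suffix rules are assumptions chosen to cover only the examples in the text, whereas a production stemmer relies on much larger language-specific data.

```python
# Hypothetical table of irregular forms (illustrative only).
IRREGULAR = {"swam": "swim", "swum": "swim"}

def stem(word):
    """Map an inflected English word to an approximate root form."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 5:
        base = w[:-3]
        # Undo consonant doubling ("swimm" -> "swim"), but leave
        # roots that naturally end in a doubled l or s ("tell", "pass").
        if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeiouls":
            base = base[:-1]
        return base
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w
```

With these rules, “daffodils” and “daffodil” both reduce to “daffodil”, and “swimming”, “swam” and “swum” all reduce to “swim”, so a query on any one form can match documents containing the others.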
Query terms are passed to the full-text indexing and search engine, which transforms the query in much the same way as the indexed text was transformed when the index was built, so that the query specification can be compared to the full-text index. The indexes are traversed, and typically a key that references the underlying RDBMS record, together with a rank value, is returned.
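The following sketch shows one way this symmetry between indexing and querying might look. All names are hypothetical. The analyze function stands in for the whole wordbreaker and stemmer pipeline, and the rank is simply a summed term frequency, which is an assumption made for brevity rather than the ranking used by any real engine.

```python
import re
from collections import defaultdict

def analyze(text):
    """Shared pipeline applied to both indexed text and query text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Crude plural folding stands in for a real stemmer.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_index(rows):
    """rows maps a row key to its text; returns term -> {key: term frequency}."""
    index = defaultdict(dict)
    for key, text in rows.items():
        for term in analyze(text):
            index[term][key] = index[term].get(key, 0) + 1
    return index

def search(index, query):
    """Return (key, rank) pairs, highest rank first, ties broken by key."""
    scores = defaultdict(int)
    for term in analyze(query):
        for key, tf in index.get(term, {}).items():
            scores[key] += tf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

rows = {
    1: "Daffodils and more daffodils",
    2: "A single daffodil",
    3: "Roses only",
}
index = build_index(rows)
results = search(index, "daffodils")  # keys 1 and 2 match, key 1 ranks higher
```

Because the query passes through the same analyze function as the indexed text, the plural query term matches rows that contain either the singular or the plural form.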
When a version mismatch occurs between the components used to generate an index and the components used to query the index, unpredictable and undesirable results may occur. For example, changing a wordbreaker without rebuilding or resetting the index may cause a search run today to return different results (because of a change in that wordbreaker's tokenization semantics) than those returned by the same search on the same database yesterday. Typically, whenever a component is changed, for example in an upgrade or service pack, all the full-text indexes must be rebuilt in order to be sure that search requests will return correct results.
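The mismatch problem can be reproduced in a few lines. In this sketch, the two wordbreaker versions are hypothetical and differ only in how they treat hyphens. The index is built with the first version and then queried with the second, without a rebuild.

```python
import re

def wordbreaker_v1(text):
    """Hypothetical original wordbreaker: keeps hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def wordbreaker_v2(text):
    """Hypothetical upgraded wordbreaker: also splits on hyphens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(rows, wordbreaker):
    """Map each token to the set of row keys whose text contains it."""
    index = {}
    for key, text in rows.items():
        for token in wordbreaker(text):
            index.setdefault(token, set()).add(key)
    return index

def search(index, query, wordbreaker):
    """Return the keys of rows that contain every token of the query."""
    tokens = wordbreaker(query)
    if not tokens:
        return set()
    result = None
    for token in tokens:
        postings = set(index.get(token, set()))
        result = postings if result is None else result & postings
    return result

rows = {1: "full-text search", 2: "plain text search"}
index = build_index(rows, wordbreaker_v1)             # built with version 1

yesterday = search(index, "full-text", wordbreaker_v1)  # finds row 1
today = search(index, "full-text", wordbreaker_v2)      # finds nothing
rebuilt = search(build_index(rows, wordbreaker_v2), "full-text", wordbreaker_v2)
```

The index stores the single token “full-text”, but the upgraded wordbreaker turns the query into “full” and “text”, neither of which leads back to row 1 in the old index. Rebuilding the index with the upgraded wordbreaker restores the match.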
Rebuilding indexes can be a painful process for users, especially those users with very large databases. Rebuilding indexes can take days, and while the rebuilding is taking place, full-text search capabilities are typically not accessible. At times, indexes are rebuilt that do not really need to be rebuilt. For example, suppose a new German wordbreaker is shipped in a service pack. Because the service pack includes the wordbreaker, and because the vendor typically does not know what components the customer uses (and in some cases, the customer may not know the full scope of all components that are being used), the customer is likely to be told to rebuild all indexes, even if the customer has no German documents and has never used the German wordbreaker. It would be helpful if there were a way to minimize these and other problems associated with mismatches between build and query components.