Full text indexing and searching are useful capabilities in many applications today, such as in Internet search engines or local or single-site searching. For example, Apache Lucene is an open source full text indexing engine that creates an inverted index by reading documents and tokenizing them. The index is thus the result of extracting terms and identifying various metadata associated with the terms, such as which documents the terms are from and the placement of the terms in the particular documents. Thus, an index comprises a list of terms. Each instance of the occurrence of a term in a document may be associated with the term in the index. For example, an identifier of the document and the location of the occurrence in that document may be associated with the term for each instance of the occurrence of the term.
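To make the structure described above concrete, the following is a minimal sketch of an inverted index in Python. It is not Lucene's implementation; the function names `tokenize` and `build_inverted_index`, the sample documents, and the simple lowercase word tokenizer are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to a list of (document id, token position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            # One posting per occurrence: which document, and where in it.
            index[term].append((doc_id, position))
    return dict(index)


# Hypothetical sample documents, used only for illustration.
documents = {
    "doc1": "Victor Frankenstein built a creature",
    "doc2": "The creature followed Victor",
}
index = build_inverted_index(documents)
```

Here each term in `index` is associated with one entry per occurrence, pairing a document identifier with the position of the term in that document.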
Search engines may utilize such indices to search the documents. For example, Apache Solr is a search framework that uses such an index for full-text searching. In general, Solr receives a query and tokenizes the search terms, identifying Boolean operators and implementing a search of documents using the index created from those documents.
To illustrate, if a query was “Victor Frankenstein,” the search engine might identify “Victor” and “Frankenstein” and build a Boolean search “Victor AND Frankenstein,” and search for documents that contain both terms using the index created from those documents. Similarly, wildcard operators may be employed in certain search engines. These wildcard operators may be used to perform prefix searches, which comprise a prefix followed by a wildcard, and suffix searches, which comprise a wildcard followed by a suffix. If a wildcard operator was employed in a search query for a prefix search, the search engine would generate a search for all documents that contain a term beginning with the prefix. For example, suppose the search was “Frank*,” where “*” is the wildcard operator. In this case, the search engine would generate a search for documents that contain terms which begin with the prefix “Frank.” Such a prefix search may be relatively straightforward to implement, as the index created from the documents being searched may be sorted alphabetically by term. Thus, when searching the documents, the index may be processed relatively quickly to determine which terms begin with “Frank,” and the documents associated with these terms can then be determined.
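The Boolean and prefix searches described above can be sketched as follows. This is an illustrative Python sketch, not the way Solr or Lucene implement these searches; the toy index contents and the function names `search_and`, `prefix_terms`, and `search_prefix` are hypothetical. The sketch assumes lowercase terms and an index that maps each term to a set of document identifiers.

```python
from bisect import bisect_left

# Hypothetical toy index: term -> set of ids of documents containing the term.
index = {
    "frank": {"doc3"},
    "frankenstein": {"doc1", "doc2"},
    "franklin": {"doc3"},
    "monster": {"doc2"},
    "victor": {"doc1", "doc4"},
}
sorted_terms = sorted(index)  # alphabetically sorted list of terms


def search_and(index, terms):
    """Return ids of documents that contain every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def prefix_terms(sorted_terms, prefix):
    """Return terms starting with prefix, found by binary search on the sorted list."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # matching terms are contiguous, so the scan can stop here
        matches.append(term)
    return matches


def search_prefix(index, sorted_terms, prefix):
    """Return ids of documents containing any term that starts with prefix."""
    docs = set()
    for term in prefix_terms(sorted_terms, prefix):
        docs |= index[term]
    return docs


both = search_and(index, ["victor", "frankenstein"])
frank_docs = search_prefix(index, sorted_terms, "frank")
```

Because the terms are sorted, all terms sharing a prefix sit next to one another, so the prefix search only touches the matching run of terms rather than the whole list.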
However, a similar search for a suffix may be considerably more complicated. For example, suppose a suffix search such as “*ing,” where “*” is a wildcard operator, is desired.
In this case, the search engine may perform a search for all documents containing a term ending in “ing.” As the index created from the documents being searched may comprise an alphabetically sorted list of terms, this search may entail, for example, a string comparison of the suffix against every term in the index, beginning with the last letter of the term and the last letter of the suffix (e.g., a reverse string comparison), to locate each term in the index that ends in the suffix of interest. The documents associated with those terms can then be determined. Since indices can include tens, if not hundreds, of thousands of terms, performing such searches can be relatively time-consuming.
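The cost of such a suffix search can be seen in a short sketch. The Python below is an assumption-laden illustration, not any particular search engine's method: the toy index and the names `ends_with` and `search_suffix` are hypothetical, and the counter is included only to show that every term is examined.

```python
def ends_with(term, suffix):
    """Compare term and suffix letter by letter, starting from the last letter of each."""
    if len(suffix) > len(term):
        return False
    for offset in range(1, len(suffix) + 1):
        if term[-offset] != suffix[-offset]:
            return False
    return True


def search_suffix(index, suffix):
    """Return (matching document ids, number of terms examined)."""
    docs = set()
    terms_examined = 0
    # Alphabetical order does not group terms by their endings,
    # so every term in the index has to be checked.
    for term in sorted(index):
        terms_examined += 1
        if ends_with(term, suffix):
            docs |= index[term]
    return docs, terms_examined


# Hypothetical toy index: term -> set of ids of documents containing the term.
index = {
    "indexing": {"doc1"},
    "ingot": {"doc2"},
    "search": {"doc1", "doc3"},
    "searching": {"doc3"},
    "term": {"doc2"},
}
docs, examined = search_suffix(index, "ing")
```

In this sketch the number of terms examined always equals the number of terms in the index, regardless of how many terms actually end in the suffix.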