The following disclosure relates to techniques for indexing terms included in a collection of one or more documents, for example, by including in an inverted list associated with an index term information about pairing the index term with one or more common terms within the collection of documents.
Search engines can be used to locate keywords or phrases in a collection of documents. A search query typically includes one or more keywords, and can be formed, for example, using Boolean logic, or as a phrase, such as by including the search terms in quotation marks. A phrase query requires that two or more terms be located in a particular order within a document. The specificity of a phrase query typically yields a smaller set of more relevant results. Proximity operators used in Boolean logic search queries require two or more search terms to conform to a predefined proximal relationship. For example, a search query may specify that two search terms must occur within five words of each other in a document.
A search engine can evaluate a search query using an inverted index for the collection of documents. An inverted index includes a vocabulary of terms occurring in the documents and an inverted list for each index term. The vocabulary of terms can be arranged in a data structure, such as a B-tree. An inverted list includes one or more postings, where each posting identifies a document in the collection, a frequency of the index term in the identified document, and a list of offsets, which identify positions at which the index term appears in the identified document. For example, a posting in an inverted list for index term t may be configured as follows:
<d, f_{d,t}, [o_1, ..., o_{f_{d,t}}]>
where d identifies a document in the collection, f_{d,t} is the frequency of occurrences of the term t in the document d, and o_1 through o_{f_{d,t}} are offsets identifying positions of the term t in the document d.
A search engine evaluating a query traverses the inverted lists for each index term included in the query. For example, evaluating a query formed using Boolean logic may require traversing more than one list depending on the operator, such as OR (the union of component lists), AND (an intersection of component lists), SUM (the union of component lists), or a proximity operator (an intersection of component lists).
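The operators above can be sketched as set operations over inverted lists. The following Python sketch makes several assumptions that are not part of the disclosure: each inverted list is simplified to a dictionary mapping a document identifier to that document's word offsets, the function names are hypothetical, and the proximity operator is interpreted as "the two terms occur within k word positions of each other, in either order".

```python
# Assumed layout: document identifier -> sorted word offsets of the term.
InvertedList = dict[int, list[int]]


def evaluate_or(a: InvertedList, b: InvertedList) -> list[int]:
    """OR: union of the documents in the component lists."""
    return sorted(a.keys() | b.keys())


def evaluate_and(a: InvertedList, b: InvertedList) -> list[int]:
    """AND: intersection of the documents in the component lists."""
    return sorted(a.keys() & b.keys())


def evaluate_within(a: InvertedList, b: InvertedList, k: int) -> list[int]:
    """Proximity: intersect the lists, then keep documents in which some
    pair of offsets differs by at most k word positions."""
    return [
        d for d in evaluate_and(a, b)
        if any(abs(x - y) <= k for x in a[d] for y in b[d])
    ]


cat = {1: [0, 10], 2: [3]}
dog = {1: [13], 2: [20], 3: [1]}

either = evaluate_or(cat, dog)
both = evaluate_and(cat, dog)
near = evaluate_within(cat, dog, 5)
```

Here documents 1 and 2 contain both terms, but only document 1 has an occurrence of each term within five words of the other, so the proximity operator returns a subset of the AND result.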
Evaluating a phrase query can be achieved by combining the inverted lists for the query terms to identify matching documents. However, the process can be slow, especially if the phrase includes one or more common (frequently occurring) words, which typically have large inverted lists.
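One way to combine inverted lists for a phrase query can be sketched as follows. This is an illustrative Python sketch only: the function name phrase_match is hypothetical, each inverted list is assumed to be a dictionary mapping a document identifier to word offsets, and a phrase is assumed to match when the query terms occupy consecutive word positions in query order.

```python
def phrase_match(lists: list[dict[int, list[int]]]) -> list[int]:
    """Return identifiers of documents containing the query terms
    consecutively and in order. lists[i] is the inverted list of the
    i-th query term (assumed layout: doc id -> word offsets)."""
    if not lists:
        return []
    # Step 1: only documents present in every list can match.
    candidates = set(lists[0])
    for inv in lists[1:]:
        candidates &= set(inv)
    # Step 2: for each candidate, look for a start offset of the first
    # term such that term i + 1 appears at offset start + i + 1.
    matches = []
    for d in sorted(candidates):
        later = [set(inv[d]) for inv in lists[1:]]
        if any(
            all(start + i + 1 in offs for i, offs in enumerate(later))
            for start in lists[0][d]
        ):
            matches.append(d)
    return matches


# Document 1: "to be or not to be"; document 2: "be to"
to = {1: [0, 4], 2: [1]}
be = {1: [1, 5], 2: [0]}

to_be = phrase_match([to, be])
be_to = phrase_match([be, to])
```

Every offset of every query term in every candidate document is examined, which is why common terms with large inverted lists make this approach slow.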
Alternatively, an auxiliary index can be used, for example, an inverted index that indexes common terms and nextword pairs, such as the nextword auxiliary index described by D. Bahle, H. E. Williams and J. Zobel in "Efficient Phrase Querying with an Auxiliary Index", Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, August 2002. This technique requires generating and storing the auxiliary index, which can be 10% of the size of the inverted index if very few common words are indexed, and up to 200% of the size of the inverted index if all firstword-nextword pairs are indexed.
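The general idea of indexing firstword-nextword pairs can be illustrated with a loose Python sketch. It is not the data structure of the cited paper: the function name build_nextword_index, the whitespace tokenization, and the choice to key the index by a (firstword, nextword) tuple are all assumptions made for illustration.

```python
from collections import defaultdict


def build_nextword_index(
    docs: dict[int, str], common_terms: set[str]
) -> dict[tuple[str, str], dict[int, list[int]]]:
    """Index each adjacent word pair whose first word is a common term.
    The result maps (firstword, nextword) -> {doc id: offsets of firstword}."""
    index: dict = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        words = docs[doc_id].lower().split()
        # zip pairs each word with its successor; offset is the position
        # of the first word of the pair.
        for offset, (first, nxt) in enumerate(zip(words, words[1:])):
            if first in common_terms:
                index[(first, nxt)][doc_id].append(offset)
    return {pair: dict(by_doc) for pair, by_doc in index.items()}


docs = {1: "the cat and the dog", 2: "feed the dog"}
pairs = build_nextword_index(docs, common_terms={"the"})
```

A phrase such as "the dog" can then be answered from the short list stored for that pair rather than from the much larger inverted list of "the". The cost is the extra storage for the pair lists, which grows with the number of common terms chosen as firstwords.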
A technique for evaluating search queries including common terms is 'stopping', where common terms are identified as stopwords and ignored when evaluating a search query. Ignoring stopwords can speed up the evaluation process, since fewer inverted lists need to be found, retrieved from disk, and processed. However, ignoring a search term, particularly in a phrase query, can compromise search results and may be unacceptable in some applications.
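The effect of stopping on a query can be illustrated with a short Python sketch. The stopword list and the function name strip_stopwords are hypothetical and chosen only for this example; a real system would derive its stopwords from term frequencies in the collection.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"a", "be", "from", "not", "of", "or", "the", "to"})


def strip_stopwords(query: str, stopwords: frozenset[str] = STOPWORDS) -> list[str]:
    """Drop stopwords so that their inverted lists need not be fetched."""
    return [term for term in query.lower().split() if term not in stopwords]


outbound = strip_stopwords("flights to London")
inbound = strip_stopwords("flights from London")
all_stopped = strip_stopwords("to be or not to be")
```

Under these assumptions, the two flight queries reduce to the same pair of terms and become indistinguishable, and the last phrase reduces to an empty query, which illustrates how stopping can compromise phrase query results.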