The following disclosure relates to techniques for indexing terms included in a collection of one or more documents.
Search engines can be used to locate keywords or phrases in a collection of documents. A search query typically includes one or more keywords, and can be formed, for example, using Boolean logic, or as a phrase, such as by including the search terms in quotation marks. Examples of commonly used Boolean operators include AND, OR and NOT. A phrase query requires that two or more terms be located in a particular order within a document. Proximity operators used in Boolean logic search queries require two or more search terms to conform to a predefined proximal relationship. For example, a search query may specify that two search terms must occur within five words of each other in a document.
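By way of illustration only, a proximity constraint of the kind described above can be sketched as follows. The function name `within_k_words`, the whitespace tokenization, and the reading of "within k words" as a position difference of at most k are assumptions made for this sketch and are not taken from the disclosure.

```python
def within_k_words(text: str, a: str, b: str, k: int) -> bool:
    """Return True if terms a and b occur within k word positions of each other.

    Assumption: words are separated by whitespace and compared case-insensitively.
    """
    tokens = text.lower().split()
    pos_a = [i for i, tok in enumerate(tokens) if tok == a.lower()]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b.lower()]
    # Brute-force comparison of every pair of positions; adequate for a sketch.
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)
```

This version scans the document text directly; a search engine would more typically answer the same question from the offsets stored in an inverted index.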
A search engine can evaluate a search query using an inverted index for the collection of documents. An inverted index includes a vocabulary of terms occurring in the documents and an inverted list for each index term. The vocabulary of terms can be arranged in a data structure, such as a B-tree. An inverted list includes one or more postings, where each posting identifies a document in the collection, a frequency of the index term in the identified document, and a list of offsets, which identify positions at which the index term appears in the identified document. For example, a posting in an inverted list for index term t may be configured as follows:

$$\langle d,\ f_{d,t},\ [o_1, \ldots, o_{f_{d,t}}] \rangle$$

where d identifies a document in the collection, $f_{d,t}$ is the frequency of occurrences of the term t in the document d, and $o_1$ through $o_{f_{d,t}}$ are offsets identifying positions of the term t in the document d.
A search engine evaluating a query traverses the inverted lists for index terms included in the query. For example, evaluating a query formed using Boolean logic may require traversing more than one list depending on the operator, such as OR (the union of component lists), AND (an intersection of component lists), SUM (the union of component lists), or a proximity operator (an intersection of component lists).
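As one possible sketch of the list traversal described above, the AND and OR operators can be evaluated by merging two inverted lists. The sketch assumes each list has been reduced to a sorted sequence of document identifiers; the function names `intersect` and `union` are illustrative only.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """AND: document identifiers present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a: list[int], b: list[int]) -> list[int]:
    """OR: document identifiers present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of the two lists still has unread entries.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

Both functions read each list once from front to back, which is why keeping inverted lists sorted by document identifier is convenient for Boolean evaluation.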
Evaluating a phrase query can be achieved by combining the inverted lists for the query terms to identify matching documents. Alternatively, an auxiliary index can be used, for example, an inverted index that indexes common terms and nextword pairs. ‘Stopping’ is a technique for evaluating search queries including common terms, where common terms are identified as stopwords and ignored when evaluating a search query.
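Combining inverted lists to evaluate a phrase query, as described above, might look like the following minimal sketch. It assumes a simplified index that maps each term to a dictionary from document identifier to a list of word offsets; the function name `phrase_match` is hypothetical, and no auxiliary nextword index or stopword handling is shown.

```python
def phrase_match(index: dict[str, dict[int, list[int]]], phrase: list[str]) -> list[int]:
    """Return identifiers of documents containing the phrase terms consecutively."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Step 1: only documents containing every term can match.
    candidates = set(index[phrase[0]])
    for term in phrase[1:]:
        candidates &= set(index[term])
    # Step 2: require the terms at consecutive offsets within each candidate.
    matches = []
    for doc in sorted(candidates):
        later = [set(index[term][doc]) for term in phrase[1:]]
        if any(
            all(start + k + 1 in offs for k, offs in enumerate(later))
            for start in index[phrase[0]][doc]
        ):
            matches.append(doc)
    return matches
```

The first step is the same intersection used for an AND query; the offsets are consulted only for documents that survive it.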
‘Skipping’ is a technique to improve query evaluation performance by including synchronization points (skip entries) in a compressed inverted list, to provide additional locations at which decompressing can commence. Skipping allows a relevant portion of a compressed list to be identified and decompressed, without decompressing the entire list.
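The effect of skip entries can be illustrated with the following sketch, offered under stated assumptions rather than as a definitive implementation. Compression is modeled as simple gap (difference) encoding of sorted document identifiers, a skip entry is recorded every `block` postings, and the names `compress` and `contains` are hypothetical. A practical system would additionally pack the gaps with a variable-length code.

```python
from bisect import bisect_right


def compress(doc_ids: list[int], block: int) -> tuple[list[int], list[tuple[int, int]]]:
    """Gap-encode sorted document identifiers and record one skip entry per block."""
    gaps, skips = [], []
    prev = 0
    for i, d in enumerate(doc_ids):
        if i % block == 0:
            skips.append((d, i))  # (document identifier, index of its gap)
        gaps.append(d - prev)
        prev = d
    return gaps, skips


def contains(gaps: list[int], skips: list[tuple[int, int]], target: int) -> tuple[bool, int]:
    """Return (found, number of gaps decoded) when searching for target."""
    # Locate the last skip entry whose document identifier is <= target.
    k = bisect_right([d for d, _ in skips], target) - 1
    if k < 0:
        return False, 0
    current, start = skips[k]
    end = skips[k + 1][1] if k + 1 < len(skips) else len(gaps)
    decoded = 0
    i = start + 1
    # Decode only within the block that could contain the target.
    while current < target and i < end:
        current += gaps[i]
        decoded += 1
        i += 1
    return current == target, decoded
```

In this sketch a lookup decodes fewer than `block` gaps regardless of the length of the list, whereas decoding without skip entries would have to start from the first gap.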