1. Field of the Invention
The present invention relates to a method, system and computer program product implementing the method, for processing text search queries in a collection of documents. In particular, the present invention relates to a method, system and computer program product, for processing text search queries which are restricted to a defined document part, for example, to a field such as the title or abstract of a document.
2. Description of the Related Art
The purpose of a text search query is typically to find those documents in a collection of documents that fulfill certain criteria, called search conditions, such as those documents which contain certain words. In many cases, the “relevance” of documents fulfilling the given search conditions has to be calculated as well. Most often, users are only interested in seeing the “best” documents which result from a text search query. Because the size of document collections to be searched is constantly increasing, the efficiency of text search query processing becomes an ever more important issue.
Text search query processing for a fulltext search is typically based on “inverted indexes”. To generate inverted indexes for a collection of documents, all documents are analyzed to identify the occurring words or search terms as index terms together with their positions in the documents. In an “inversion step,” this information is sorted so that the index term becomes the first-order criterion. The result is stored in a full posting index comprising essentially two parts. The first part, also called the dictionary, is a data structure for fast lookup of all index terms that have been encountered during indexing, whereas the second part stores all occurrence information as a pool of full posting lists. Each dictionary entry, that is, each index term, contains a reference to a full posting list enumerating all occurrences of the index term in all documents of the collection. Typically, the posting lists are coded and compressed for storage.
FIG. 1 illustrates an example of a collection of documents 100 and a corresponding full posting index 200. The collection of documents 100 comprises three text documents doc1, doc2 and doc3. Each of these documents is a sequence of index terms a, b, c and d. For example, doc1 can be expressed as the following XML-fragment:
<document><title>a<bold>b</bold></title><body>c<bold>b</bold>d</body></document>
The index terms a, b, c and d form a dictionary, that is, the set of index terms on which the full posting index 200 is based. The full posting index 200 comprises a full posting list for each index term a, b, c and d, enumerating all occurrences of the corresponding index term in all documents doc1, doc2 and doc3 of the collection. In this example, the occurrences of an index term are grouped by document.
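The structure described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular search engine: the function name `build_full_posting_index` and the nested-dictionary layout are assumptions made for illustration, positions are assumed to be 1-based, and only doc1 is used because the contents of doc2 and doc3 are not reproduced in the text. Coding and compression of the posting lists are omitted.

```python
from collections import defaultdict


def build_full_posting_index(docs):
    """Return {index_term: {doc_id: [positions]}} with 1-based positions.

    The outer dictionary plays the role of the dictionary of index terms;
    each value is a full posting list grouped by document.
    """
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms, start=1):
            # "Inversion": the index term becomes the primary key.
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# doc1 as the term sequence given by the XML fragment above (markup removed).
docs = {"doc1": ["a", "b", "c", "b", "d"]}
index = build_full_posting_index(docs)
```

In this sketch, `index["b"]` holds the occurrences of "b" in doc1 at positions 2 and 4.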
For example, the full posting index 200 can be used to process the following query: “find all documents containing the phrase ‘a b’”. To do so, the search engine looks up all positions for “a” and all positions for “b”. It then checks two conditions: whether “a” and “b” occur in the same document, and whether “b” occurs at the position immediately after “a”.
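The two checks of such a phrase query can be sketched as follows. The function name `phrase_matches` and the index layout (index term, then document, then a list of 1-based positions) are assumptions made for illustration, and the postings shown cover doc1 only.

```python
def phrase_matches(index, first, second):
    """Find occurrences of the two-term phrase "<first> <second>".

    Returns a list of (doc_id, start_position, end_position) tuples.
    """
    matches = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id, first_positions in first_postings.items():
        # Condition 1: both terms occur in the same document.
        second_positions = set(second_postings.get(doc_id, []))
        for position in first_positions:
            # Condition 2: the second term immediately follows the first.
            if position + 1 in second_positions:
                matches.append((doc_id, position, position + 1))
    return matches


# Postings for doc1 only (illustrative layout, 1-based positions).
index = {
    "a": {"doc1": [1]},
    "b": {"doc1": [2, 4]},
    "c": {"doc1": [3]},
    "d": {"doc1": [5]},
}
matches = phrase_matches(index, "a", "b")
```

For doc1, the sketch reports a single match spanning positions 1 to 2.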
An important feature of text search engines is the ability to restrict searches to certain document parts, for example, fields, such as the title, abstract, body, etc., which are known at indexing time. A field of a document is conceptually viewed as a subset of the positions of the words in a document. Thus, it is possible to define continuous as well as discontinuous fields, for example, the “field” of all highlighted words or the “field” of all figure captions, and also overlapping fields, for example, all highlighted words in the title field. The fraction of all positions, across all documents of a given collection, that lie inside a given field is called the coverage of the field in the collection of documents.
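Treating a field as a set of positions, the coverage can be computed as in the following sketch. The function name `coverage` and the input layout are assumptions made for illustration. The example uses only doc1, whose five positions and field assignments follow the XML fragment above, so the resulting values describe that single document rather than the whole collection of FIG. 1.

```python
def coverage(field_positions, doc_lengths):
    """Fraction of all positions in the collection that lie inside a field.

    field_positions: {doc_id: set of 1-based positions belonging to the field}
    doc_lengths:     {doc_id: number of positions in the document}
    """
    total = sum(doc_lengths.values())
    if total == 0:
        return 0.0
    inside = sum(len(field_positions.get(doc_id, set())) for doc_id in doc_lengths)
    return inside / total


doc_lengths = {"doc1": 5}
title = {"doc1": {1, 2}}        # continuous field
highlighted = {"doc1": {2, 4}}  # discontinuous field, overlaps the title

title_coverage = coverage(title, doc_lengths)
highlighted_coverage = coverage(highlighted, doc_lengths)
```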
In order to process queries comprising field restrictions, information about the extent of each field has to be added to the index. It is state of the art to use special posting lists for fields stored in a field posting index. Such a field posting index comprises a set of fields and a field posting list for each field of the set, enumerating the start and end positions of the continuous parts of the field in all documents of the collection.
In the example of FIG. 1 each document of the collection 100 starts with a “title” field, marked by underline, whereas the remainder is defined as “body” field. A third field is defined as “highlighted” and contains all highlighted, that is, bold face, words of the corresponding document.
Consequently, the corresponding field posting index 300 comprises the dictionary entries “title”, “body” and “highlighted” and the corresponding field posting lists enumerate the start and end positions of the continuous parts of each field in all documents of the collection. In this example, the occurrences of a field are grouped by document.
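As an illustration, the following sketch derives field posting lists of this kind from per-field position sets. The names `to_extents` and `build_field_posting_index` are hypothetical, positions are assumed to be 1-based, and the data covers doc1 only, with its field assignments taken from the XML fragment above.

```python
def to_extents(positions):
    """Collapse a set of positions into sorted (start, end) continuous runs."""
    extents = []
    for position in sorted(positions):
        if extents and position == extents[-1][1] + 1:
            # Extend the current run by one position.
            extents[-1] = (extents[-1][0], position)
        else:
            # Start a new continuous part of the field.
            extents.append((position, position))
    return extents


def build_field_posting_index(fields):
    """Return {field: {doc_id: [(start, end), ...]}}, grouped by document."""
    return {
        field: {doc_id: to_extents(pos) for doc_id, pos in per_doc.items()}
        for field, per_doc in fields.items()
    }


fields = {
    "title": {"doc1": {1, 2}},
    "body": {"doc1": {3, 4, 5}},
    "highlighted": {"doc1": {2, 4}},
}
field_index = build_field_posting_index(fields)
```

The discontinuous "highlighted" field yields two single-position extents for doc1, whereas "title" and "body" each yield one extent.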
This field posting index 300 can be used to process queries comprising a field restriction, such as, “find all documents containing the phrase ‘a b’ in the title field”. Typically, this is done by first searching for candidates fulfilling the unrestricted query. In this example, one match is found in doc1 from position 1 to position 2. Only then does the search engine check the positions of the match against the positions stored in the field posting index 300, namely in the field posting list of the field “title”. In this example, the match is contained in the title extent of doc1, hence yielding one hit for the complete query.
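The second step, checking candidate matches against a field posting list, can be sketched as follows. The function names `inside_field` and `restrict_to_field` are hypothetical, and the sketch assumes that a match counts as a hit only if it lies entirely within one continuous part of the field; the text does not specify how matches spanning several parts are treated.

```python
def inside_field(match, field_posting_list):
    """Check whether a match lies within one continuous part of the field."""
    doc_id, start, end = match
    extents = field_posting_list.get(doc_id, [])
    return any(ext_start <= start and end <= ext_end
               for ext_start, ext_end in extents)


def restrict_to_field(candidates, field_posting_list):
    """Filter the result set of the unrestricted query by field extent."""
    return [match for match in candidates
            if inside_field(match, field_posting_list)]


# Candidate from the unrestricted phrase query "a b": doc1, positions 1 to 2.
candidates = [("doc1", 1, 2)]
title_postings = {"doc1": [(1, 2)]}
body_postings = {"doc1": [(3, 5)]}

title_hits = restrict_to_field(candidates, title_postings)
body_hits = restrict_to_field(candidates, body_postings)
```

The candidate survives the restriction to the title field but is rejected for the body field. In both cases the filtering is extra work on top of the unrestricted query.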
The example described above illustrates a general technique to process a query comprising field restrictions. In a first step the corresponding non-restricted query is processed using the full posting index. Then, some form of additional filtering is applied on the result set of the non-restricted query using the field posting index. This additional checking of field restrictions leads to an overall query runtime exceeding the query runtime of the corresponding non-restricted query. In other words, searching entire documents is often faster than searching on defined document parts. This is contrary to the user's expectation that a query on small parts of documents should perform better than a query on entire documents.