Search engines provide important tools for retrieving information from digitized text documents. They may be used in stand-alone mode or as components of more complex information retrieval software solutions, e.g., text mining or internet portal software. Because the amount of digitized text data to be searched is growing strongly, excellent performance and scalability are essential for search engines, especially during query operations.
The fundamental data structure of search engines is based on indexed collections of text documents. Before search queries are applied to a text database or a collection of text documents, an indexing process is performed. During such an indexing process, each text document of the database is analyzed in order to identify search terms. The result is an assignment between each text document and the distinct search terms that are contained in the document. This assignment is preferably inverted in the form of a posting list for each search term. Typically, a posting list for a given search term contains a list of document identifiers corresponding to the documents containing this particular search term.
FIG. 1 shows a block diagram of a list of conventional text documents, a dictionary, and a corresponding set of posting lists.
Each text document 10, 12, 14, 16 comprises a list of words. For example, document 10 has the words: “computer”, “bit” and “byte”. Document 12 has the words: “memory” and “byte”, etc.
The dictionary 20 has an entry 22, 24, 26, 28 for each distinct word appearing in one of the documents 10, 12, 14, 16. For example, the word “bit” is in document 10 and in document 14. Hence, it appears twice in the list of documents 10, 12, 14, 16.
The entry 22 of the dictionary 20 indicates that the word “bit” appears twice in the list of documents 10, 12, 14, 16. Similarly, the word “computer” appears three times as indicated by the entry 26 of the dictionary 20.
The posting lists 30, 32, 34, 36 represent an inverted dictionary for each single word that appears in the list of documents. For example, posting list 32 indicates that the word “bit” appears in document 10 and in document 14, as indicated by the corresponding document identifiers that are stored as list entries in the posting list.
The posting list 36 indicates that the word “computer” appears in the documents 10, 14, and 16 and therefore points to these documents. Performing a search query is typically based on such posting lists, thus enabling efficient and fast processing of search queries.
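The relationship between documents, dictionary, and posting lists described for FIG. 1 can be sketched as follows. This is only an illustration: the function name `build_index` is hypothetical, and the contents of documents 14 and 16 are assumed (the text states only that “bit” occurs in document 14 and “computer” in documents 14 and 16).

```python
from collections import defaultdict


def build_index(documents):
    """Build a dictionary and posting lists from a mapping doc_id -> list of words.

    Hypothetical helper for illustration; not taken from the source text.
    """
    posting_lists = defaultdict(list)
    # Visit documents in ascending identifier order so each posting list is sorted.
    for doc_id in sorted(documents):
        # set() ensures a document is recorded once per distinct term.
        for term in set(documents[doc_id]):
            posting_lists[term].append(doc_id)
    # The dictionary stores, per term, the number of documents containing it.
    dictionary = {term: len(ids) for term, ids in posting_lists.items()}
    return dictionary, dict(posting_lists)


documents = {
    10: ["computer", "bit", "byte"],
    12: ["memory", "byte"],
    14: ["bit", "computer"],  # assumed contents, consistent with the text
    16: ["computer"],         # assumed contents, consistent with the text
}

dictionary, posting_lists = build_index(documents)
print(posting_lists["bit"])       # document identifiers containing "bit"
print(posting_lists["computer"])  # document identifiers containing "computer"
```

A query for a single term then reduces to a lookup in `posting_lists`, which is why the layout and encoding of these lists dominate query cost.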
In general, the posting lists are compressed to save disc space and to reduce input and output (IO) traffic. The structure of these compressed posting lists, and the performance of an associated decoding or decompressing algorithm, are critical for the query response times.
The posting lists of search engines contain at least the document identifier and possibly even the position of a search term within the document. Additionally, other data associated with search terms may be stored in the dictionary.
One approach for compressing a posting list is the delta encoding procedure. When a search term appears in six documents of the indexed collection of text documents and these documents are, for example, numbered 4, 6, 9, 12, 48, 70, the corresponding search term can be described in the simplest case by an inverted file associated with the following posting list: (4, 6, 9, 12, 48, 70). Because such a list is in ascending order, the list can be stored as the initial element followed by a list of the differences between each element and its predecessor. Applying such a delta encoding procedure to the above mentioned list would result in: (4, 2, 3, 3, 36, 22).
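The delta encoding of the example list, and the corresponding decoding by running sums, can be sketched as follows. The function names `delta_encode` and `delta_decode` are chosen here for illustration only.

```python
from itertools import accumulate


def delta_encode(doc_ids):
    """Store the first identifier, then the gap to each following identifier.

    Assumes doc_ids is sorted in ascending order.
    """
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def delta_decode(deltas):
    """Recover the original identifiers as the running sum of the gaps."""
    return list(accumulate(deltas))


posting_list = [4, 6, 9, 12, 48, 70]
encoded = delta_encode(posting_list)
print(encoded)                # [4, 2, 3, 3, 36, 22]
print(delta_decode(encoded))  # [4, 6, 9, 12, 48, 70]
```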
The advantage of such a representation is that on average substantially fewer bits per list element are necessary to encode it. Especially, when the numbers of a posting list corresponding to the document identifiers become rather large, these numbers may require 16 or even more bits of disc space in order to be stored in an un-encoded way. Therefore, when a posting list contains numerous document identifiers, storing the difference between successive document identifiers of the posting list appreciably reduces the required disc space.
Upon application of a delta decoding procedure, it is also possible to selectively decode only designated list entries or parts of the list rather than applying the decoding procedure to the entire list. Usually, delta decoding as well as delta encoding techniques are supplemented by methods to provide effective means for selectively decoding and encoding particular list entries.
In order to exploit the advantages of a delta encoding procedure, it is reasonable to store the list entries of an encoded posting list in buckets of variable size depending on the number of bits to be encoded. Regarding the above mentioned list, each of the first four list entries could be stored in a 3 bit bucket (the largest of them, 4, requires 3 bits) and the last two entries could be stored in a 6 bit bucket (the larger of them, 36, requires 6 bits). Since the compressed delta encoded posting list has to be decoded, it is advantageous to limit the number of different bucket sizes, because each bucket size usually requires its own decoding routine. It is therefore of practical use to store the list entries of a posting list in buckets with e.g., 4 bits, 8 bits, 16 bits, etc.
This allows the storage of list entries of various sizes by means of a discrete number of buckets. For example, making use of an ensemble of three buckets with 4 bit, 8 bit and 16 bit, the 4 bit bucket is used for storage of list entries having a size smaller than or equal to 4 bits. The 8 bit bucket is used for storage of list entries requiring between 5 and 8 bits of storage size and the 16 bit bucket is appropriate to store entries having a size between 9 and 16 bits.
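The assignment of list entries to a small, fixed set of bucket sizes can be sketched as follows, using the three bucket sizes named above. The names `BUCKET_WIDTHS` and `bucket_width` are hypothetical, and the sketch ignores any per-entry overhead needed to record which bucket was used.

```python
# Bucket sizes in bits, as in the example with three buckets.
BUCKET_WIDTHS = (4, 8, 16)


def bucket_width(value):
    """Return the smallest bucket size (in bits) that can hold value."""
    # int.bit_length() gives the number of bits needed; 0 still occupies one bit.
    bits = max(1, value.bit_length())
    for width in BUCKET_WIDTHS:
        if bits <= width:
            return width
    raise ValueError(f"{value} does not fit into the largest bucket")


deltas = [4, 2, 3, 3, 36, 22]
widths = [bucket_width(d) for d in deltas]
print(widths)       # [4, 4, 4, 4, 8, 8]
print(sum(widths))  # 32 bits of payload, versus 96 bits at a fixed 16 bits per entry
```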
Making use of encoded posting lists featuring several buckets for the posting list entries on the one hand reduces the overall size of the posting list but on the other hand it requires an increase of operations in order to decode the list entries of a posting list.
In the simple case of sequentially decoding a complete posting list, the decoding algorithm for a single list entry may look as follows:
        get the size of the current list entry,
        if the current list entry is smaller than 16, then decode it with a 4 bit decoding routine,
        else, if the current list entry is smaller than 256, then decode it with an 8 bit decoding routine,
        else, if the current list entry is smaller than 65536, then decode it with a 16 bit decoding routine.
This example illustrates that a multiplicity of “else if” statements has to be evaluated in order to decode a single list entry appropriately.
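The per-entry branching described above can be sketched as follows. This is a simplified model, not an actual on-disc format: each encoded entry is assumed to be a `(width, payload)` pair, and the three decoding routines are stand-ins that merely mask the payload to the bucket size. All names are hypothetical.

```python
def decode_4bit(payload):
    return payload & 0xF


def decode_8bit(payload):
    return payload & 0xFF


def decode_16bit(payload):
    return payload & 0xFFFF


def decode_entry(width, payload):
    """Select the decoding routine with a chain of comparisons, one entry at a time."""
    if width <= 4:
        return decode_4bit(payload)
    elif width <= 8:
        return decode_8bit(payload)
    elif width <= 16:
        return decode_16bit(payload)
    raise ValueError(f"unsupported bucket size: {width}")


def decode_posting_list(entries):
    """Sequentially decode delta encoded entries into document identifiers."""
    doc_ids = []
    current = 0
    for width, payload in entries:
        # Every entry pays for the branch chain in decode_entry.
        current += decode_entry(width, payload)
        doc_ids.append(current)
    return doc_ids


# The example list (4, 2, 3, 3, 36, 22) stored in 4 bit and 8 bit buckets.
encoded = [(4, 4), (4, 2), (4, 3), (4, 3), (8, 36), (8, 22)]
print(decode_posting_list(encoded))  # [4, 6, 9, 12, 48, 70]
```

In this sketch an entry in the largest bucket passes through three comparisons before its routine is reached, which is the per-entry overhead the text refers to.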
Because the decoding process described above has to be performed on at least parts of an entire posting list that may contain millions of entries, the process of query execution becomes extremely time critical and every instruction saved in the decoding procedure will result in a significant decrease of the query response times.
The present invention therefore aims to provide a method of enhancing decoding performance of text indexes.