1. Field of the Invention
The present invention relates to computer-based electronic Information Retrieval (IR). In particular, it relates to an electronic IR method and system having an indexer module that uses an inverted index comprising potential search items and associated posting lists.
2. Description and Disadvantages of Prior Art
The basic structure and function of prior art IR systems are illustrated in FIG. 1.
The system includes, among other elements of minor relevance to the present invention, a search engine comprising a web crawler module 10, a parser or tokenizer module 12, an indexer module 14, and an index storage 16 that stores data according to a logical scheme comprising search items as mentioned above. The system further includes a ranking module 18, a search module 20, and a client that issues queries to the IR system and receives results from it.
In particular, a search pool of documents (the Internet or another collection) is crawled independently of user queries, and the crawled documents are indexed in a data structure, for instance the aforementioned “inverted index”. Each row of this index holds an index entry composed of a potential search item and an associated posting list. The posting list contains document-identifying information stating in which document the search item is found and, optionally, further information on the location within the respective document where the search item occurs. The search module 20 accesses a copy of the index 16, as indicated by the arrow in FIG. 1.
FIG. 2 depicts a “sectional” view of two single entries within the aforementioned inverted index data structure. The left column defines the so-called vocabulary and comprises possible search items 22, for example “IBM” or “SERVER”. The right column is known as the posting list 24. A posting list entry 26 for a search item includes:
a) document-identifying information, for example a number or a URL, and optionally further information such as
b) an offset from the beginning of the respective document.
For “IBM”, for example, the first entry in the posting list relates to document ID 0003, page 52 thereof and line 13 thereof. The other references and entries in the posting list depicted in FIG. 2 are to be interpreted similarly.
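The two-column structure described above can be illustrated with a minimal Python sketch. It is not taken from the prior art systems themselves: the function names `tokenize`, `build_inverted_index` and `lookup`, the sample documents, and the use of a token position as the offset (rather than a page and line) are all assumptions made purely for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into upper-cased tokens (a simplistic, assumed tokenizer)."""
    return text.upper().split()


def build_inverted_index(documents):
    """Map each search item to a posting list of (doc_id, offset) entries.

    The offset is the token position within the document, which is an
    assumption of this sketch; FIG. 2 uses a page and a line instead.
    """
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for offset, token in enumerate(tokenize(documents[doc_id])):
            index[token].append((doc_id, offset))
    return dict(index)


def lookup(index, item):
    """Return the posting list for a search item, or an empty list."""
    return index.get(item.upper(), [])


# Hypothetical sample documents keyed by document ID.
documents = {
    3: "IBM ships a new server",
    7: "the server runs IBM software",
}
index = build_inverted_index(documents)
print(lookup(index, "IBM"))     # posting list for the search item "IBM"
print(lookup(index, "server"))  # posting list for the search item "SERVER"
```

In this sketch the dictionary keys play the role of the vocabulary (left column) and the lists of tuples play the role of the posting lists (right column).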
With respect to the particular focus of the present invention, a general issue of prior art IR systems as described above is the size of their data structures, e.g. the dictionary or “vocabulary” entries, i.e. the left column in FIG. 2. When these data items are too big, the system suffers from a low data cache hit rate and high traffic between system memory and the CPU, which forms a well-known performance bottleneck. In the worst case, disk input and output (I/O) becomes a third stage that makes the bottleneck even longer and narrower, since the count and size of the data items can exceed the available hardware memory. IR systems and search engines have to process a very large number of such data items, including dictionary entries, posting list entries and statistical information related thereto. As this bottleneck is used more heavily during query execution as depicted in FIG. 1, query performance degrades intolerably.
Therefore, any approach that shifts resource consumption from the memory and I/O subsystems to the CPU, and thereby avoids intensive use of the bottleneck, is generally welcome, since CPU speed is increasing at a higher rate than memory or I/O subsystem bandwidth. One such prior art approach is based on the general idea of reducing disk I/O by compressing the data items in memory before they are written to disk; see I. H. Witten, A. Moffat, T. C. Bell: Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition, Morgan Kaufmann, Inc. 1999.
However, this approach suffers from the drawback that the data must be read back into memory for decompression, which requires additional memory and CPU cycles and at least partially offsets the savings in disk I/O. Thus, it is not a satisfactory solution to the bottleneck problem described above.
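The compress-before-write approach and its drawback can be sketched as follows. This is only an illustration under stated assumptions: Python's standard `zlib` module is used as a generic stand-in, since no specific compression scheme is named here, and the fixed-width serialization as well as the helper names `serialize_postings`, `deserialize_postings`, `compress_for_disk` and `load_from_disk` are hypothetical.

```python
import struct
import zlib


def serialize_postings(postings):
    """Pack (doc_id, offset) pairs as little-endian unsigned 32-bit integers."""
    flat = [value for pair in postings for value in pair]
    return struct.pack(f"<{len(flat)}I", *flat)


def deserialize_postings(raw):
    """Inverse of serialize_postings."""
    count = len(raw) // 4
    flat = struct.unpack(f"<{count}I", raw)
    return list(zip(flat[0::2], flat[1::2]))


def compress_for_disk(postings):
    """Compress the in-memory posting list before it would be written to disk."""
    return zlib.compress(serialize_postings(postings))


def load_from_disk(blob):
    """Read path: the full uncompressed buffer must be rebuilt in memory,
    which costs additional memory and CPU cycles."""
    raw = zlib.decompress(blob)
    return deserialize_postings(raw)


# Hypothetical posting list: 1000 documents, each with the same offset.
postings = [(doc_id, 13) for doc_id in range(1000)]
blob = compress_for_disk(postings)
restored = load_from_disk(blob)

print(len(serialize_postings(postings)), "bytes uncompressed")
print(len(blob), "bytes compressed")
```

The write path saves disk I/O because `blob` is smaller than the serialized data, but every read has to pay for `zlib.decompress` and for the temporary uncompressed buffer, which is the offsetting cost described above.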