1. Field of the Invention
The present invention relates to a method of compressing an inverted index used for high-speed full text search on a large-scale document set, and also relates to a search method that uses the inverted index.
2. Description of the Related Art
In modern society, an enormous number of electronic documents, which are essential for daily business, are generated every day, and the number continues to increase. The expansion of the Internet is one factor behind this rapid increase. As the volume of electronic documents has grown, document search techniques for finding useful information in huge document collections in a short period of time have become a necessity, and their technical development has advanced.
FIG. 1 is a schematic view of a document set to be searched, and of an inverted index constructed on the basis of the document set. A typical document search technique searches a given document set 101 for the documents that include given keywords, in as short a time as possible. A data structure called an inverted index 102 is used for this purpose. When a word 103 occurs in a document 104, a pair composed of a document identification number 105 (referred to as docID in the rest of this specification) and a number of occurrences 106 (referred to as frequency in the rest of this specification) of the word in the document is referred to as a posting 107. A data structure storing the list of all postings for every word in the search target document set is called the inverted index 102.
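The structure described above can be illustrated with a minimal sketch in Python. The function name build_inverted_index, the whitespace-based word splitting, and the use of list positions as docIDs are assumptions made only for this illustration and do not come from the specification.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each word to its list of (docID, frequency) postings.

    Assumptions for illustration only: the docID is the 0-based position
    of a document in the input list, and words are obtained by
    lowercasing the text and splitting it on whitespace.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Count how many times each word occurs in this document.
        for word, frequency in Counter(text.lower().split()).items():
            index[word].append((doc_id, frequency))
    return dict(index)


documents = ["the cat sat on the mat", "the dog", "a cat and a dog"]
index = build_inverted_index(documents)
```

In this sketch, index["the"] holds the postings (0, 2) and (1, 1), since the word occurs twice in document 0 and once in document 1. Because documents are processed in order, each list of postings is sorted by docID.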
Although frequencies are sometimes omitted from postings depending on the application, the inverted index of the present specification includes them, because they are often required to calculate an importance score of a retrieved document. Meanwhile, although a posting sometimes additionally includes the locations where the corresponding word occurs, the technique of the present specification is applied to the portion of each posting consisting of a docID and a frequency.
Unlike languages such as English, French and Spanish, in which words are separated by spaces, it is difficult to divide a sentence accurately into words in Japanese, Korean and Chinese. Hence, in some cases an inverted index is constructed by using, instead of a word, an arbitrary substring of n contiguous characters, called an n-gram, which is known to be effective (See “Jyouhou kensaku arugorizumu (information search algorithm)” Kenji Kita and others, Kyoritsu Shuppan co., ltd). Here, n is mainly an integer from 1 to 10. Both a word and a substring of length n are referred to as an indexing term in the present specification.
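As a minimal illustration of n-gram indexing terms, the following Python sketch enumerates every substring of n contiguous characters of a text. The function name ngrams and the sample string are hypothetical and are used only for this example.

```python
def ngrams(text, n):
    """Return every substring of n contiguous characters in text, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A text shorter than n yields no n-grams (the range is empty).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The 2-grams of a four-character Japanese string.
bigrams = ngrams("情報検索", 2)
```

Here bigrams contains the three overlapping substrings 情報, 報検 and 検索, each of which could serve as an indexing term without any word segmentation.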
A docID and a frequency constituting a posting are each represented by an integer value. In the case where an integer value is represented as a byte sequence of fixed length without being compressed, 4 bytes are necessary to represent a single integer value if a numeric value is represented in 32 bits. However, there are several known techniques by which a numeric value is represented in fewer than 32 bits. The Elias gamma code (Non-patent Document 1) is a method by which a small numeric value can be represented in a small number of bits, and is appropriate for compressing the frequency of an indexing term. Although the Elias gamma code can similarly be used for compressing a docID, the Elias delta code (Non-patent Document 1) is more appropriate, since identification numbers are not as concentrated on small values as occurrence frequencies are. An integer value can also be compressed into a variable-length bit sequence by the Golomb code (Non-patent Document 2) and the Rice code (Non-patent Document 3), whose processing speed and compression rate are said to be better than those of the Elias gamma and delta codes.
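The two Elias codes can be sketched as follows in Python. This is an illustration under stated assumptions rather than an implementation taken from the cited documents: codewords are shown as strings of "0" and "1" for readability instead of packed bits, the function names are hypothetical, and the input must be an integer of 1 or more, because neither code can represent 0.

```python
def elias_gamma_encode(x):
    """Elias gamma code: N zeros followed by the (N + 1)-bit binary form of x."""
    if x < 1:
        raise ValueError("Elias codes are defined for integers >= 1")
    binary = bin(x)[2:]  # binary digits of x without the '0b' prefix
    return "0" * (len(binary) - 1) + binary


def elias_delta_encode(x):
    """Elias delta code: gamma code of the bit length, then x without its leading 1."""
    if x < 1:
        raise ValueError("Elias codes are defined for integers >= 1")
    binary = bin(x)[2:]
    return elias_gamma_encode(len(binary)) + binary[1:]


def elias_gamma_decode(bits):
    """Decode one gamma codeword located at the start of a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    # The value occupies the next zeros + 1 bits, starting at the first 1.
    return int(bits[zeros:2 * zeros + 1], 2)
```

With these definitions, the value 1 is encoded by the gamma code in a single bit, and 5 becomes 00101. For a large value such as 1000000, the gamma codeword is 39 bits long whereas the delta codeword is 28 bits long, which illustrates why the delta code suits values that are not concentrated on small numbers.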
There is also a known method in which an integer value is represented as a byte sequence of variable length. In the variable byte method (Non-patent Documents 3 and 4), a numeric value is represented in binary, and each group of 7 bits is stored in one byte of a variable-length byte sequence 202. FIG. 2 is a diagram showing an example of an integer value compressed by the variable byte method. The most significant bit 201 of each byte is used to determine whether or not the byte including this bit is at the end of the byte sequence. A byte whose most significant bit 201 is 1 is not at the end of the byte sequence 202. In contrast, a byte whose most significant bit 201 is 0 is at the end of the byte sequence 202. The meanings of 0 and 1 indicated by the most significant bit 201 may be reversed. The example in FIG. 2 shows a case of compressing the integer 123456 (0000111 1000100 1000000 in binary representation) by the variable byte method. Here, since the integer can be represented in 3 bytes, the data is compressed to 3/4 = 75% of its original size if the original integer 123456 is recorded in 32 bits (4 bytes). Although 123456 is encoded sequentially from left to right (from the most significant bits to the least significant bits) in the example of FIG. 2, encoding may alternatively be performed from right to left (from the least significant bits to the most significant bits). The merit of the variable byte method is that the decoding process is simple, which makes high-speed processing possible.
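The variable byte method can be sketched in Python as follows. The sketch follows the convention described above (a most significant bit of 1 marks a byte that is not the last one, a most significant bit of 0 marks the last byte, and the most significant 7-bit group comes first). The function names vbyte_encode and vbyte_decode are hypothetical.

```python
def vbyte_encode(value):
    """Encode a non-negative integer as a variable-length byte sequence."""
    if value < 0:
        raise ValueError("value must be non-negative")
    # Split the value into 7-bit groups, least significant group first.
    groups = [value & 0x7F]
    value >>= 7
    while value > 0:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # most significant group first, as in FIG. 2
    # Set the most significant bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vbyte_decode(data):
    """Decode a concatenation of variable byte codes into a list of integers."""
    values = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:  # most significant bit 0: last byte of a value
            values.append(current)
            current = 0
    return values


encoded = vbyte_encode(123456)
```

For the integer 123456, encoded is the 3-byte sequence 0x87 0xC4 0x40, that is, the three 7-bit groups 0000111, 1000100 and 1000000 with the most significant bit set on the first two bytes. Decoding needs only a shift, a mask and one test per byte, which reflects the simplicity noted above.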
A technique is also known in which encoding is performed not byte by byte but in units of computer words, such that multiple postings are packed at a time into a single 32-bit or 64-bit computer word (Non-patent Document 5). A technique that extends Non-patent Document 5 so that encoding is performed across the boundaries of computer words has also been proposed (Non-patent Document 6). Other known techniques for compactly representing an inverted index for full text search are disclosed in Patent Documents 1 and 2. Moreover, Non-patent Document 7 discloses a technique for retrieving postings related to a specific document in an inverted list at high speed, by dividing the list of postings prepared for each word, that is, an inverted list 108 (FIG. 23), into multiple blocks of the same size, and by performing a binary search thereon.
[Patent Document 1] Japanese Patent Application Laid-open Publication No. 2001-312517.
[Patent Document 2] Japanese Patent Application Laid-open Publication No. Hei 6-243009.
[Non-patent Document 1] P. Elias, Universal codeword sets and representations of the integers, IEEE Transactions on Information Theory, IT-21(2):194-203, 1975.
[Non-patent Document 2] S. Golomb, Run-length encodings, IEEE Transactions on Information Theory, IT-12(3):399-401, 1966.
[Non-patent Document 3] F. Scholer, H. E. Williams, J. Yiannis and J. Zobel, Compression of Inverted Indexes for Fast Query Evaluation, Proc. 25th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 222-229, 2002.
[Non-patent Document 4] H. Williams and J. Zobel, Compressing integers for fast file access, Computer Journal, 42(3):193-201, 1999.
[Non-patent Document 5] V. N. Anh and A. Moffat, Inverted Index Compression Using Word-Aligned Binary Codes, Information Retrieval, 8(1):151-166, 2005.
[Non-patent Document 6] V. N. Anh and A. Moffat, Improved Word-Aligned Binary Compression For Text Indexing, IEEE Transactions on Knowledge and Data Engineering, 18(6):857-861, 2006.
[Non-patent Document 7] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, second ed. San Francisco: Morgan Kaufmann, 1999.
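The block-based lookup attributed above to Non-patent Document 7 can be illustrated with a short Python sketch. It is not taken from that document: the function names, the use of the first docID of each block as a search key, and the uncompressed blocks are assumptions for illustration. In a compressed index, only the block selected by the binary search would need to be decoded.

```python
from bisect import bisect_right


def split_into_blocks(postings, block_size):
    """Split a docID-sorted inverted list into blocks of block_size postings.

    Returns the blocks and the first docID of each block. The last block
    may hold fewer postings than the others.
    """
    blocks = [postings[i:i + block_size]
              for i in range(0, len(postings), block_size)]
    first_doc_ids = [block[0][0] for block in blocks]
    return blocks, first_doc_ids


def find_posting(blocks, first_doc_ids, doc_id):
    """Return the (docID, frequency) posting for doc_id, or None if absent."""
    # Binary search for the last block whose first docID is <= doc_id.
    block_number = bisect_right(first_doc_ids, doc_id) - 1
    if block_number < 0:
        return None
    # Scan only the selected block.
    for posting in blocks[block_number]:
        if posting[0] == doc_id:
            return posting
    return None


inverted_list = [(2, 1), (5, 3), (9, 1), (14, 2), (20, 1), (31, 4), (40, 1)]
blocks, first_doc_ids = split_into_blocks(inverted_list, 3)
```

With a block size of 3, the sample list is split into three blocks whose first docIDs are 2, 14 and 40. A lookup of docID 20 therefore inspects only the second block and returns the posting (20, 1).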