Today's large data centers manage collections of data comprising billions of data items. In large collections like these, searching for particular items that meet conditions of a given search query is a task that can take noticeable time and consume a considerable amount of resources. Query response time can be critical in many applications, either due to specific technical requirements, or because of high expectations from users. Therefore, various solutions have been proposed for reducing search query execution times.
Typically, to build a search-efficient data collection management system, data items are indexed according to some or all of the possible query terms. A so-called inverted index of the data collection is maintained and updated by the system, to be then used in execution of every search query. The inverted index comprises a large set of posting lists, where every posting list corresponds to a search term and contains references to data items comprising that search term, or satisfying some condition that is expressed by the search term.
Using a Web search engine as a first example, data items may take the form of text documents and search terms may be individual words or some of their most frequently used combinations. The inverted index then comprises one posting list for every word present in at least one of the documents. In a second example, a data collection may be a database comprising one or more very long tables, in which data items are individual records, for example lines in a table, having a number of attributes represented by values in appropriate columns. In this second example, search terms are specific attribute values, or other conditions on attributes, and the posting list for a search term is a list of references (indexes, ordinal numbers) to records that satisfy the search term.
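The first example above can be illustrated with a minimal sketch that builds an inverted index over a few short text documents. The function name build_inverted_index and the three sample documents are hypothetical and serve only as an illustration; splitting on whitespace is an assumed, deliberately naive way of extracting terms.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map every term to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive term extraction (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Each posting list is kept in ascending order of document number.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

# Hypothetical sample documents, numbered from 1.
docs = {
    1: "the dog chased the horse",
    2: "a horse ate hay",
    3: "the dog slept",
}
index = build_inverted_index(docs)
print(index["dog"])    # [1, 3]
print(index["horse"])  # [1, 2]
```

Keeping each posting list sorted costs little at build time and is what later makes delta references possible.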
FIG. 1 is a simplified illustration of an inverted index in tabular form. A tabular inverted index 2 shown in FIG. 1 is much smaller than those of typical applications and is thus greatly simplified for illustration purposes. The tabular inverted index 2 may be applied to both the example of a Web search engine and the example of a database. The tabular inverted index 2 corresponds to 100 documents (not shown) stored in a database (not shown), the documents being numbered from 1 to 100. As shown, the tabular inverted index 2 comprises a header row 4 that defines elements of various columns 16, 18, 20, 22, 24 and 26. The header row 4 may not be present in some actual realizations and is shown in FIG. 1 for illustration purposes. Other rows 6, 8, 10, 12 and 14 each comprise a term in column 16, and a corresponding posting list in columns 18, 20, 22, 24 and 26. In the particular example of FIG. 1, terms of rows 6, 8, 10, 12 and 14 are names of animals that are mentioned in several of the 100 documents of the database. Each posting list comprises a first document reference in column 18 and may comprise additional document references in columns 20-26. Considering for example row 6, the term “dog” is found in documents number 25, 35, 47, 65 and 83 of the database. The first document reference placed in row 6, column 18, may be an absolute document number (25) or a first delta reference indicating a difference between the absolute document number and a 0th document number, this first delta reference being equal to the absolute document number. A second document reference is placed in row 6, column 20. The second document reference may be stored as an absolute document number (35).
Alternatively, the second document number may be stored as a second delta reference, indicating a difference (10) between the second document number (35) and the first document number (25); to use delta references, document reference numbers are stored in the posting lists in ascending order. Likewise, a number of a third document comprising the term “dog” may be stored as an absolute document number (47) or as a third delta (12) between the third document number (47) and the second document number (35).
Using delta references requires less memory space for storing the tabular inverted index 2 since, on average, data elements of the tabular inverted index 2 are smaller and can be encoded with fewer bits. Because a difference between two distinct absolute document numbers is always at least one (1), additional space may be saved by recalculating delta references as differences between absolute document numbers minus one (1). Using this manner of calculating delta references, all numbers between parentheses of the inverted index 2 would be decremented by one (1). For example, the first delta reference on row 8, for the term “horse”, would be decremented from “8”, which requires four (4) bits for encoding, to “7”, which can be encoded with only three (3) bits.
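The two manners of calculating delta references may be sketched as follows, using the posting list given above for the term “dog”. The function names to_deltas and from_deltas and the minus_one flag are hypothetical, chosen only for this illustration.

```python
def to_deltas(doc_ids, minus_one=False):
    """Convert ascending document numbers into delta references.

    The first delta is taken relative to a 0th document number of 0.
    With minus_one=True every delta is additionally decremented by one.
    """
    offset = 1 if minus_one else 0
    deltas, previous = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous - offset)
        previous = doc_id
    return deltas

def from_deltas(deltas, minus_one=False):
    """Rebuild absolute document numbers by a running sum over the deltas."""
    offset = 1 if minus_one else 0
    doc_ids, previous = [], 0
    for delta in deltas:
        previous += delta + offset
        doc_ids.append(previous)
    return doc_ids

dog = [25, 35, 47, 65, 83]
print(to_deltas(dog))                  # [25, 10, 12, 18, 18]
print(to_deltas(dog, minus_one=True))  # [24, 9, 11, 17, 17]

# Decrementing 8 to 7 saves one bit, as in the "horse" example.
print((8).bit_length(), (7).bit_length())  # 4 3
```

Decoding is a running sum, so the same minus_one setting has to be used on both sides.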
The illustrated tabular inverted index 2 provides references for five (5) distinct animal names that may be searched among the 100 documents of the database. Accordingly, the highest document reference number does not exceed 100. It may be observed that distinct terms may be found in the same document, for example “dog” and “horse” being both found in document number 25, and that terms that refer to rare animals are found in fewer documents.
FIG. 2 is a simplified illustration of an inverted index in single vector form. Information elements of the tabular inverted index 2 of FIG. 1 are reproduced in a single vector inverted index 30 of FIG. 2; some additional elements have been added for illustration purposes. The single vector inverted index 30 is built in a similar fashion as the tabular inverted index 2 of FIG. 1, except that terms and corresponding posting lists are placed on a continuous vector, a second term (horse) following a posting list for a first term (dog) so that no position needs to remain empty, as is the case, for example, of the last few columns of rows 10-14 of FIG. 1.
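The single vector arrangement may be sketched as follows. This is only an illustration under stated assumptions: the exact layout of FIG. 2 is not reproduced, terms are told apart from document references simply by their type, the function names flatten and lookup are hypothetical, and the posting list shown for “horse” is shortened to the two document numbers mentioned in the text.

```python
def flatten(index):
    """Place each term, followed by its posting list, on one continuous vector."""
    vector = []
    for term, postings in index.items():
        vector.append(term)
        vector.extend(postings)
    return vector

def lookup(vector, term):
    """Scan the vector for `term` and collect the integers that follow it."""
    postings, collecting = [], False
    for item in vector:
        if isinstance(item, str):      # a term marks the start of a new posting list
            if collecting:
                break                  # reached the next term: the list is complete
            collecting = (item == term)
        elif collecting:
            postings.append(item)
    return postings

# "horse" is shortened for illustration; it is not the full list of FIG. 1.
index = {"dog": [25, 35, 47, 65, 83], "horse": [8, 25]}
vector = flatten(index)
print(vector)                   # ['dog', 25, 35, 47, 65, 83, 'horse', 8, 25]
print(lookup(vector, "horse"))  # [8, 25]
```

No position of the vector is left empty, whatever the lengths of the individual posting lists.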
A query for documents that contain a particular term may be executed by first finding that particular term in the inverted index 2 or 30 and by fetching the relevant documents using the corresponding posting list. To speed up execution of search queries, the inverted index is typically stored in a fast memory, for example in Random Access Memory (RAM) of one or more computer systems. Documents or other data items themselves may be stored on a larger but slower storage medium, for example on magnetic or optical disks or other similar large capacity devices. In this way, processing of a search query implies looking through one or more posting lists of the inverted index in the faster memory, rather than through the data items themselves.
Typically, documents or other data items of a searchable information base are listed in the inverted index as integer reference numbers. For some applications, document numbers may range up to one billion or even several billions. Some words that may be used as search terms may be located in very large numbers of documents, for example in millions of documents. Consequently, an inverted index may comprise millions of searchable terms, each of these terms being associated with a potentially long posting list. It follows that there is a need, in various computer applications, to represent very long lists of symbols or codewords, for example document reference numbers, in compressed form and to store these long lists in fast computer memory for efficient access and processing.
In many applications, storing of documents in a database and updating of an inverted index is performed as a background application. This may for example be the case of so-called Webcrawler applications that automatically browse through the World Wide Web to accumulate information into a database of a Web search engine. For these applications, compression speed is of secondary importance, while effectiveness of compression of information in the inverted index matters more. Decompression speed, in contrast, is important since a user of a Web search engine or of a database system may require fast response to her search queries.
It can be seen from FIG. 1 and from FIG. 2 that terms that may be found in a large number of documents are associated with long posting lists that, in turn, contain small reference numbers (small integers) when delta references are used. A posting list of small integers may be subdivided into short blocks, these blocks then being compressed for compact storage of the inverted index in memory. Ideally, all elements in a short block would be of a same length in the sense that they would be coded with a same number of bits. Use of same-length coding of elements in a block allows exploiting low-level parallelism in computer systems, as in the case of single instruction multiple data (SIMD) processors. For example, if the processor has a subset of SIMD instructions capable of being performed on eight (8) different data elements in parallel, it would be beneficial to represent every long list of symbols as a sequence of blocks, where every block contains or represents exactly eight (8) symbols. In fact, some processors are capable of executing SIMD instructions on as many as 32 or even 128 data elements in parallel; blocks of 32 or 128 consecutive integers, representing reference numbers (or delta reference numbers), could be efficiently handled by such processors as long as their 32 or 128 elements are of equal lengths.
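Same-length coding of a block may be sketched as follows. The sketch is written in scalar Python and does not use SIMD instructions; the function names pack_block and unpack_block, the sample values and the bit ordering (element 0 in the lowest bits) are assumptions made for illustration.

```python
def pack_block(values, width):
    """Pack integers into one Python int, `width` bits per element, element 0 lowest."""
    packed = 0
    for i, value in enumerate(values):
        if value >> width:
            raise ValueError(f"{value} does not fit in {width} bits")
        packed |= value << (i * width)
    return packed

def unpack_block(packed, width, count):
    """Extract `count` elements of `width` bits each.

    Every element is obtained by the same shift-and-mask operation at a
    position known in advance, independently of all other elements.
    """
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

block = [5, 1, 7, 2, 0, 3, 6, 4]   # arbitrary sample values, all below 8
packed = pack_block(block, 3)
print(packed.bit_length() <= 24)   # True: 8 elements x 3 bits
print(unpack_block(packed, 3, 8))  # [5, 1, 7, 2, 0, 3, 6, 4]
```

Because no element depends on the decoding of another, a SIMD processor may in principle apply the shift-and-mask step to all elements of the block at once.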
However, coding all elements of a block in a same number of bits may be inefficient in terms of compression. For example, seven (7) elements of a block might be codable on three (3) bits while another element of the block may require five (5) bits for coding. Coding all eight (8) elements of the block on five (5) bits each would not attain an optimal level of compression. A list coding method called “Patched Frame of Reference” (PFor) proposes to code smaller elements of a block on their optimal number of bits (in the current example, on 3 bits each), while moving out the larger element(s) into a separate list of exceptions, called “patches”, which are coded on more bits. Every patch position in a “main” block is filled with a number of a next patch position relative to this one, thus making up a chained list of patch jumps across the block. A block header contains a first patch position number relative to the beginning of the block, as well as the number of bits used for every smaller element (3 bits in the current example) placed in its original position in the block and the number of bits (2 bits in the current example) for every “patched” larger element.
It has been found that PFor works reasonably well for medium-sized blocks, for example for blocks of length 128 elements. However, PFor does not provide for sufficient parallelism in list decoding, because the chained list of relative patch jumps must still be retrieved sequentially, and converted into absolute patch positions within the block. Also, there may be cases when a relative jump from one patch position to the next one is too long to be coded “inline” in as many bits as used for every inline element of the block. In those cases, a fake patch position must be introduced, to split the jump into two shorter ones.
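The chained list of patch jumps, including fake patch positions, may be sketched as follows. This is a simplified illustration and not the exact PFor format: the function names are hypothetical, exceptions are kept as whole values, the jump is stored as is in the patch slot, the last patch slot holds 0, and the base width b is assumed to be at least one bit.

```python
def pfor_encode(values, b):
    """Return (first patch position, inline slots, exception values)."""
    max_jump = (1 << b) - 1            # largest number that fits in b bits
    positions = []
    for i, value in enumerate(values):
        if value > max_jump:           # too large for an inline slot
            # Insert fake patches while the jump from the previous patch
            # would not fit in b bits.
            while positions and i - positions[-1] > max_jump:
                positions.append(positions[-1] + max_jump)
            positions.append(i)
    slots = list(values)
    for here, nxt in zip(positions, positions[1:] + [None]):
        slots[here] = 0 if nxt is None else nxt - here   # jump to next patch
    exceptions = [values[p] for p in positions]
    first = positions[0] if positions else None
    return first, slots, exceptions

def pfor_decode(first, slots, exceptions):
    values = list(slots)
    pos = first
    for exception in exceptions:       # sequential: each position depends on the last
        jump = slots[pos]
        values[pos] = exception
        pos += jump
    return values

block = [3, 5, 1, 7, 2, 20, 4, 6]      # seven 3-bit elements and one 5-bit element
print(pfor_encode(block, 3))           # (5, [3, 5, 1, 7, 2, 0, 4, 6], [20])

sparse = [9, 1, 2, 0, 1, 3, 2, 8]      # jump from position 0 to 7 exceeds 2 bits
first, slots, exceptions = pfor_encode(sparse, 2)
print(slots, exceptions)               # [3, 1, 2, 3, 1, 3, 1, 0] [9, 0, 2, 8]
```

In the second block, positions 3 and 6 are fake patches: their values (0 and 2) would fit inline, but they are moved to the exception list only to keep every jump within two bits. The decoding loop cannot visit a patch before it has processed the previous one, which is the sequential dependency noted above.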
According to a modified PFor method called “NewPForDelta” (NewPFD), the least significant bits of the appropriate patch value (3 least significant bits in the above example) stand in every patch position while the remaining bits are coded apart (2 remaining bits in the above example). The whole representation of a block thus consists of three (3) lists appended to each other, including (i) a main list comprising smaller elements along with least significant bits of larger elements in patch positions, (ii) a list of the remaining portions of the larger elements, and (iii) a chained list of jumps from one patch position to another.
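The three lists of the NewPFD representation may be sketched as follows. The function names and the sample block are hypothetical, and the jumps are stored here as plain differences between successive patch positions, which is an assumption made for simplicity rather than the exact NewPFD coding.

```python
def newpfd_encode(values, b):
    """Split a block into (main list, high bits of patches, jumps between patches)."""
    mask = (1 << b) - 1
    main = [v & mask for v in values]                  # (i) b low bits of every element
    positions = [i for i, v in enumerate(values) if v > mask]
    high_bits = [values[p] >> b for p in positions]    # (ii) remaining bits of patches
    # (iii) first jump counted from the block start, others from the previous patch
    jumps = [p - q for p, q in zip(positions, [0] + positions[:-1])]
    return main, high_bits, jumps

def newpfd_decode(main, high_bits, jumps, b):
    values = list(main)
    pos = 0
    for high, jump in zip(high_bits, jumps):
        pos += jump                  # running sum: still a sequential dependency
        values[pos] |= high << b
    return values

block = [3, 5, 1, 7, 2, 20, 4, 6]    # 20 = 0b10100 needs five bits
main, high_bits, jumps = newpfd_encode(block, 3)
print(main)       # [3, 5, 1, 7, 2, 4, 4, 6]  (0b100 stands in the patch position)
print(high_bits)  # [2]                       (the two remaining bits, 0b10)
print(jumps)      # [5]
```

Unlike the chained scheme, a long jump no longer has to fit inside an inline slot, but patch positions are still recovered one after another.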
The NewPFD method, however, still does not provide for sufficient parallelism in list decoding. Hence, a further modification of the PFor method called “Parallel PFor” (ParaPFor) replaces the chained list of relative positions of patches with their absolute position numbers relative to the beginning of the block. For example, in a 32-element block, every patch position has a number from 0 to 31 and is thus coded on five (5) bits. This list of patch positions can be unpacked in a few parallel SIMD threads, at the same time as the main list and the list of higher bits of the patches. Finally, a parallel element-wise add operation can be performed, yielding the whole unpacked block.
The ParaPFor method may be demonstrated with the following example: Let us consider an 8-element block [3,2,4,1,0,1,5,2], each element representing for example a delta reference in a posting list. In binary representation, the block becomes [11,10,100,1,0,1,101,10]. Elements are numbered from e0 to e7, from left to right. The block has three (3) elements of 1-bit length (3rd, 4th and 5th elements), three (3) elements of 2-bit length (0th, 1st and 7th elements) and two (2) elements of 3-bit length (2nd and 6th elements).
FIG. 3 shows a data structure of an uncompressed posting list block. A posting list is composed of blocks of eight (8) elements each. The above mentioned block is the block number f in the posting list; its content is schematically shown at 40. A length of an element of the block f denotes a number of bits minimally necessary for its binary representation; such a shortest representation of an integer may be called its “canonical representation”. When an integer is coded on a greater number of bits than is necessary for its canonical representation, it is padded with non-significant binary 0's in the high (left) positions. Block f comprises a header byte 42 and three (3) data bytes 44, 46 and 48. The header byte 42 shows that all elements of the block f are coded with a length l of three (3) bits. This length is sufficient to code the longest elements e2 and e6 of the block f; other elements of the block f carry non-significant padding bits. FIG. 3 therefore shows a “non-patched” encoding of the block f. Because the length l is equal to three (3) bits and because the block f comprises eight (8) elements, the eight elements occupy l bytes and a total length of the block f, including the header byte, is equal to four (4) bytes, i.e. l+1 expressed in bytes.
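The lengths and the l+1 byte count stated above may be checked with the following sketch. The ordering of bits within the data bytes and the storing of the length as the plain value of the header byte are assumptions; the names canonical_length, width and encoded are hypothetical.

```python
def canonical_length(value):
    """Bits in the shortest binary representation; the value 0 counts as one bit."""
    return max(1, value.bit_length())

block = [3, 2, 4, 1, 0, 1, 5, 2]                  # elements e0 to e7
lengths = [canonical_length(v) for v in block]
print(lengths)                                     # [2, 2, 3, 1, 1, 1, 3, 2]

width = max(lengths)                               # the length l of the text
print(width)                                       # 3

# Non-patched encoding: every element padded to `width` bits.
data = 0
for i, value in enumerate(block):
    data |= value << (i * width)

# One header byte holding the length, then 8 * width bits = width data bytes.
encoded = bytes([width]) + data.to_bytes(width, "little")
print(len(encoded))                                # 4, i.e. width + 1

# Decoding reads the length from the header byte, then shifts and masks.
stored_width = encoded[0]
payload = int.from_bytes(encoded[1:], "little")
decoded = [(payload >> (i * stored_width)) & ((1 << stored_width) - 1)
           for i in range(8)]
```

The byte count equals l+1 only because a block holds exactly eight elements, so that eight l-bit elements always fill a whole number of bytes.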
Continuing with the ParaPFor method, the method defines a base length b, in number of bits, of shorter, or “inline” elements. Elements that cannot be encoded within b bits become truncated values that are also placed inline in the compressed block. The method also defines exceptions, or patches, for elements of the block that are longer than b bits. A modified header for the block specifies the base length b and positions of patches (“patch positions” p1, p2), on three (3) bits each. Higher bits of values of every exception (“patch values” v1, v2), representing a difference between the actual values of the uncompressed blocks and truncated values of the compressed block, are separately encoded, before or after the inline element values.
FIG. 4 shows a definition of patches for the posting list block of FIG. 3. The block f of FIG. 3 is schematically represented as 50, comprising a body row 52, a row 54 of patch values and a row 56 of patch indicia. A base length b is equal to two (2) bits and a body of the block f consists of the two (2) inline bits of every element, in which elements e2 and e6 are truncated. Values of patches on row 54 are either “0” in no-patch positions or “1” in patch positions. It is observed that in the particular example shown herein, patch values are limited to a maximum of one (1) since no value requires more than three (3) bits for encoding and the base length b is equal to two (2) bits. Row 56 indicates that there are two (2) patches in positions pi equal to 2 and 6 and that their values vi are both equal to one (1).
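The patch definition described above may be reproduced with the following sketch, which splits the example block into a truncated body, absolute patch positions and patch values, and then rebuilds it with an element-wise add. The function names split_block and merge_block are hypothetical, and the code is scalar Python rather than SIMD code.

```python
def split_block(block, b):
    """Return (body of b low bits, absolute patch positions, patch values)."""
    mask = (1 << b) - 1
    body = [v & mask for v in block]             # inline bits, truncated if needed
    high = [v >> b for v in block]               # 0 wherever there is no patch
    positions = [i for i, h in enumerate(high) if h]
    patch_values = [high[i] for i in positions]
    return body, positions, patch_values

def merge_block(body, positions, patch_values, b):
    """Rebuild the block from its body and patches."""
    patch_row = [0] * len(body)
    # Positions are absolute, so no patch depends on any other patch.
    for position, value in zip(positions, patch_values):
        patch_row[position] = value
    # Element-wise add of the body and the shifted patch row.
    return [low + (h << b) for low, h in zip(body, patch_row)]

block = [3, 2, 4, 1, 0, 1, 5, 2]
body, positions, patch_values = split_block(block, 2)
print(body)          # [3, 2, 0, 1, 0, 1, 1, 2]
print(positions)     # [2, 6]
print(patch_values)  # [1, 1]
print(merge_block(body, positions, patch_values, 2))  # [3, 2, 4, 1, 0, 1, 5, 2]
```

Elements e2 and e6 appear in the body as 0 and 1, the two low bits of 4 and 5, and recover their full values once the patch value 1, shifted left by b bits, is added.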
FIG. 5 shows a conventional manner of encoding the posting list block of FIG. 3 with the patch definitions of FIG. 4. The block f is now compressed as shown at 60. Using ParaPFor encoding, a header 62 of the block f specifies a total number n of patches, varying for example from 0 (no patches) to 2 or 3 patches. The header 62 also contains a length d of every patch value in the block; it may be assumed that all patches have the same length d, which is a length of the longest patch value. The header 62 then contains patch positions p1, p2, . . . and patch values v1, v2 . . . for the n patches. Inline values e0-e7, including truncated values where applicable, are appended in the compressed block f 60 after the header. If any given field is not sufficiently large to fill a position of the compressed block f 60, that field is padded with non-significant zero bits; this is applicable to header values and to inline values.
The ParaPFor method provides slightly less compression than the original PFor or NewPFD methods, but gains in decompression speed on a specialized processor architecture with an appropriate SIMD parallelism factor, for example with the 32-thread parallelism that can be efficiently used on the NVIDIA™ GTX480 graphical processor. ParaPFor can thus be considered as offering a reasonably good tradeoff between compression factor and decompression speed on such computer systems. There exists, however, a large family of general use processors, commonly denoted as the “x86 family”, comprising devices from Intel™, AMD™ and a few other manufacturers, that are widely used in various computer server architectures including very powerful multiprocessor servers. Modern processors of the x86 family are equipped with the so-called “Streaming SIMD Extensions” (SSE) set of instructions, providing for parallel execution of the same operation on a bank of 8 “short integer” 16-bit registers. This makes it possible to achieve an 8-thread SIMD parallelism on every processor in a server.
For such an 8-thread SIMD architecture, however, the PFor compression method or its known enhancements including ParaPFor do not provide for an optimal balance between compression density and decompression speed. This is because in a block as short as 8 elements, explicit indication of every patch position becomes inefficient in terms of compression ratio, as compared with a simple enumeration of patch position combinations. Also, repetitive operations of extracting one or more patch position numbers from a block header take time and are processor intensive.
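The comparison between explicit patch positions and an enumeration of patch position combinations may be made concrete with a short calculation. It is assumed here, for illustration only, that an enumeration assigns one code to every possible set of n patch positions in an 8-element block; how such codes would actually be laid out is not specified above, and the function names are hypothetical.

```python
from math import comb

BLOCK_SIZE = 8
POSITION_BITS = 3        # one explicit position in an 8-element block

def bits_to_enumerate(count):
    """Bits needed to give each of `count` alternatives a distinct code."""
    return (count - 1).bit_length()

def explicit_bits(n):
    """Cost of listing n patch positions explicitly, 3 bits each."""
    return n * POSITION_BITS

def enumerated_bits(n):
    """Cost of one code selecting among all sets of n positions out of 8."""
    return bits_to_enumerate(comb(BLOCK_SIZE, n))

for n in range(4):
    print(n, comb(BLOCK_SIZE, n), explicit_bits(n), enumerated_bits(n))
# n  combinations  explicit  enumerated
# 0       1            0         0
# 1       8            3         3
# 2      28            6         5
# 3      56            9         6

# One code covering every pattern with 0 to 3 patches, patch count included:
total_patterns = sum(comb(BLOCK_SIZE, n) for n in range(4))
print(total_patterns, bits_to_enumerate(total_patterns))   # 93 7
```

Under these assumptions the enumeration is never longer than the explicit list and saves one bit for two patches and three bits for three patches, a noticeable share of a block whose body may occupy only two or three bytes.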
Hence it would be beneficial to have a list compression method providing for yet further improvements both in terms of compression density and of decompression speed. Such improvements would be particularly valuable when using computer architectures with 8-thread SIMD extensions, or in other similar configurations.
In a more general context, any further progress toward denser compression and faster decompression of long lists would be beneficial, as would any new list compression scheme providing a substantial gain in at least one of these parameters without introducing a substantial loss in the other.