It is desirable that data storage use as little space or memory as possible. To this end, mechanisms for lossless data compression have been developed. Classic Huffman coding, for example, encodes each source symbol (such as a character in a file) using a variable-length code table that is derived from the estimated probability of occurrence of each possible value of the source symbol, so that more probable symbols receive shorter codes. Huffman coding was advanced by wavelet trees, as described in Grossi, Roberto, et al., HIGH-ORDER ENTROPY-COMPRESSED TEXT INDEXES, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003. A wavelet tree, as described by Grossi et al., is a data structure that uses character-level compression, like Huffman coding, to represent data in a compressed, tree-based form that enables searching. Such wavelet trees were in turn advanced by word-based wavelet trees, which encode words instead of characters in the tree, as described in Brisaboa, Nieves R., et al., A NEW APPROACH FOR DOCUMENT INDEXING USING WAVELET TREES, 2007. The teachings of this article are hereby incorporated by reference.
Word-based wavelet trees operate as follows. The frequency of each word or phrase is counted. Each word is assigned a byte string according to its frequency. The most common words are replaced with the shortest byte string, i.e., a single byte. Less common words are replaced with longer byte strings, i.e., two or more bytes. Additionally, the byte strings are formed using end-tagged dense code (ETDC), which is a compression method that assigns byte codes to words, where the last byte of each byte code is used as an “end tag” by setting its first bit to “1.” Because the end tag marks where each byte code stops, the use of ETDC makes random access in compressed byte strings possible. By contrast, random access is not possible in bit-based Huffman coding because it is not known where the encoded characters start and stop.
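The frequency-ranked, end-tagged encoding described above can be illustrated with a minimal sketch. It assumes the commonly described ETDC layout, in which every non-final byte has its first bit cleared (values 0 to 127) and the final byte has its first bit set (values 128 to 255). The function names `etdc_encode`, `etdc_decode`, `build_codebook`, and `split_codewords` are hypothetical and chosen only for illustration; they are not taken from Brisaboa et al., and the sketch does not build a wavelet tree.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Return the byte code for a word with the given 0-based frequency rank."""
    # Final byte: first (most significant) bit set to 1 acts as the end tag.
    out = [0x80 | (rank % 128)]
    rank //= 128
    # Remaining bytes: first bit 0. The "- 1" keeps the code dense, so no
    # byte combination is wasted between code lengths.
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode: recover the rank from one complete byte code."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] & 0x7F)


def build_codebook(words):
    """Rank words by descending frequency and assign each rank its byte code."""
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: etdc_encode(r) for r, w in enumerate(ranked)}, ranked


def split_codewords(data: bytes):
    """Cut a compressed byte string into byte codes using only the end tags."""
    codes, start = [], 0
    for pos, b in enumerate(data):
        if b & 0x80:  # end tag found: a byte code stops here
            codes.append(data[start:pos + 1])
            start = pos + 1
    return codes


words = "the cat and the dog and the bird".split()
codebook, ranked = build_codebook(words)
compressed = b"".join(codebook[w] for w in words)
restored = [ranked[etdc_decode(c)] for c in split_codewords(compressed)]
```

In this sketch the 128 most frequent words receive one-byte codes and the next 16,384 receive two-byte codes. Because any byte with its first bit set closes a byte code, a reader positioned anywhere in the compressed string can find the next code boundary by scanning forward to the nearest end tag, which is the property that permits random access.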
Although word-based wavelet trees are an improvement over the prior art, their indexing is still not optimally efficient. Current self-indexers include only the basic rank, select, display, locate, and count functions described in Brisaboa et al. Therefore, there is a need for a word-based wavelet tree self-indexer that optimizes the indexing of word-based wavelet trees by including functions beyond these basic functions.