Search engines assist users in locating information from documents, including, for example, web pages, PDFs, word processing documents, source code, text files, and images. When searching a large corpus of documents, such as the documents available on the Internet, a search engine may use a range query data structure for efficient substring/phrase searching. A range query data structure represents general integer-to-integer mappings (e.g., a map[i] function) with only slightly more memory than a straightforward array representation, yet it dramatically reduces the theoretical complexity of range queries (e.g., enumerate all integers of the set j0<=map[i]<=j1, where i0<=i<=i1). In document searching, the range query data structure enables the use of suffix arrays. Suffix arrays store document strings, including partial strings, so that strings that look similar appear at neighboring places in the array. For example, if a document contains the words “prevent,” “inventions,” “venture,” and “intervention,” these four occurrences of the word stem “vent” will appear in neighboring entries in the suffix array (e.g., as entries of “vent,” “vention,” “ventions,” and “venture”). This is an efficient way to query the contents of documents. However, for document searching and ranking, it is important to know the position of each string within the document (e.g., where in the document each string occurs). The range query data structure provides this answer by mapping the positions of the suffix array (the i values) to document positions (the j values) in order of appearance within the document.
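The neighboring-entries behavior described above can be illustrated with a minimal sketch. It assumes a naive suffix array built by sorting every suffix of a short sample text; the helper names build_suffix_array and prefix_interval, and the sample text itself, are hypothetical and chosen only for illustration. A production index would use a faster construction and would not materialize the suffixes.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order."""
    # Naive construction for illustration only (an assumption, not the
    # method of any particular search engine).
    return sorted(range(len(text)), key=lambda j: text[j:])

def prefix_interval(text, suffix_array, prefix):
    """Return the inclusive interval (i0, i1) of entries starting with prefix, or None."""
    k = len(prefix)
    # Truncating sorted suffixes to k characters keeps them in sorted order,
    # so a binary search over the truncated strings is valid.
    heads = [text[j:j + k] for j in suffix_array]
    i0 = bisect_left(heads, prefix)
    i1 = bisect_right(heads, prefix) - 1
    return (i0, i1) if i0 <= i1 else None

text = "prevent inventions venture intervention"
suffix_array = build_suffix_array(text)  # in this sketch, map[i] is suffix_array[i]

i0, i1 = prefix_interval(text, suffix_array, "vent")
in_array_order = suffix_array[i0:i1 + 1]
in_document_order = sorted(in_array_order)

print(in_array_order)     # [3, 32, 10, 19]
print(in_document_order)  # [3, 10, 19, 32]
```

The four matches sit in one contiguous interval of the suffix array, but their document positions come out unordered; the final sort is the step a range query data structure is meant to make unnecessary.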
Some range query data structures use a bitmap binary tree structure for the mapping. The leaves of the tree are the document positions in sorted order (the j values). The nodes of the tree indicate the path to the correct leaf node, with the values in the root node mapping directly to the suffix array positions (the i values). FIG. 2 shows an example of such a tree. The map[i] function 200 takes a value i from 0 to n (in this example, n=15, and i may represent an index into a suffix array) and returns an integer j representing the position in the document of the value stored at i. For example, if the suffix entry “vent” is stored in the first suffix array position, represented by an index of zero, the map[i] function 200 indicates that the “vent” string occurs at index 14 in the document. The map[i] function 200 shown in FIG. 2 is not generally stored in memory; it is shown as an example of what the map[i] function is expected to return for a given input. In the example of FIG. 2, given an index value of six for i, the map function should return a j value of one; given the third position of the suffix array, the map function should return a j value of eight; and so on.
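One common way to realize such a bitmap binary tree is sketched below, under stated assumptions: each internal node stores one bit per entry (0 to descend left into the lower half of the j range, 1 to descend right), and a prefix count of 1 bits translates a position at a node into a position at the chosen child. The class name BitmapTree, its methods, and the eight sample values are hypothetical; the values are not those of FIG. 2, and the layout is not claimed to match that figure exactly.

```python
class BitmapTree:
    """Hypothetical bitmap binary tree over document positions in [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi:
            return  # leaf: represents the single document position lo
        mid = (lo + hi) // 2
        # One bit per entry: 0 -> lower half (left), 1 -> upper half (right).
        self.bits = [0 if v <= mid else 1 for v in values]
        # ones_before[k] = number of 1 bits among bits[0:k].
        self.ones_before = [0]
        for b in self.bits:
            self.ones_before.append(self.ones_before[-1] + b)
        self.left = BitmapTree([v for v in values if v <= mid], lo, mid)
        self.right = BitmapTree([v for v in values if v > mid], mid + 1, hi)

    def access(self, i):
        """Return map[i] by following the bits from the root down to a leaf."""
        node = self
        while node.lo != node.hi:
            ones = node.ones_before[i]
            if node.bits[i]:
                i, node = ones, node.right
            else:
                i, node = i - ones, node.left
        return node.lo

    def range_query(self, i0, i1, j0, j1):
        """Return, in sorted order, every j = map[i] with i0<=i<=i1 and j0<=j<=j1."""
        out = []
        self._collect(i0, i1 + 1, j0, j1, out)
        return out

    def _collect(self, start, end, j0, j1, out):
        # [start, end) is the half-open range of positions at this node.
        if start >= end or self.hi < j0 or self.lo > j1:
            return
        if self.lo == self.hi:
            out.extend([self.lo] * (end - start))
            return
        s1, e1 = self.ones_before[start], self.ones_before[end]
        self.left._collect(start - s1, end - e1, j0, j1, out)
        self.right._collect(s1, e1, j0, j1, out)

# Hypothetical mapping for illustration (not the values of FIG. 2).
mapping = [5, 2, 7, 0, 3, 6, 1, 4]
tree = BitmapTree(mapping, 0, len(mapping) - 1)

print(tree.access(0))                # 5
print(tree.range_query(2, 5, 0, 7))  # [0, 3, 6, 7]
print(tree.range_query(0, 7, 2, 5))  # [2, 3, 4, 5]
```

Because the left child is always visited before the right child, the results arrive already sorted by document position, and subtrees whose j range falls outside [j0, j1] are skipped entirely. The prefix counts here are plain Python lists for clarity; a compact implementation would typically use packed bitmaps with sampled counts.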
Range query structures are used because they reduce the amount of time needed to process large intervals. For example, in a brute-force lookup (without using the query structure), the time to collect all mapped values into an array is O(m) and the time to sort the array is O(m*log(m)), where m is the size of the range being mapped. By contrast, a range query data structure takes O(m*(2+log(n)−log(m))) time, where n is the size of the array; following a single entry to a leaf has a log(n) overhead. When m is large (e.g., m=n), the range query easily outperforms the brute-force method. For smaller intervals, however, it does not, because the logarithmic overhead is larger than the savings. Most of the processing time spent in a range query occurs in the lower levels of the binary tree, and as the size of the binary tree grows, the number of cache misses (caused by random memory accesses) increases. These cache misses can degrade the performance of the tree, especially for large trees, such as a tree with hundreds of millions of leaf nodes. Furthermore, for some queries (e.g., text searches), the range of values to be mapped is small (e.g., tens or hundreds) compared to the range spanned by the corresponding nodes in the binary tree (e.g., hundreds of millions), making the use of the range query data structure costly for the small range.
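The trade-off between the two approaches can be made concrete with a rough cost sketch. It assumes base-2 logarithms and treats every constant hidden by the O-notation as 1, so the numbers are only indicative and ignore cache effects; the function names and the break-even estimate are illustrative assumptions derived from the two expressions above, not figures taken from any measurement.

```python
import math

def brute_force_range_query(mapping, i0, i1, j0, j1):
    """Collect map[i] for i0<=i<=i1, sort, and keep values with j0<=j<=j1."""
    collected = [mapping[i] for i in range(i0, i1 + 1)]  # O(m) collection
    collected.sort()                                     # O(m*log(m)) sort
    return [j for j in collected if j0 <= j <= j1]

def brute_force_cost(m):
    """Estimated steps for collecting and sorting m values (constants assumed 1)."""
    return m + m * math.log2(m)

def range_query_cost(m, n):
    """Estimated steps for m*(2+log(n)-log(m)) (constants assumed 1)."""
    return m * (2 + math.log2(n) - math.log2(m))

def break_even_interval(n):
    """Interval size at which both estimates are equal under this model."""
    # m + m*log(m) = m*(2 + log(n) - log(m))  =>  2*log(m) = 1 + log(n)
    # =>  m = sqrt(2*n)
    return math.sqrt(2 * n)

n = 10**8
for m in (100, 10**4, 10**6, n):
    print(m, round(brute_force_cost(m)), round(range_query_cost(m, n)))
print(round(break_even_interval(n)))  # 14142
```

Under these simplifying assumptions, an array of one hundred million entries favors the brute-force method for intervals of tens or hundreds of values, while the range query estimate becomes the smaller one once the interval grows past roughly the square root of 2n.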