One prior art method of searching for a regular expression is scanning; that is, reading the input text one character at a time, checking for matches. However, as data volume increases, these O(n) scanning search strategies take longer and longer, and index-backed searching algorithms become of greater importance. It is known that regular expression searching can be achieved in sub-linear time (or o(n) time), using a suffix trie. However, suffix tries are normally considerably larger than the text they index.
Suffix arrays and compressed suffix arrays present a more space-efficient alternative to suffix tries. They provide similar functionality while normally occupying less space than the text they represent. In addition, there are compressed suffix array methods that provide for string searches in O(m) time, where m is the size of the string being searched for.
Each entry in a suffix array is an address into the original corpus, so each entry uses log n bits (where the base of the log is 2). Note that the suffix array uses the original corpus during the search procedure; in total, the suffix array structure, along with the original text, occupies n+n log n bits. Note too that since the suffix array is sorted, rows beginning with a particular string are contiguous. The straightforward way to find the range of rows beginning with a particular string is to use two binary searches. Binary search always runs in O(log n) comparisons, but in this case each comparison is a string comparison that may take up to m character comparisons. Thus, the search complexity is O(m log n). Persons skilled in the art will recognize that, in practice, the suffix array typically occupies 5n bytes, since 4-byte pointers are convenient on modern hardware.
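By way of illustration only, the following Python sketch (which is not part of the claimed invention, and whose function names are chosen for exposition) shows a suffix array built by sorting suffix start positions, and the two binary searches that find the contiguous range of rows beginning with a pattern in O(m log n) time:

```python
def build_suffix_array(text):
    # Each entry is an address (start offset) into the original corpus;
    # sorting the start positions by their suffixes yields the array.
    return sorted(range(len(text)), key=lambda i: text[i:])

def search_range(text, sa, pattern):
    # Because the suffix array is sorted, all rows beginning with
    # `pattern` are contiguous; two binary searches find the range.
    # Each comparison is a string comparison of up to m characters,
    # so the total cost is O(m log n).
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                  # leftmost match
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                  # one past rightmost match
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo                                # half-open range [start, lo)
```

For example, searching "abracadabra" for "abra" yields a range of width 2, corresponding to the two occurrences.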
The Burrows-Wheeler transform is related to suffix arrays and leads to a kind of compressed suffix array. Conceptually, it forms an n by n matrix M in which each row is a rotation of the original text and the rows are in lexicographically sorted order. The Burrows-Wheeler transformation takes the last column of this matrix, L. Note that if the text is terminated with an end-of-file character that is lexicographically less than all other characters, the start positions of the rows of M are the same as the suffix array of that text.
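By way of illustration only, the following Python sketch (not part of the claimed invention) computes the transform directly from the conceptual matrix, assuming the end-of-file character is '\x00', taken to be lexicographically smaller than every other character:

```python
def bwt(text):
    # Append an end-of-file sentinel assumed smaller than every other
    # character, form the conceptual matrix M of sorted rotations, and
    # take the last column L.
    text = text + "\x00"
    n = len(text)
    matrix = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in matrix)
```

For example, bwt("banana") returns "annb\x00aa".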
The Burrows-Wheeler transform is reversible: the original text can be reconstructed from the L column. Note that every column of the matrix is a permutation of the characters in the original text. Furthermore, the first column, F, contains the characters of the text in alphabetically sorted order. Thus, if L is transmitted, F can be recovered from it by sorting. Given F and L, persons skilled in the art can move backwards in the original text. Define Occ(ch,r) as the number of times the character ch appears in the L column at or before row r, and C[ch] as the number of instances of characters smaller than ch in the text, which is the same as the index of the first row in F that begins with ch. Then, for each row, the last-to-first column mapping is LF(i)=C[L[i]]+Occ(L[i],i)−1. This mapping provides a mechanism to step backwards: if row r begins with T[3..], then LF(r) gives the index of the row starting with T[2..]. It is useful to think of the LF mapping as giving the row number of the row starting with the character L[i].
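By way of illustration only, the following Python sketch (not part of the claimed invention) reconstructs the text from the L column alone, using the LF mapping above with a naive, linear-time Occ; it assumes the L column was produced with a '\x00' end-of-file sentinel:

```python
def inverse_bwt(L):
    # C[ch]: number of characters in the text smaller than ch, i.e. the
    # index of the first row of F beginning with ch.
    C = {}
    for i, ch in enumerate(sorted(L)):
        C.setdefault(ch, i)

    def occ(ch, r):
        # Occurrences of ch in the L column at or before row r (naive).
        return L[:r + 1].count(ch)

    def LF(i):
        # Last-to-first column mapping: LF(i) = C[L[i]] + Occ(L[i], i) - 1.
        return C[L[i]] + occ(L[i], i) - 1

    # Row 0 begins with the end-of-file sentinel, so L[0] is the last
    # real character of the text; stepping by LF yields the text
    # right to left.
    out, row = [], 0
    for _ in range(len(L) - 1):
        out.append(L[row])
        row = LF(row)
    return "".join(reversed(out))
```

For example, inverse_bwt("annb\x00aa") returns "banana".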
Ferragina and Manzini describe the FM-index, a string-searching index based on the Burrows-Wheeler transformation. Unlike a suffix array, however, the text may be discarded after the index is built. Given the Burrows-Wheeler transform of the corpus, the FM-index divides the L column into buckets of size b and groups these buckets into super-buckets of constant size. Each super-bucket stores, for every character, the number of occurrences since the start of the index. Each bucket stores the number of occurrences since the start of its super-bucket, in addition to the compressed contents of its section of the L column. To find Occ(ch,i), where i is the row number, the occurrence numbers from the super-bucket and the bucket are added in constant time, and then the number of occurrences within the bucket up to row i is counted while decompressing the bucket, taking O(b) time. Thus, each Occ computation takes O(b) time.
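By way of illustration only, the following Python sketch (not part of the claimed invention) shows the two-level counting structure; for clarity the buckets are left uncompressed here, whereas Ferragina and Manzini store them compressed, so their final scan also decompresses:

```python
class BucketedOcc:
    def __init__(self, L, bucket_size=4, buckets_per_super=4):
        self.L = L
        self.b = bucket_size
        self.sb = bucket_size * buckets_per_super
        self.super_counts = []   # per super-bucket: counts since start of index
        self.bucket_counts = []  # per bucket: counts since start of its super-bucket
        totals, since_super = {}, {}
        for i, ch in enumerate(L):
            if i % self.sb == 0:
                self.super_counts.append(dict(totals))
                since_super = {}
            if i % self.b == 0:
                self.bucket_counts.append(dict(since_super))
            totals[ch] = totals.get(ch, 0) + 1
            since_super[ch] = since_super.get(ch, 0) + 1

    def occ(self, ch, r):
        # Constant-time lookups in the two count tables, plus an O(b)
        # scan of row r's own bucket.
        bucket = r // self.b
        count = self.super_counts[r // self.sb].get(ch, 0)
        count += self.bucket_counts[bucket].get(ch, 0)
        count += self.L[bucket * self.b : r + 1].count(ch)
        return count
```

The stored counts make the answer independent of everything before the bucket, so only O(b) characters are ever scanned per query.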
A method of searching text using an FM-index is known as backward search, which computes the range of rows beginning with a particular string using O(m) Occ computations. It therefore takes O(mb) total time in Ferragina and Manzini's implementation.
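By way of illustration only, the following Python sketch (not part of the claimed invention, with Occ computed naively rather than from compressed buckets) narrows the row range one pattern character at a time, right to left:

```python
def backward_search(L, pattern):
    # C[ch]: index of the first row of F beginning with ch.
    C = {}
    for i, ch in enumerate(sorted(L)):
        C.setdefault(ch, i)

    def occ(ch, r):
        # Occurrences of ch in L[0..r]; r = -1 gives 0.
        return L[:r + 1].count(ch)

    # [lo, hi) is the half-open range of rows beginning with the
    # suffix of the pattern processed so far.
    lo, hi = 0, len(L)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ(ch, lo - 1)
        hi = C[ch] + occ(ch, hi - 1)
        if lo >= hi:
            return 0, 0
    return lo, hi                  # hi - lo = number of occurrences
```

Each pattern character costs two Occ computations, giving the O(m) Occ computations noted above.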
An FM-index supports queries to find the location of a match. To reduce the size of the index, the FM-index stores location information for only a fraction of the rows. Given a mark character Z, the FM-index stores the offsets for each row in M ending with Z. To find the location of an occurrence in the original text, the FM-index uses the LF-mapping to go backwards until it finds a marked row. The mark character may be an existing character that occurs with an appropriate frequency, or a special character inserted into the text. When a special mark character is inserted into the text, the method must search for abra, aZbra, abZra, and abrZa (where Z is the mark character and the word being searched for is abra). In general, the implementation must search for min(k,m) patterns to do a single count operation.
The FM-index implementation assumes that the compressed index fits into main memory, which translates into a limitation on the size of the corpus. In an application, a larger corpus must be divided into segments, where each segment is indexed separately, but each index must then be queried during a search. Thus, the search time will be linear in the number of indexes. As a result, it is desirable to be able to create large indexes.
When searching an FM-index that is larger than main memory, each operation might require a disk seek. In particular, the main process of using the LF-mapping to go backwards is a random-access process, and so each operation might require a disk seek time on the order of 6 ms.
To understand the magnitude of this problem, consider an FM-index built with the suggested parameter k=20 (marking 5% of the characters). Finding the location of a row takes 20*6 ms=0.12 s. Suppose that a user wants the locations of 1000 rows (possibly returned by a count operation). Then the query could take about 2 minutes, including time for the count operation. At the same time, a modern hard disk can read data sequentially at around 50 MB/sec. Assuming the hard disk is the bottleneck, 6 gigabytes could be sequentially scanned for matches in 2 minutes. Thus, in order for this FM-index to be faster than scanning, the collection would have to be larger than 6 gigabytes. As a result, a naive on-disk implementation of the FM-index does not necessarily present a practical alternative to scanning.
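The arithmetic above can be checked with a short back-of-the-envelope computation (illustrative only; the seek time and transfer rate are the nominal figures quoted above):

```python
seek_time = 0.006        # seconds per random disk seek (6 ms)
k = 20                   # LF steps per location query (5% of rows marked)
locations = 1000         # number of rows to locate

locate_time = k * seek_time              # 0.12 s per location
total = locations * locate_time          # 120 s, i.e. about 2 minutes
scan_rate = 50e6                         # bytes/second sequential read
break_even = total * scan_rate / 1e9     # about 6 gigabytes scannable in that time
```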
It is worth noting that a solid state disk could potentially solve this problem. A flash memory “disk” of several gigabytes is relatively low in cost and allows fast random access. Since flash memory does not have a seek penalty, the FM-index implementation would perform much better on it. However, flash memory is more expensive than hard disks per gigabyte of storage, and the present invention is directed to improving the FM-index to operate better on a hard disk.
U.S. Pat. No. 6,535,642, entitled “APPROXIMATE STRING MATCHING SYSTEM AND PROCESS FOR LOSSLESS DATA COMPRESSION,” discloses a method for compressing data employing an approximate string matching scheme. An encoder characterizes source data as a set of pointers and blocks of residual data. The pointers identify both the number of source data and their location, whereas the residual data identifies the distance between source data. The method compresses the data using an entropy-based compression scheme that takes into account the minimum entropy between the source data and the residual data. Text is retrieved by decompressing residual data starting from a pre-determined offset of the first data block. Text is decoded in a backwards-searching scheme. The present method does not use source data and residual data. U.S. Pat. No. 6,535,642 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 6,751,624, entitled “METHOD AND SYSTEM FOR CONDUCTING A FULL TEXT SEARCH ON A CLIENT SYSTEM BY A SERVER SYSTEM,” discloses a method of searching text from a remote computer using a Burrows-Wheeler transform. After the text is compressed using the transform, the information is sent to the server, which decompresses the information and creates a suffix array. A second user may then search the information on the server. The invention does not address the issue of large data searches. The present method is not limited in this regard. U.S. Pat. No. 6,751,624 is hereby incorporated by reference into the specification of the present invention.
U.S. patent application Ser. No. 10/916,370, entitled “SYSTEM AND METHOD FOR PATTERN RECOGNITION IN SEQUENTIAL DATA,” discloses a method of encoding sequential data. The method generates a symbol feature map that associates a feature with a symbol, and a set of associated statistics. Next, the method creates a set of sieves to sort the symbols. The method then passes a data vector through a selected sieve for processing, and if enough symbols align, stops processing, otherwise moving to another sieve. The present method does not decode symbols by passing data through a set of sieves. U.S. patent application Ser. No. 10/916,370 is hereby incorporated by reference into the specification of the present invention.
Known non-patents include:
“An experimental study of an opportunistic index” by P. Ferragina and G. Manzini, Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, pp. 390-398, 2000.
“When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications” by R. Grossi, A. Gupta, and J. Vitter, Proc. SODA '04, pp. 636-645, 2004.
“Advantages of Backward Searching—Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays” by V. Mäkinen, G. Navarro, K. Sadakane, International Symposium on Algorithms and Computation, pp. 681-692, 2004.
“Fast Text Searching for Regular Expressions or Automaton Searching on Tries” by R. Baeza-Yates and G. Gonnet, Journal of the ACM, vol. 43, no. 6, November 1996, pp. 915-936.
“A Block-sorting Lossless Data Compression Algorithm” by M. Burrows and D. J. Wheeler, Digital Equipment Corporation SRC Research Report, May 10, 1994.
“Compressed Full-Text Indexes” by G. Navarro and V. Mäkinen, ACM Computing Surveys, 2006.