Approximate string matching has received considerable attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures specifically tailored for document retrieval. Such similarity measures include TF/IDF (term frequency/inverse document frequency), a statistical measure used in information retrieval and text mining to evaluate how important a word is to a document in a collection or corpus; BM25 (also known as “Okapi BM25”), a ranking function, developed in the 1970s and 1980s by Stephen E. Robertson, Karen Sparck Jones, and others, that search engines use to rank matching documents according to their relevance to a given search query; and HMM (hidden Markov model), a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and in which the hidden parameters are determined from the observable parameters.
As applications that retrieve short strings become popular (e.g., local search engines like YellowPages.com, Yahoo! Local, and Google Maps), new indexing methods tailored for short strings are needed. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length normalized similarity measures. A common denominator of indexes for length normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly because of the data dependent, precomputed weights associated with each distinct token or string. In the prior art, updates are usually incorporated by rebuilding the indexes at regular time intervals.
The prior art is mainly concerned with document retrieval speeds. However, given that queries often contain spelling mistakes and other errors, and that stored data have inconsistencies as well, dealing effectively with short strings requires the use of specialized approximate string matching indexes and algorithms. Although documents are fundamentally long strings, the prior art, in general, makes assumptions that do not hold when dealing with shorter strings. For example, the frequency of a term in a document might suggest that the document is related to a particular query or topic with high probability, while the frequency of a given token or word in a string does not imply that a longer string (containing more tokens) is more similar to the query than a shorter string. Similarly, the preference for shorter documents over longer documents (the scores of short documents are boosted according to the parsimony rule from information theory) conflicts with the fact that, in practice, users issuing short queries expect almost exact answers (answers of length similar to the length of the query) the vast majority of the time. This is compounded by the fact that the length of short strings does not vary as much as the length of documents in the first place, making some length normalization strategies ineffective. Moreover, certain other properties of short strings make it possible to design specialized approximate string matching indexes that are very fast in practice.
In many applications it is not uncommon to have to execute multiple types of searches in parallel in order to retrieve the best candidate results for a particular query, and to use a final ranking step to combine the results. Examples of search types include: almost exact search versus sub-string search; search that ignores special characters; full string search versus per word search; n-gram search (where ‘n’ is the length of the component strings into which the data is broken for indexing; for example, 2-grams, 3-grams, 4-grams, etc.); and edit distance search versus TF/IDF search.
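As an illustration of the n-gram decomposition mentioned above, the following Python sketch breaks a short string into overlapping n-grams. The function name `ngrams`, the `#` padding character, and the lowercasing step are assumptions made for this example only; they are not taken from any particular system described here.

```python
def ngrams(s: str, n: int = 3, pad: str = "#") -> list[str]:
    """Break a string into overlapping n-grams.

    Padding both ends with n - 1 copies of `pad` (an assumed convention)
    lets the first and last characters appear in as many n-grams as the
    interior characters do.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padded = pad * (n - 1) + s.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# 2-grams and 3-grams of the same short string.
print(ngrams("Main", 2))
print(ngrams("Main", 3))
```

A smaller n tolerates more spelling errors but yields less selective tokens, which is one reason a system might index the same data under several values of n.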
Recently, M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, in “Fast indexes and algorithms for set similarity selection queries”, IEEE (Institute of Electrical and Electronics Engineers) International Conference on Data Engineering (ICDE), designed specialized index structures using L2 length normalization that enable retrieval of almost exact matches at little computational cost by using very aggressive pruning strategies. Nevertheless, the drawback of this approach is that the indexes are computationally expensive to construct and do not support incremental updates. Generally speaking, even though various types of length normalization strategies have been proposed in the past, approaches whose strict properties enable aggressive index pruning are hard to maintain incrementally, while simpler normalization methods are easier to maintain but suffer in terms of query efficiency and result quality, yielding slower answers and significantly larger (i.e., fuzzier) candidate sets.
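To make the link between length normalization and pruning concrete, the following Python sketch assumes a cosine-style measure in which each token t carries a weight w(t), the L2 length of a string is the square root of the sum of its squared token weights, and the similarity is the sum of squared weights of the shared tokens divided by the product of the two lengths. Under that assumption the similarity can never exceed the ratio of the shorter length to the longer one, so for a threshold tau only strings whose length lies between tau * len(q) and len(q) / tau can qualify. All names and the uniform weights below are hypothetical, and the sketch is not claimed to reproduce the exact measure or index of the cited work.

```python
import math


def l2_length(tokens, weight):
    """L2 length of a token set: sqrt of the sum of squared token weights."""
    return math.sqrt(sum(weight[t] ** 2 for t in set(tokens)))


def similarity(q, s, weight):
    """Assumed L2-normalized similarity between two token sets."""
    common = set(q) & set(s)
    shared = sum(weight[t] ** 2 for t in common)
    return shared / (l2_length(q, weight) * l2_length(s, weight))


def length_bounds(q_len, tau):
    """Range of L2 lengths a string must fall in to reach similarity tau."""
    return tau * q_len, q_len / tau


# Placeholder weights: every token weighs 1.0.
weight = {t: 1.0 for t in "abcd"}
query = ["a", "b", "c", "d"]
low, high = length_bounds(l2_length(query, weight), tau=0.8)

for candidate in (["a"], ["a", "b", "c"]):
    length = l2_length(candidate, weight)
    if low <= length <= high:
        print(candidate, "kept, similarity", similarity(query, candidate, weight))
    else:
        print(candidate, "pruned on length alone")
```

Because the bound depends only on precomputed lengths, a candidate can be discarded without examining its tokens; the cost is that those lengths depend on the token weights and therefore on the data.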
A key issue to deal with in a real system is that data is continuously updated. Even a small number of updates to the dataset would necessitate near complete recomputation of a normalized index, since such indexes are sensitive to the total number of records in the dataset and to the distribution of terms (n-grams, words, etc.) within the strings. Given that datasets tend to contain tens of millions of strings and that strings could be updated on an hourly basis, recomputation of the indexes can be prohibitively expensive. In most practical cases, updates are buffered and the indexes are rebuilt on a weekly basis. Index recomputation typically takes up to a few hours to complete. However, the online nature of some applications necessitates reflecting updates to the data as soon as possible. Hence, being able to support incremental updates as well as very efficient query evaluation are critical requirements.
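The sensitivity to the total number of records can be seen in a small Python sketch. It assumes an inverse document frequency weight of the form idf(t) = log(1 + N / df(t)), where N is the number of strings and df(t) is the number of strings containing token t; this formula, the unpadded 2-gram tokenization, and the toy data are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tokens(s, n=2):
    """Split s into overlapping n-grams (no padding, for brevity)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def idf_weights(strings, n=2):
    """Assumed weighting: idf(t) = log(1 + N / df(t))."""
    df = Counter()
    for s in strings:
        df.update(set(tokens(s, n)))  # count each token once per string
    total = len(strings)
    return {t: math.log(1 + total / count) for t, count in df.items()}


def stale_tokens(before, after):
    """Tokens present before the update whose weight no longer matches."""
    return {t for t, w in before.items() if not math.isclose(w, after[t])}


data = ["main st", "main rd", "park ave"]
before = idf_weights(data)
after = idf_weights(data + ["oak st"])  # a single inserted string
stale = stale_tokens(before, after)
print(f"{len(stale)} of {len(before)} token weights changed after one insert")
```

In this toy dataset a single insertion changes N and therefore the weight of every existing token, including tokens that do not occur in the new string; any precomputed string lengths derived from those weights become stale as well.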
In N. Koudas, A. Marathe, and D. Srivastava, “Propagating updates in SPIDER” (which may be found on pages 1146-1153, 2007 ICDE), two techniques were proposed for enabling propagation of updates to the inverted indexes. The first was blocking the updates and processing them in batch. The second was thresholding updates and performing propagation in multiple stages down the index, depending on the update cost one is willing to tolerate. That work presented heuristics that perform well in practice, based on various observations about the distribution of tokens in real data, but it did not provide any theoretical guarantees with respect to answer accuracy during the period in which updates have not yet been fully propagated.
Thus, length normalized index structures for approximate string matching remain inefficient to maintain, in large part because of the data dependent, normalized weights associated with each distinct token or string in the database.