1. Field of the Invention
The invention relates to a method of making string searches in computer databases using gram-based inverted-list indexing structures.
2. Description of the Prior Art
In computerized data analysis methods and machines, many applications need to solve the following problem of approximate string matching: from a collection of strings, find those that are similar to a given string, or to the strings in another (possibly the same) collection of strings.
Previous techniques convert a string to a set of fixed-length grams, and then use the grams to build indexing structures to do a search. The fixed-length grams are substrings of a string to be used as signatures to identify similar strings. If a data set has grams that are very popular, that is, grams that occur in a large fraction of the strings, these algorithms can perform very poorly.
Many information systems need to support approximate string queries: given a collection of textual strings, such as person names, telephone numbers, and addresses, find the strings in the collection that are similar to a given query string. The following are a few applications. In record linkage, we often need to find from a table those records that are similar to a given query string that could represent the same real-world entity, even though they have slightly different representations, such as Spielberg versus Spielburg.
In Web search, many search engines provide the “Did you mean” feature, which can benefit from the capability of finding keywords similar to a keyword in a search query. Other information systems such as Oracle and Lucene also support approximate string queries on relational tables or documents. Various functions can be used to measure the similarity between strings, such as edit distance (a.k.a. Levenshtein distance), Jaccard similarity, and cosine similarity.
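By way of illustration only, two of the similarity functions named above can be sketched as follows. The sketch assumes that Jaccard and cosine similarity are computed over the 2-grams of each string; that choice, and the helper names qgrams, jaccard, and cosine, are assumptions made for this example and are not taken from any particular system.

```python
import math
from collections import Counter


def qgrams(s, q=2):
    """Return the overlapping substrings of length q in s (hypothetical helper)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b (assumed definition)."""
    grams_a, grams_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not grams_a and not grams_b:
        return 1.0  # two strings with no grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def cosine(a, b, q=2):
    """Cosine similarity of the q-gram count vectors of a and b (assumed definition)."""
    counts_a, counts_b = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


j = jaccard("Spielberg", "Spielburg")
c = cosine("Spielberg", "Spielburg")
```

Under these assumptions, Spielberg and Spielburg share 6 of their 10 distinct 2-grams, giving a Jaccard similarity of 0.6 and a cosine similarity of approximately 0.75.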
Many algorithms have been developed using the idea of “grams” of strings. A q-gram of a string is a substring of length q that can be used as a signature for the string. For example, the 2-grams of the string bingo are bi, in, ng, and go. These algorithms rely on an index of inverted lists of grams for a collection of strings to support queries on this collection. Intuitively, we decompose each string in the collection into grams and build an inverted list for each gram, which contains the IDs of the strings that contain this gram. For instance, FIG. 1 shows a collection of 5 strings and the corresponding inverted lists of their 2-grams.
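The decomposition into grams and the construction of inverted lists can be sketched as follows. This is a minimal illustration only: the three sample strings are hypothetical and are not the collection of FIG. 1, and the function names qgrams and build_inverted_lists are assumptions made for this example.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the overlapping substrings of length q in s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of IDs of the strings containing it.

    The ID of a string is taken to be its position in the input list
    (an assumption made for this sketch).
    """
    lists = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            lists[gram].add(string_id)
    return {gram: sorted(ids) for gram, ids in lists.items()}


# Hypothetical sample collection, not the one shown in FIG. 1.
collection = ["bingo", "bring", "going"]
index = build_inverted_lists(collection)

for gram in sorted(index):
    print(gram, index[gram])
```

In this sample, the grams in and ng occur in all three strings, so their inverted lists hold every string ID, while the list for bi holds only the ID of bingo.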
The algorithms answer a query using the following observation: if a string r in the collection is similar enough to the query string, then r should share a certain number of common grams with the query string. Therefore, we decompose the query string into grams, and locate the corresponding inverted lists in the index. We find those string IDs that appear at least a certain number of times on these lists, and these candidates are post-processed to remove the false positives.
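The query procedure just described can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the occurrence threshold min_shared is left as a caller-supplied parameter because the text does not fix its value, edit distance is assumed as the similarity function for post-processing, and the sample strings and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the overlapping substrings of length q in s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each q-gram to the set of IDs of the strings containing it."""
    index = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(string_id)
    return index


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def search(query, strings, index, min_shared, max_dist, q=2):
    """Return the IDs of strings within max_dist edits of query."""
    # Count how many times each string ID appears on the query's inverted lists.
    counts = Counter()
    for gram in qgrams(query, q):
        for string_id in index.get(gram, ()):
            counts[string_id] += 1
    # Keep the IDs that appear at least min_shared times (the candidates).
    candidates = [sid for sid, n in counts.items() if n >= min_shared]
    # Post-process the candidates to remove false positives.
    return sorted(sid for sid in candidates
                  if edit_distance(query, strings[sid]) <= max_dist)


# Hypothetical sample collection and parameter values.
collection = ["bingo", "bring", "going", "bongo"]
index = build_index(collection)
result = search("bingp", collection, index, min_shared=2, max_dist=1)
```

With these sample values, the strings bingo, bring, and going each appear at least twice on the inverted lists of the query grams and become candidates, while bongo is filtered out. Post-processing then keeps only bingo, whose edit distance to the query is 1.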
These gram-based inverted-list indexing structures are “notorious” for their large size relative to the size of their original string data. This large index size causes problems for applications. For example, many systems require very high real-time performance when answering a query. This requirement is especially important for applications that adopt a Web-based service model. Consider online spell checkers used by email services such as Gmail, Hotmail, and Yahoo! Mail, which have millions of online users. They need to process many user queries each second. There is a big difference between a 10 ms response time and a 20 ms response time: when queries are answered one at a time, the former means a throughput of 100 queries per second (QPS), while the latter means 50 QPS. Such a high-performance requirement can be met only if the index is in memory.
In another scenario, consider the case where these algorithms are implemented inside a database system, which can allocate only a limited amount of memory for the inverted-list index, since many other tasks in the database system also need memory. In both scenarios, it is critical to reduce the index size as much as possible to meet a given space constraint.