A string metric, or string distance function, is a metric that measures the distance (“inverse similarity”) between two text strings for approximate string matching. A string metric returns a number whose interpretation as a distance is specific to the algorithm. The most widely known string metric is a rudimentary one referred to as the “Levenshtein” distance, which counts the minimum number of single-character insertions, deletions, or substitutions needed to transform one string into the other. String matching may be used in a variety of applications, including data quality, searching, clustering, and other approaches to data analysis. In a simple example, two strings may differ by the perturbation of a single character. For instance, the strings “Mississippi” and “Mississippe” differ in a single character. However, differences between related strings may be much more complicated.
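To make the single-character example concrete, the following is a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The function name `levenshtein` and the two-row layout are illustrative choices, not taken from any particular implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the first i-1 characters
    # of `a` and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("Mississippi", "Mississippe"))  # 1
print(levenshtein("kitten", "sitting"))           # 3
```

A distance of 1 for “Mississippi” and “Mississippe” reflects the single substituted character. Larger values indicate strings that need more edits to be made equal.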
String similarity search or analysis uses a string metric and can be thought of as the problem of searching through a list of string values (e.g., people or place names, controlled vocabularies, etc.) for values similar to a given input string. Known solutions for string similarity search have generally involved either repurposing general-purpose search engine technology or employing database technology to search the values in a given column. The former is better suited to indexing and retrieving documents and web pages, and can produce sub-optimal results for the string similarity task because it relies heavily on prefix filtering. The latter is generally well suited to the task, except that it introduces latency by requiring a round-trip to a database server. In general, both of these known solutions introduce latency, which makes a search-as-you-type use case (i.e., one in which search results update as a user types the input query) much more challenging.
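As a rough illustration of the search problem described above, and not of either known solution, the sketch below performs a brute-force, in-memory scan over a small list of place names. It assumes Python's standard `difflib.SequenceMatcher` ratio as a stand-in similarity score (a similarity in the range 0 to 1 rather than a distance). The function name `similarity_search`, the `limit` and `cutoff` parameters, and the case-insensitive comparison are hypothetical choices made for the example.

```python
import difflib


def similarity_search(query, candidates, limit=3, cutoff=0.6):
    """Rank `candidates` by similarity to `query` using a linear scan."""
    scored = []
    for candidate in candidates:
        # Assumption: compare case-insensitively; ratio() is 1.0 for
        # identical strings and 0.0 when nothing matches.
        score = difflib.SequenceMatcher(
            None, query.lower(), candidate.lower()
        ).ratio()
        if score >= cutoff:
            scored.append((score, candidate))

    # Best score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for _, candidate in scored[:limit]]


places = ["Mississippi", "Missouri", "Minnesota", "Michigan"]
print(similarity_search("Mississippe", places, limit=1))  # ['Mississippi']
```

In a search-as-you-type setting, a function like this would be called again on each keystroke. A linear scan of this kind compares the query against every value, so it is only a starting point for small lists.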