The subject matter disclosed herein relates to assessments of string similarity, and more particularly, to techniques for determining a similarity or distance between character strings and providing metrics of similarity or distance.
While numeric data is easy to compare and correlate, the same cannot be said for character strings. For example, techniques to compare strings must assess whether "cat" is similar to "bat" and how similar both are to "chat". One approach to determining this similarity uses the edit distance, that is, the minimum number of character insertions, deletions, substitutions, and transpositions of adjacent characters needed to transform one character string into another. This edit distance is also known as the Damerau-Levenshtein distance. While this metric produces useful measures of similarity, algorithms traditionally used to compute it have an O(nm) running time, where m and n are the lengths of the input strings. Faster algorithms that run in linear time do exist, but they are constrained by factors such as machine word size, or they restrict the range of input strings on which they can operate.
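By way of illustration only, the traditional O(nm) computation can be sketched as a dynamic program over a table of prefix distances. The sketch below is an assumption-laden example and not the technique disclosed herein: the function name `osa_distance` is hypothetical, and the code implements the restricted variant commonly called optimal string alignment, in which a transposed pair of characters is not edited again, rather than the unrestricted Damerau-Levenshtein distance.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative sketch only. Runs in O(n*m) time and space, where
    n = len(a) and m = len(b).
    """
    n, m = len(a), len(b)

    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if (
                i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[n][m]
```

Under this sketch, "cat" and "bat" are at distance 1 (one substitution), "cat" and "chat" are at distance 1 (one insertion), and "bat" and "chat" are at distance 2. The two nested loops visit every cell of the (n+1) by (m+1) table once, which is the source of the O(nm) running time noted above.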