1. Field
The present disclosure broadly relates to the fields of databases and web services, and specifically, to characterizing string similarity.
2. Description of the Related Art
The description of the related art provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in the background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Today, a device measuring string similarity typically determines the computational cost of converting a first string into a second string without considering the distance between the strings that results from an insertion into, or a deletion from, the middle of a string. Commonly utilized string similarity metrics include Edit-Distance, Hamming Distance, and Longest Common Subsequence (LCS).
The Edit-Distance (Levenshtein Distance) string similarity metric is used to determine the computational cost of making the minimum number of edits when converting a first string into a second string. The Edit-Distance metric includes operations of insertion, deletion, and substitution, with each operation having equivalent computational costs regardless of the location of the operation within the string. For example, when the Edit-Distance metric is used to determine the computational cost of converting a first string “abcefdgh” into a second string “abcdgh,” the computational cost becomes two, as the delete operation was performed twice for the deletion of “e” and “f” from the first string. The Edit-Distance metric fails to account for the computational cost of shifting the remaining characters of the first string after the deletion operation.
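By way of non-limiting illustration, the Edit-Distance computation described above may be sketched as follows. The function name edit_distance and the unit cost assigned to each operation are assumptions made for this example only; the sketch uses the conventional dynamic-programming formulation and is not presented as any particular device's implementation.

```python
def edit_distance(first: str, second: str) -> int:
    """Return the minimum number of insertions, deletions, and
    substitutions (each with unit cost) that turn `first` into `second`."""
    # previous[j] holds the cost of converting the processed prefix of
    # `first` into the first j characters of `second`.
    previous = list(range(len(second) + 1))
    for i, ch_first in enumerate(first, start=1):
        current = [i]  # converting i characters into the empty string
        for j, ch_second in enumerate(second, start=1):
            substitution_cost = 0 if ch_first == ch_second else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Example from the description: deleting "e" and "f" costs two edits.
print(edit_distance("abcefdgh", "abcdgh"))
```

Note that each operation contributes the same cost wherever it occurs in the string, which is the location-independence noted above.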
The Hamming Distance string similarity metric is used to determine the computational cost of converting one string into another string as the number of positions at which the characters of the two strings differ. For example, the Hamming Distance determines that the computational cost of converting between a first string “abcdefg” and a second string “abxcdefg” is six, as six characters of the second string, “xcdefg,” differ in position from those of the first string. As seen, the Hamming Distance may become large when a single differing character interposes the characters of the first string.
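By way of non-limiting illustration, the position-by-position comparison described above may be sketched as follows. The Hamming Distance is conventionally defined only for strings of equal length; because the example strings differ in length, this sketch assumes that each unmatched trailing position counts as one difference. That convention and the function name positional_mismatches are assumptions made for this example only.

```python
from itertools import zip_longest


def positional_mismatches(first: str, second: str) -> int:
    """Count the positions at which two strings differ.

    Assumption: when the strings differ in length, every position that
    exists in only one of the strings counts as a mismatch.
    """
    # zip_longest pads the shorter string with None, which never equals
    # a real character, so padded positions are counted as differences.
    return sum(
        1 for ch_first, ch_second in zip_longest(first, second)
        if ch_first != ch_second
    )


# Example from the description: one inserted "x" misaligns every later character.
print(positional_mismatches("abcdefg", "abxcdefg"))
```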
The LCS string similarity metric is used to determine the computational cost of converting one string into another string as the length of the longest subsequence of characters shared between the two strings. For example, the LCS determines that the computational cost of converting between a first string “abc” and a second string “acb” is two, as the LCS is either “ab” or “ac.” The LCS does not take into account the distance of the characters between the two strings.
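By way of non-limiting illustration, the length of the longest common subsequence may be computed as sketched below. The function name lcs_length is an assumption made for this example only, and the sketch uses the conventional dynamic-programming formulation rather than any particular device's implementation.

```python
def lcs_length(first: str, second: str) -> int:
    """Return the length of the longest subsequence common to both strings."""
    # previous[j] holds the LCS length of the processed prefix of `first`
    # and the first j characters of `second`.
    previous = [0] * (len(second) + 1)
    for ch_first in first:
        current = [0]
        for j, ch_second in enumerate(second, start=1):
            if ch_first == ch_second:
                # Matching characters extend the best subsequence found so far.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry forward the better of skipping either character.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


# Example from the description: "ab" and "ac" are both common subsequences.
print(lcs_length("abc", "acb"))
```

The result depends only on which characters are shared in order, not on how far apart the matched characters sit in either string.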
String similarity metrics gain importance when measuring the similarity between, for example, two domain names, nucleotide sequences, device names, or other strings; however, current string similarity metrics do not account for shifting of the middle characters of a string. Further, given the nature of such strings and the limitations of the present metrics, an additional metric is needed to more fully characterize strings in which a middle character is shifted.