This disclosure relates generally to data analysis in a data processing system and more specifically to data analytics of character strings using descriptors in the data processing system.
A typical problem is an apparent lack of tools or methods to reliably find patterns among words in a portion of text. The problem is compounded by a further requirement to accommodate a level of variability and keyboard typing error tolerance associated with the respective characters comprising the text string.
Currently there are diverse string comparison methods, some of which use brute-force techniques during comparisons. A comparison typically proceeds character by character through each word of a text string. The words are compared to verify whether characters match, and not whether the characters are “close” to one another. One example string comparison calculates a “distance” between two strings of equal length as the number of positions at which the corresponding symbols of the strings being compared are different, as in a Hamming distance.
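By way of illustration only, the equal-length positional comparison described above may be sketched as follows. The function name `hamming_distance` and the choice to raise an error for strings of unequal length are assumptions made for this sketch and are not taken from any particular implementation.

```python
def hamming_distance(first: str, second: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    # The measure is only defined for strings of equal length; rejecting
    # other inputs is an assumption made for this illustration.
    if len(first) != len(second):
        raise ValueError("Hamming distance requires strings of equal length")
    # Compare the strings position by position and count the mismatches.
    return sum(a != b for a, b in zip(first, second))


if __name__ == "__main__":
    print(hamming_distance("karolin", "kathrin"))
```

The sketch treats every mismatch identically, so a substitution of an adjacent keyboard key counts the same as any other substitution, which illustrates the exact-match limitation noted above.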
In another example, a comparison is performed in which each character of a first string is compared with the characters of a second string to identify matching characters. In this example, the number of matching characters that appear in a different sequence order is divided by 2 to define a number of transpositions of the characters. This example may be referred to as a Jaro distance, of which a Jaro-Winkler distance is a variant.
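By way of illustration only, the matching-and-transposition comparison described above corresponds closely to what is commonly called a Jaro similarity, and may be sketched as follows. The matching window (half the longer length, minus one) and the final averaging of three ratios follow the commonly cited definition and are assumptions of this sketch rather than details stated above. The function name `jaro_similarity` is likewise hypothetical.

```python
def jaro_similarity(first: str, second: str) -> float:
    """Return a similarity in [0, 1]; 1.0 means the strings are identical."""
    if first == second:
        return 1.0
    len1, len2 = len(first), len(second)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and no farther apart than this window
    # (assumed convention: half the longer length, minus one).
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(first):
        start = max(0, i - window)
        stop = min(len2, i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and second[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Matching characters, in the order they occur in each string.
    seq1 = [c for c, m in zip(first, matched1) if m]
    seq2 = [c for c, m in zip(second, matched2) if m]

    # Matching characters in a different sequence order, divided by 2.
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


if __name__ == "__main__":
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
```

For the strings "MARTHA" and "MARHTA", all six characters match and two of them ("T" and "H") are out of order, giving one transposition under this convention.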
In another example, a Levenshtein distance is a measure of a difference between two string sequences, calculated as a minimum number of single-character edits, comprising insertions, deletions or substitutions, required to change a first word into a second word.
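By way of illustration only, the minimum-edit calculation described above may be sketched using a conventional dynamic programming approach that keeps two rows of the edit table. The function name `levenshtein_distance` and the unit cost assigned to each edit are assumptions of this sketch.

```python
def levenshtein_distance(first: str, second: str) -> int:
    """Return the minimum number of single-character edits between two words."""
    # previous[j] holds the distance between the processed prefix of `first`
    # and the first j characters of `second`. The initial row is the cost of
    # building each prefix of `second` from an empty string by insertions.
    previous = list(range(len(second) + 1))

    for i, a in enumerate(first, start=1):
        # Reducing the first i characters of `first` to an empty string
        # takes i deletions.
        current = [i]
        for j, b in enumerate(second, start=1):
            substitution_cost = 0 if a == b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(levenshtein_distance("kitten", "sitting"))
```

As with the positional comparison, each substitution in this sketch carries the same unit cost regardless of which characters are involved.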
In current solutions, there are limitations associated with typing error tolerance and a lack of support for a desired amount of variability in matching.