1. Field of the Invention
Embodiments of the invention relate to data processing. More specifically, embodiments of the invention relate to calculating a measure of similarity for two names, each represented by a character string.
2. Description of the Related Art
In comparing character strings, algorithms are available that measure how “close” two strings are to one another. Typically, such algorithms measure “closeness” based on the number of individually matching characters and on the positional proximity of matching characters. One commonly used algorithm for comparing character strings is the public-domain Jaro-Winkler algorithm for string correlation. The Jaro-Winkler algorithm assigns a score that accounts for the following: length of both strings, percentage of common characters in each string, missing characters, mismatched characters, and letters that have been swapped with one another.
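By way of illustration only, the following Python sketch shows one common formulation of the Jaro-Winkler score. The function names `jaro` and `jaro_winkler`, the prefix scaling factor of 0.1, and the four-character prefix limit are assumptions drawn from the widely published form of the algorithm. They are not taken from this disclosure, and other variants exist.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as "common" only if they lie within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are swapped letters.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


# Two swapped letters ("TH" versus "HT") still yield a high score, about 0.961.
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
```

The score is computed over the two character strings as a whole, so it has no notion of separate name elements or of their order within a name.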
However, when applied to name comparisons, simple string correlation techniques such as Jaro-Winkler have proven inadequate. To compare names properly, one must consider not only whether the individual words or name elements are similar, but also how the entire name is assembled. For example, the name Thomas Joe Allen could easily be altered to Joseph Alan Thomas, and none of the words would match in position. Differences between two names being compared may result from how people write their names in formal versus informal situations, such as “James” versus “Jimmy,” or from unintentional errors. For example, when filling out a form, someone may write their name as “James, Robert.” If this is incorrectly entered as “James Roberts,” then a simple string comparison fails to match these names. In other cases, individuals may write different permutations of their names in an attempt to hide their identity. Consider hotel registrations at casino resorts. Individuals are sometimes banned from a particular casino. In such a case, the banned individual may attempt to register at the hotel using a false name that is similar to their actual one. In each of these examples, applying conventional string correlation algorithms fails to identify that the two names are very similar to one another.
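The positional problem in the Thomas Joe Allen example can be made concrete with a short Python sketch. The helper names `positional_matches` and `shared_elements` are hypothetical and serve only to illustrate the point. The sketch compares exact name elements only and is not a similarity measure in itself.

```python
def positional_matches(name_a: str, name_b: str) -> int:
    """Count name elements that are identical at the same position."""
    elements_a = name_a.lower().split()
    elements_b = name_b.lower().split()
    return sum(1 for a, b in zip(elements_a, elements_b) if a == b)


def shared_elements(name_a: str, name_b: str) -> set:
    """Return name elements present in both names, ignoring position."""
    return set(name_a.lower().split()) & set(name_b.lower().split())


original = "Thomas Joe Allen"
altered = "Joseph Alan Thomas"

print(positional_matches(original, altered))  # 0: no element matches in place
print(shared_elements(original, altered))     # {'thomas'}
```

Under these assumptions, a position-by-position comparison finds nothing in common, even though one element is identical and the other two are close variants of each other.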
Accordingly, there is a need in the art for a method for assigning a similarity measure to names.