String matching, or determining the degree of overlap between two strings, is an important component of many data quality processes. In a simple example, two strings may differ by the perturbation of a single character. For instance, the strings “Mississippi” and “Mississippe” differ with respect to a single character. However, differences between related strings may be much more complicated. Consider the following examples of source and target strings in the context of matching the source text to the target text.
Source string: 10 ohm 5% ¼ watt res
Target strings:
RES, CF ¼ WATT, 5% 10 OHM
RES, CF ¼ WATT, 5% 100 OHM
RESISTOR 5% ¼ WATT 10 OHM

or,

Source string: Chevrolet Division, General Motors
Target strings:
GM, Chevrolet Div
General Motors, Chevy Div
In these examples, simple string matching techniques may suggest that the source and target text strings are very different even though the meaning of the strings is the same. The challenge previously faced is how to determine that these strings refer to the same thing and, further, how to scale the solution to very large data sets.
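As an illustration only, the gap between the two kinds of examples can be made concrete with a character-level edit distance. The sketch below assumes Levenshtein distance as a stand-in for a "simple string matching technique"; the text does not name a specific technique, and the function names levenshtein and similarity are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (assumed stand-in metric)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# A single perturbed character costs exactly one edit.
print(levenshtein("Mississippi", "Mississippe"))  # 1

# The resistor description scores lower, even after lowercasing, because
# the tokens are reordered and abbreviated.
print(similarity("10 ohm 5% ¼ watt res".lower(),
                 "RES, CF ¼ WATT, 5% 10 OHM".lower()))
```

Under this assumed metric, the single-character case scores as nearly identical, while the resistor pair scores lower because a character-level comparison penalizes reordering and abbreviation.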
Prior approaches proposed for performing string matching analysis include deterministic matching and fuzzy matching. Deterministic matching involves a cascading sequence of match scenarios in which the sequence stops as soon as a match occurs. Fuzzy matching algorithms attempt to match two strings, typically by minimizing a cost or distance function. Each of these approaches has disadvantages. Thus, there is a continued need for string matching analysis approaches.
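The two prior approaches can be sketched roughly as follows. This is a minimal illustration, not any particular prior system. The three cascade rules, the helper names (normalize, RULES, deterministic_match, fuzzy_match), and the use of Python's difflib.SequenceMatcher as the distance function are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(s: str) -> str:
    # Lowercase, drop punctuation, and sort tokens so word order is ignored.
    return " ".join(sorted(re.findall(r"\w+", s.lower())))


# Hypothetical cascade of match scenarios, ordered from strictest to loosest.
RULES = [
    ("exact", lambda a, b: a == b),
    ("case-insensitive", lambda a, b: a.lower() == b.lower()),
    ("normalized tokens", lambda a, b: normalize(a) == normalize(b)),
]


def deterministic_match(source, targets):
    """Try each rule in order and stop at the first rule that matches."""
    for name, rule in RULES:
        for target in targets:
            if rule(source, target):
                return target, name
    return None, None  # no scenario in the cascade matched


def fuzzy_match(source, targets):
    """Return the target that minimizes an assumed distance function."""
    def distance(a, b):
        # 1 - similarity ratio, so 0.0 means identical (case ignored).
        return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return min(targets, key=lambda t: distance(source, t))


targets = ["GM, Chevrolet Div", "General Motors, Chevy Div"]

# The cascade finds nothing: no rule equates "Division" with "Div".
print(deterministic_match("Chevrolet Division, General Motors", targets))

# The fuzzy matcher always returns its closest candidate, right or wrong.
print(fuzzy_match("Chevrolet Division, General Motors", targets))
```

In this sketch, the deterministic cascade matches only the variations its rules anticipate, while the fuzzy matcher returns a nearest candidate with no indication of whether that candidate is actually equivalent.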