This specification relates to measuring the similarity of objects stored in datasets.
Heterogeneous datasets are datasets from several sources, each storing data representing objects. Each dataset includes, for each object, an object identifier that identifies the object, a context value that identifies a context of the object, and a set of feature values that identify features of the object. The number of features and feature values often differs between datasets, and between objects within a single dataset. Examples of such datasets are inventory catalog data from merchants, patient record data from hospitals, and technical paper data from publishers. For example, for inventory catalog data, an object identifier identifies a particular merchandise item, a context value identifies a particular vendor, and the set of feature values are words and numbers that describe the merchandise item.
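As an illustration only, a record in such a dataset might be sketched as follows. The class name ObjectRecord, its field names, and the sample values are hypothetical and are not drawn from any particular dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectRecord:
    """One object in a dataset (hypothetical layout for illustration)."""
    object_id: str   # identifies the object, e.g., a merchandise item
    context: str     # identifies the context, e.g., a vendor
    features: List[str] = field(default_factory=list)  # descriptive words and numbers


# Records may carry different numbers of feature values,
# both across datasets and within one dataset.
catalog = [
    ObjectRecord("item-1", "vendor-a", ["blue", "denim", "jeans", "32"]),
    ObjectRecord("item-2", "vendor-a", ["toaster"]),
]
```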
Heterogeneous datasets are often integrated for data management, searching, and archiving operations. A common step in integrating heterogeneous datasets is determining a mapping between objects from one dataset and objects from another dataset. This step is often referred to as record linkage, matching, and/or de-duping. One useful matching strategy is to use a threshold similarity function that generates a similarity score from the feature values and identifies objects as identical if the similarity score exceeds a threshold value.
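The threshold matching strategy can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names, the default threshold of 0.5, and the sample data are assumptions, and a simple Jaccard overlap of feature values stands in for whatever similarity function is actually used.

```python
def jaccard_similarity(features_a, features_b):
    """Stand-in similarity: shared feature values divided by all feature values."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_objects(dataset_a, dataset_b, threshold=0.5):
    """Return (id_a, id_b, score) for every pair whose score exceeds the threshold.

    Each dataset is a dict mapping an object identifier to its feature values.
    """
    matches = []
    for id_a, feats_a in dataset_a.items():
        for id_b, feats_b in dataset_b.items():
            score = jaccard_similarity(feats_a, feats_b)
            if score > threshold:
                matches.append((id_a, id_b, score))
    return matches


# Hypothetical sample data from two sources.
source_a = {"a1": ["red", "cotton", "shirt"]}
source_b = {
    "b1": ["red", "cotton", "shirt", "large"],
    "b2": ["blue", "denim", "jeans"],
}
linked = match_objects(source_a, source_b)
```

With this sample data, only the pair (a1, b1) exceeds the threshold. The pairwise loop is quadratic in the number of objects and is meant only to show the thresholding step.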
One widely used similarity function is term frequency-inverse document frequency (TF-IDF) similarity. This similarity function identifies objects as similar if they are associated with a sufficient number of identical “terms”. TF-IDF processing works well in many situations, and the resulting statistics can be stored in compact form. TF-IDF processing also facilitates parallelization, and thus can be efficiently scaled. Other similarity processes that are used include edit distance processes, Jaccard distance processes, and token-based processes.
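One common formulation of TF-IDF similarity weights each term by its count in an object multiplied by the logarithm of the inverse fraction of objects containing the term, and then compares objects by the cosine of their weight vectors. The sketch below assumes that formulation; other weighting variants exist, and the function names and sample terms are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each object identifier to a {term: tf * idf} weight dictionary.

    docs maps an object identifier to its list of terms.
    """
    n = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per object
    vectors = {}
    for obj_id, terms in docs.items():
        term_freq = Counter(terms)
        vectors[obj_id] = {
            t: count * math.log(n / doc_freq[t]) for t, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample objects.
docs = {
    "x": ["red", "shirt"],
    "y": ["red", "shirt", "large"],
    "z": ["blue", "jeans"],
}
vecs = tfidf_vectors(docs)
sim_xy = cosine(vecs["x"], vecs["y"])  # shares two terms, so greater than zero
sim_xz = cosine(vecs["x"], vecs["z"])  # shares no terms, so zero
```

Only the per-term document frequencies need to be retained to score new pairs, which is consistent with the compact statistics noted above.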
However, these processes do not take into account the context of the objects. In some situations, this can skew similarity measures. One example situation is when a particular context in a dataset includes many of the same feature values, e.g., the merchant's store name. The store name is not highly indicative of object similarity, as the merchant may sell a number of different products. However, the presence of the store name as a feature in the dataset for many objects in a particular context can increase the similarity measure for any two of those objects.
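The skew can be seen in a small worked example. The sketch below uses a Jaccard overlap as a stand-in similarity function, and the store name tokens and product terms are hypothetical. It compares the score of two unrelated products from the same merchant with and without the shared store name tokens, purely to show how much of the score the store name contributes.

```python
def jaccard(a, b):
    """Shared feature values divided by all feature values."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical store name that appears as feature values on every
# object in one merchant's context.
STORE_NAME = ["acme", "outlet", "store"]

jeans = STORE_NAME + ["blue", "jeans"]
toaster = STORE_NAME + ["toaster"]

# Score computed over all feature values, store name included.
with_store_name = jaccard(jeans, toaster)  # 3 shared / 6 total = 0.5

# Score computed over the product-specific feature values only.
without_store_name = jaccard(
    jeans[len(STORE_NAME):], toaster[len(STORE_NAME):]
)  # 0 shared / 3 total = 0.0
```

Here the two products have nothing in common apart from the merchant, yet the context-wide store name alone lifts their score from 0.0 to 0.5.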