Many consumers want to research the background of childcare providers, coaches, healthcare workers and home contractors. Many of us want to reconnect with family and friends. Business users often have good reason to check and monitor the background of potential employees and ensure they are the best person for the job. Criminal checks can enable consumers to make informed choices about the people they trust by delivering appropriate and accurate public records information. Identity verification can assist businesses and consumers to confirm that someone is who they say they are.
Given enough time to search, you may be able to discover if your prospective employee has been charged with DUI, if your son's soccer coach has ever been accused of domestic violence, and where your old college roommate or high school classmate is now living. But a significant challenge is to be sure the information you are seeing pertains to the right person or corporation. We are all concerned about detection and prevention of identity theft, America's fastest-growing crime. Many of us once carried our social security cards in our wallets, but the risk of identity theft has cautioned us to now reveal our social security number to no one other than our bank, our employer and our taxing authority on a confidential need-to-know basis. Without a unique national identification number that could serve as an index into a variety of records from many different sources, it is challenging to accurately link such records without making mistakes that could provide incorrect information, damage someone's reputation or destroy confidence in the information source.
FIG. 1 shows an example of the scope of the problem. There are over 300 million people living in the U.S. alone, and billions of records pertaining to all those people. Take the example of trying to find out accurate personal information about Jim Adler, age 68 of Houston Tex. By analyzing available records, it may be possible to find 213 records pertaining to “Jim Adler” but it turns out that those 213 records may pertain to 37 different Jim Adlers living all over the country. It is desirable to link available records to determine which ones pertain to the right Jim Adler as opposed to Jim Adler age 57 of McKinney Tex., or Jim Adler age 32 of Hastings Nebr., or Jim Adler age 48 of Denver Colo. or any of the other 33 Jim Adlers for whom there are records. It is further desirable to avoid incorrectly linking Jim Adler the First Selectman of Canaan N.H. with Jim Adler serving time in the Forrest City Arkansas Federal Correctional Institute.
Some have spent significant effort to build comprehensive databases that link related records to provide background, criminal, identity and other checks for use by employers and consumers. When consolidating information from multiple data sources, it is often desirable to create an error-free database through locating and merging duplicate records belonging to the same entity. These duplicate records could have many deleterious effects, such as preventing discoveries of important regularities, and erroneously inflating estimates of the number of entities. Unfortunately, this cleaning operation is frequently quite challenging due to the lack of a universal identifier that would safely but uniquely identify each entity.
The study of quickly and accurately identifying duplicates from one or more data sources is generally known as Record Linkage (“RL”). Synonyms in the database community include record matching, merge-purge, duplicate detection, and reference reconciliation. RL has been successfully applied in census databases, biomedical databases, and web applications such as reference disambiguation of the scholarly digital library CiteSeerX and online comparison shopping.
One example non-limiting general approach to record linkage is to first estimate the similarity between corresponding fields to reduce or eliminate the confusion brought by typographical errors or abbreviations. A straightforward implementation of a similarity function could be based on edit distance, such as the Levenshtein distance. After that, a strategy for combining these similarity estimates across multiple fields between two records is applied to determine whether the two records are a match or not. The strategy could be rule-based, which generally relies on domain knowledge or on generic distance metrics to match records. However, a common practice is to use Machine Learning (ML) techniques that treat the similarities across multiple fields as a vector of features and “learn” how to map that vector into a binary match/unmatch decision. ML techniques that have been tried for RL include Support Vector Machines, decision trees, maximum entropy, and composite ML classifiers tied together by boosting or bagging.
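By way of non-limiting illustration, the two steps described above (field-level similarity followed by a combining strategy) might be sketched in Python as follows. The function names, the normalization of edit distance to the range [0, 1], and the mean-similarity rule with a 0.8 threshold are assumptions made for this sketch and are not taken from any particular system.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, removes, and replaces that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # remove ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def feature_vector(rec_a: dict, rec_b: dict, fields: list) -> list:
    """One similarity score per field, in the order given by fields."""
    return [similarity(rec_a[f].lower(), rec_b[f].lower()) for f in fields]


def rule_based_match(features: list, threshold: float = 0.8) -> bool:
    """Toy rule (an assumption): match when the mean field similarity clears a threshold."""
    return sum(features) / len(features) >= threshold
```

In an ML-based approach, the list returned by `feature_vector` would instead be passed to a trained classifier, which replaces the hand-written rule.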
Due to the importance of feature representation, similarity function design is at the core of many record linkage studies. As noted above, perhaps the most straightforward one is the Levenshtein distance, which counts the number of insert, remove, and replace operations needed to map string A into string B. Because the different operations may carry unequal costs in practice, it is possible to modify the definition of edit distance to explicitly allow for cost customization by designers. In recent years, similarity function design has increasingly focused on adaptive methods. Some have proposed stochastic models to learn the cost factors of different operations for edit distance. Rooted in the spirit of fuzzy match, some consider the text string at the tuple level and propose a probabilistic model to retrieve the K nearest tuples with respect to an input tuple received in streamed format. Exploiting the similarity relation hidden under a big umbrella of linked pairs, some have iteratively extracted useful information from the pairs to progressively refine a set of similarity functions. Others introduce similarity functions from probabilistic information retrieval and empirically study their accuracy for record linkage.
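As a non-limiting sketch of the cost customization mentioned above, the edit distance can be parameterized by a separate cost for each operation. The parameter names and the default costs of 1.0 are assumptions for illustration; with the defaults the function reduces to the ordinary Levenshtein distance.

```python
def weighted_edit_distance(a: str, b: str,
                           insert_cost: float = 1.0,
                           remove_cost: float = 1.0,
                           replace_cost: float = 1.0) -> float:
    """Minimum total cost of the insert/remove/replace operations mapping a into b."""
    # prev[j] is the cost of building the first j characters of b from an empty prefix of a.
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * remove_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + remove_cost,                            # remove ca
                curr[j - 1] + insert_cost,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else replace_cost) # replace or match
            ))
        prev = curr
    return prev[-1]
```

For example, a designer who considers trailing insertions cheap (as in "Jim" versus "Jimmy") could lower `insert_cost`, while a high `replace_cost` makes the function prefer a remove followed by an insert over a direct replacement.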
Whatever ingenious methods may be used for similarity functions, it is desirable to integrate all of these field-level similarity judgments into an overall match/no-match decision. Various learning methods have been proposed for this task. Some have proposed stacked SVMs to learn and classify pairs of records into match/unmatch, in which the second layer of SVMs is trained on a vector of similarity values that are output by the first layer of SVMs. Others consider the records in a database as nodes of a graph, and apply a clustering approach to divide the graph into an adaptively determined number of subsets, in which inconsistencies among paired records are expected to be minimized. Some instead consider features of records as nodes of a graph. Matched records would excite links connecting corresponding fields, which could be used to facilitate other record comparisons.
A well-performing pairwise classifier depends on the representativeness of the record pairs selected for training, which calls for an active learning approach to efficiently pick informative paired records from a data pool. Some have described committee-based active learning approaches for record linkage. Considering the efficiency concern of applying an active learning model on a data pool with quadratically large size, others propose a scalable active learning method that is integrated with blocking to alleviate this dilemma.
Although accurate estimation of duplicates in databases is important, insufficient attention has been given to tailoring ML techniques to optimize the performance of industrial RL systems.
Example illustrative non-limiting embodiments herein provide cost-sensitive extensions of the Alternating Decision Tree (ADTree) algorithm to address these and other problems. Cost-Sensitive ADTrees (CS-ADTree) improve the ADTree algorithm, which is well-suited to handle business requirements to deploy a system with extremely different minimum false-positive and false-negative error rates. One exemplary illustrative method assigns biased misclassification costs for positive class examples and negative class examples.
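The idea of biased misclassification costs can be illustrated with a short, non-limiting Python sketch. It is not the CS-ADTree training procedure itself, which this passage does not spell out. It only shows one common way (an assumption here) in which per-class costs might seed the example weights of a boosting-style learner and be used to score a set of predictions. The names `cost_pos` and `cost_neg` are hypothetical.

```python
def initial_weights(labels: list, cost_pos: float, cost_neg: float) -> list:
    """Seed weights proportional to each example's class cost, normalized to sum to 1.

    labels holds +1 for a matched pair (positive class) and -1 for a non-matched pair.
    """
    raw = [cost_pos if y == 1 else cost_neg for y in labels]
    total = sum(raw)
    return [w / total for w in raw]


def total_cost(labels: list, predictions: list, cost_pos: float, cost_neg: float) -> float:
    """Sum the cost of every error.

    A misclassified positive (false negative) costs cost_pos;
    a misclassified negative (false positive) costs cost_neg.
    """
    cost = 0.0
    for y, p in zip(labels, predictions):
        if y != p:
            cost += cost_pos if y == 1 else cost_neg
    return cost
```

Setting `cost_neg` higher than `cost_pos` makes false positives (incorrectly linked records) more expensive than missed matches, which pushes a learner that minimizes this cost toward higher precision.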
Exemplary illustrative non-limiting implementations provide record linkage of databases by ADTree. Because the business costs of misclassifying a matched pair and an unmatched pair can be extremely biased in record linkage, we further propose CS-ADTree, which assigns a higher or lower misclassification cost for matched pairs than for non-matched pairs in the process of training ADTree. Experiments show CS-ADTree and ADTree perform extremely well on a clean database and exhibit superior performance on a noisy database compared with alternative ML techniques. We also demonstrate how the run-time representation of ADTree/CS-ADTree can facilitate human understanding of the knowledge learned by the classifier and yield a compact and efficient run-time classifier.
Because ADTrees output a single tree with short, easy-to-read rules, the exemplary illustrative non-limiting technologies herein can effectively explain their decisions, even to non-technical users, using simple score aggregation and/or tree visualization. Even for very large models with hundreds of features, score aggregation can be straightforwardly applied to perform feature blame assignment, i.e., to consistently calculate the contribution of each feature to the final score of any decision. Improved understanding of these models can lead to faster debugging and feature development cycles.
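Score aggregation and feature blame assignment can be sketched, in a non-limiting way, as follows. An ADTree scores an instance by summing the values of every prediction node it reaches; grouping those values by the feature tested yields a per-feature contribution. The `Splitter` class, the feature names, the thresholds, and all numeric values below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Splitter:
    """A decision node: tests one feature and adds a score for either outcome."""
    feature: str
    threshold: float
    score_if_true: float
    score_if_false: float
    children_if_true: list = field(default_factory=list)
    children_if_false: list = field(default_factory=list)


def score_with_blame(root_score: float, splitters: list, features: dict):
    """Return (total score, per-feature contributions) for one record pair."""
    total = root_score
    blame = {}
    stack = list(splitters)
    while stack:
        node = stack.pop()
        passed = features[node.feature] >= node.threshold
        contribution = node.score_if_true if passed else node.score_if_false
        total += contribution
        blame[node.feature] = blame.get(node.feature, 0.0) + contribution
        # Only the branch actually taken is explored, so features tested
        # solely on the other branch are never evaluated.
        stack.extend(node.children_if_true if passed else node.children_if_false)
    return total, blame


# Hypothetical three-rule tree: the date-of-birth test applies only when names agree.
example_tree = [
    Splitter("name_sim", 0.9, 1.5, -2.0,
             children_if_true=[Splitter("dob_sim", 0.9, 0.75, -0.25)]),
    Splitter("city_sim", 0.8, 0.5, -0.5),
]
example_score, example_blame = score_with_blame(
    -0.5, example_tree, {"name_sim": 1.0, "dob_sim": 0.5, "city_sim": 0.85})
```

In this sketch the sign of the total score gives the match/no-match decision and its magnitude can serve as a confidence measure, while the blame dictionary shows which features pushed the score up or down.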
Other non-limiting features and advantages include:
Human-understandability
Confidence measure
Runtime efficiency from lazy evaluation
Capture intuitive feature interactions
Very competitive F-Measure
Bake preference for precision into algorithm
Better recall at high levels of precision
Other.