Record linking, also known as data matching, is the task of finding records that refer to the same object across data sets. It is a common data management problem that arises whenever two data sets need to be combined or joined. The objects are the real-world artifacts the records refer to, and the fields of a record are the attributes of the object. It is often the case that there is no common identifier between the two data sets. In this case, the corresponding fields of the two data sets must first be matched to one another, and then, for each pair of records formed by taking one record from each data set, the data in each field are compared to determine whether the two records refer to the same object or entity.
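By way of illustration only, the pairwise comparison described above can be sketched as follows. The field names, the `FIELD_MAP` mapping, and the use of Python's `difflib.SequenceMatcher` as the comparison measure are assumptions made for this example and are not drawn from any particular record linking system.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical field mapping: field name in data set A -> field name in data set B.
FIELD_MAP = {"name": "full_name", "address": "addr"}


def field_similarity(a, b):
    """Return a similarity in [0, 1] for two field values, ignoring case."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def compare_pairs(records_a, records_b, field_map=FIELD_MAP):
    """Yield (index_a, index_b, scores) for every pair of records across the two sets.

    `scores` maps each field of data set A to the similarity between that field
    and its mapped counterpart in data set B.
    """
    for (i, rec_a), (j, rec_b) in product(enumerate(records_a), enumerate(records_b)):
        scores = {
            field_a: field_similarity(rec_a[field_a], rec_b[field_b])
            for field_a, field_b in field_map.items()
        }
        yield i, j, scores
```

Because every record in one data set is compared with every record in the other, the number of comparisons grows with the product of the two data set sizes. The per-field scores are left unaggregated here so that a later step can decide how to weigh them.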
It is understood that the objects being matched have common, generally accepted attributes. It is common in data management to refer to a category of objects and their attributes as a domain. Record linking is applied to a number of domains. For instance, people is a very common domain for record linking. For the people domain, the data sets are composed of records that have name and address information. These records can, for example, refer to customers at a financial institution or patients at a hospital. Another common domain for record linking is product information. In this domain, the data sets can contain the name, manufacturer, and descriptions of products.
A number of systems have been created that use machine learning, including statistical, probabilistic, and other methods, to determine when two records in a pair of data sets match each other. These systems rely on generally accepted ways of comparing data, the results of which are then used by the machine learning algorithm. They require a set of training data in which a human user has labeled pairs of records as matching or not matching. The training data is used to train the machine learning algorithm so that it can automatically determine whether the remaining pairs of records under consideration are matches.
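As a minimal sketch of how labeled pairs can train such a system, the example below fits a logistic regression model to per-field similarity scores. Logistic regression is chosen here only for illustration; the foregoing does not name a specific algorithm, and the function names, learning rate, and epoch count are all assumptions of this example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_matcher(features, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression weights by per-example gradient descent.

    features: list of similarity vectors, one per labeled record pair.
    labels:   1 if a human marked the pair as matching, 0 otherwise.
    Returns (weights, bias).
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            predicted = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = predicted - y
            # Move each weight against the gradient of the log loss.
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def match_probability(weights, bias, x):
    """Return the model's estimated probability that a pair is a match."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
```

Once trained, the model can score unlabeled pairs, and a pair whose probability exceeds a chosen threshold, for example 0.5, can be treated as a match. The quality of the result depends on how many labeled pairs are available, which is the difficulty discussed next.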
The biggest problem that these systems face is the creation of training data. It is time-consuming for a person to review pairs of records from two data sets to determine whether they match. It is a generally held belief in the machine learning community that the larger the set of training data, the better the system will work. More training data means that the machine learning algorithm, whichever one is used, will be better trained and better able to match records automatically.
It is almost always the case that the data being matched in record linking systems is either private or sensitive information. For each domain, there is usually a requirement to keep the information in the data sets private, whether for regulatory and legal reasons, or for competitive reasons.
Thus, when a company performs record linking on its data in a domain, there is no way to utilize the training data that another company has created in the same domain. There is no way to share training data with other companies in a way that adheres to regulatory, legal, and competitive requirements.
As such, considering the foregoing, it may be appreciated that there continues to be a need for novel and improved devices and methods for performing record linking, while allowing users to share training data in a secure manner.