Record matching, also referred to as “data matching,” “record linkage,” or “special purpose grouping,” generally relates to the task of finding database records stored in a data warehouse that refer to the same individual or entity. These database records may come from different data sources (e.g., different entities supplying records, different types of records supplied, etc.), or may be variations within a data source (e.g., different data entry protocols, different data cleansing protocols, etc.).
Data warehouses are used in a wide range of applications to store large volumes of data records. For example, data warehouses can be used to store large volumes of credit card user data, credit score data, education data, healthcare data, business credential data, or data for any other application that may utilize record matching. The data records stored in the data warehouses may include a number of attributes that can be used to match the data record with a specific entity or individual.
Frequently, a data warehouse will receive new data from one or more sources. When new data is received, it needs to be merged into the database. If the new data received is not associated with any entity or individual that has a record in the database, then the new data will be added into the database as a new record. If the new data is associated with an entity or individual that already has one or more records stored in the data warehouse, then the new data should be associated with the existing record or records for that individual. This is the role of record matching.
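The merge decision described above can be illustrated with a minimal Python sketch. The `Warehouse` and `EntityRecord` classes, the `name` and `dob` attributes, and the use of an exact key lookup to decide whether an entity already exists are all assumptions made for illustration; they are not taken from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    """All source records believed to refer to one entity (hypothetical type)."""
    entity_id: int
    source_records: list = field(default_factory=list)


class Warehouse:
    """Toy in-memory stand-in for a data warehouse (hypothetical)."""

    def __init__(self):
        self._by_key = {}
        self._next_id = 1

    def merge(self, incoming: dict) -> int:
        # Assumption: a normalized name plus date of birth identifies an entity.
        key = (incoming["name"].strip().lower(), incoming["dob"])
        entity = self._by_key.get(key)
        if entity is None:
            # No existing record for this entity: add a new record.
            entity = EntityRecord(self._next_id)
            self._next_id += 1
            self._by_key[key] = entity
        # In either case, associate the incoming data with the entity.
        entity.source_records.append(incoming)
        return entity.entity_id


warehouse = Warehouse()
first = warehouse.merge({"name": "Ann Lee", "dob": "1990-01-02"})
second = warehouse.merge({"name": " ann lee ", "dob": "1990-01-02"})
third = warehouse.merge({"name": "Bob Roy", "dob": "1985-05-05"})
```

In this sketch the first two calls return the same entity identifier, while the third creates a new record. Note that the original incoming dictionaries are stored unmodified; only the lookup key is normalized.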
Presently, record matching is generally performed in one of two ways. In the first, data is cleansed when it arrives, and a clean copy of the data is stored in a data warehouse with a golden record identified. A golden record is the cleanest copy of the merged information of the data set. As further data arrives, that data is also cleansed and then matched against the stored data using predefined algorithms. These algorithms can include exact matching algorithms, Jaro-Winkler algorithms, or distance measuring algorithms.
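As an illustration of one of the algorithms named above, the following Python sketch computes the Jaro-Winkler similarity of two strings. It follows the commonly published definition; the default prefix scale of 0.1 and the maximum prefix length of 4 are the conventional values and are assumed here, as are the function and parameter names.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transposed.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


score = jaro_winkler("MARTHA", "MARHTA")
```

For the pair "MARTHA" and "MARHTA", this yields a score of approximately 0.961. A matching service would typically compare such a score against a threshold, the choice of which is application specific.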
The first option has certain disadvantages. It requires significant data manipulation, because the data must be cleansed and then updated or merged into the database. This is problematic because the data that must be manipulated may be owned by another entity. In this case, a matching service may not have permission to manipulate the data, or may even be prohibited by law from manipulating the data. Even if data manipulation were permitted, issues regarding data integrity may arise, for example ensuring that no important data is lost during the manipulation.
The second option is to perform matching of several elements of the data and, depending on the results, match additional elements. This option involves comparing a number of elements to the entire database of records, which may include hundreds of millions of records. This technique is computationally intensive and requires significant processing power and time. Though it works well for matching a single record, it becomes time-consuming and costly when large amounts of data must be matched to a large data set.
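The cost of this second option can be seen in a minimal Python sketch of staged matching by full scan. The attribute names in `FIRST_PASS` and `SECOND_PASS`, the `id` field, and the use of exact equality at each stage are assumptions made for illustration only.

```python
# Hypothetical attribute groups: a few elements are compared first, and the
# remaining elements are compared only when the first group agrees.
FIRST_PASS = ("last_name", "dob")
SECOND_PASS = ("first_name", "postal_code")


def fields_equal(a: dict, b: dict, fields) -> bool:
    return all(a[f] == b[f] for f in fields)


def staged_match(incoming: dict, records: list):
    """Return (ids of matching records, number of stored records scanned)."""
    matches = []
    scanned = 0
    for record in records:  # every stored record is visited
        scanned += 1
        if not fields_equal(incoming, record, FIRST_PASS):
            continue
        if fields_equal(incoming, record, SECOND_PASS):
            matches.append(record["id"])
    return matches, scanned


records = [
    {"id": 1, "last_name": "Lee", "dob": "1990-01-02",
     "first_name": "Ann", "postal_code": "10001"},
    {"id": 2, "last_name": "Lee", "dob": "1990-01-02",
     "first_name": "Amy", "postal_code": "10001"},
    {"id": 3, "last_name": "Roy", "dob": "1985-05-05",
     "first_name": "Bob", "postal_code": "60601"},
]
batch = [
    {"last_name": "Lee", "dob": "1990-01-02",
     "first_name": "Amy", "postal_code": "10001"},
    {"last_name": "Roy", "dob": "1985-05-05",
     "first_name": "Bob", "postal_code": "60601"},
]
results = [staged_match(incoming, records) for incoming in batch]
total_scanned = sum(scanned for _, scanned in results)
```

Each incoming record causes every stored record to be scanned, so a batch of B incoming records against N stored records requires B × N record visits. With these assumptions, a batch of 1,000 incoming records matched against 100 million stored records would require on the order of 10^11 record visits.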
Thus, a need exists for a record matching method and system that significantly improves server efficiency for batch record matching, without sacrificing accuracy and without the need to manipulate data records stored in data warehouses.