1. Technical Field
Present invention embodiments relate to probabilistic matching of data records, and more specifically, to assigning a known token weight to a token with an undefined weight based on at least one of external reference information and historic statistics to perform probabilistic matching.
2. Discussion of the Related Art
In probabilistic matching systems, the weights for tokens or symbols (e.g., words, phrases, numbers, dates, etc.) are usually generated from the initial input data and are frequency based. The principle is that the higher the frequency, the lower the weight, i.e., a token that occurs frequently is less relevant for record matching because it is common and, therefore, distinguishes records less well than a token with a lower frequency. When building a token weight table, not all of the tokens are added to the weight table. Rather, only the most common tokens in the input dataset are included, and the rarer tokens are assigned a default value, which is the highest value in the weight table, thereby indicating their relative rarity.
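The table-building step described above may be illustrated by the following minimal Python sketch. It is offered only as an illustration under stated assumptions: the function name build_weight_table, the parameter max_entries, the sample token counts, and the logarithmic weight formula are all hypothetical and are not taken from any particular matching system. The sketch shows only the general behavior, namely that more frequent tokens receive lower weights, only the most common tokens are stored, and the default weight equals the highest stored weight.

```python
import math
from collections import Counter


def build_weight_table(tokens, max_entries):
    """Return (weight_table, default_weight) for a list of tokens.

    Assumption: weight = round(100 * log10(total / count)). The formula is
    illustrative only; it is chosen so that rarer tokens get larger weights.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    table = {}
    # Keep only the most common tokens; rarer tokens are left out of the table.
    for token, count in counts.most_common(max_entries):
        table[token] = round(100 * math.log10(total / count))
    # The default weight for absent tokens is the highest weight in the table.
    default_weight = max(table.values())
    return table, default_weight


# Hypothetical initial data load: KYLEE is rare and does not make the table.
names = ["JOHN"] * 50 + ["ALICE"] * 30 + ["ALINE"] * 15 + ["KYLEE"] * 5
table, default_weight = build_weight_table(names, max_entries=3)
```

In this sketch, KYLEE is omitted from the table because it is the least common token, so any later lookup of KYLEE would fall back to the default weight.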
When matching data records (e.g., in a data records management system), this default (highest) weight may cause an incorrect record match, or false positive, when new records are added or existing records are updated with tokens that do not exist in the weight table. For example, a weight table is generated based on the initial data load (e.g., from a database of existing records). This weight table may contain name-string values (e.g., Bob Jones, John Smith, etc.) with their weights based on a calculation. For example, the token name Alice may have a relative weight of 279, denoted as ALICE|279, and the token Aline may be assigned a weight of 323, denoted as ALINE|323, while the default weight is assigned a value of 672. However, the weight table does not contain a weight value for “Kylee”, which may be because the initial data load either does not contain Kylee or contains Kylee only rarely. Thus, if Kylee shows up later in a new or updated record, it will receive the default weight of 672. With this high weight, records containing Kylee are more likely to be matched, even when the matched record is not relevant.
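The effect of the default weight in the foregoing example may be illustrated by the following minimal Python sketch. The weights for ALICE and ALINE and the default weight of 672 are taken from the example above; the function names token_weight and agreement_score, and the scoring rule of summing the weights of shared tokens, are hypothetical simplifications and do not represent an actual scoring method.

```python
# Weights from the example above; any token not listed gets the default.
WEIGHT_TABLE = {"ALICE": 279, "ALINE": 323}
DEFAULT_WEIGHT = 672


def token_weight(token):
    """Look up a token's weight, falling back to the default weight."""
    return WEIGHT_TABLE.get(token.upper(), DEFAULT_WEIGHT)


def agreement_score(name_a, name_b):
    """Hypothetical scoring rule: sum the weights of tokens shared by both names."""
    shared = set(name_a.upper().split()) & set(name_b.upper().split())
    return sum(token_weight(token) for token in shared)


# Both pairs agree only on the first name and differ on the last name.
alice_score = agreement_score("Alice Jones", "Alice Smith")
kylee_score = agreement_score("Kylee Jones", "Kylee Smith")
```

Under these assumptions, the pair sharing only Kylee scores 672 while the pair sharing only Alice scores 279, so two unrelated Kylee records look far more alike than two unrelated Alice records solely because Kylee is absent from the weight table.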
Accordingly, traditional approaches to record matching may result in errors that lead to false record matches even when a correct match exists.