Different organizations, or even business entities within the same umbrella organization, may store data representing similar record topics (e.g., customers, products, vendors) in varying forms. Where the data stores are not connected (e.g., no unique key matches data records between the individual data stores), the data representations must be matched between the disparate storage models before the pooled collection of data representing these record topics can be analyzed. In one particular example, the company name for the same corporation may be represented as “YHW Corp” in one set of data and as “Yellow House Wares Corporation” in another set of data.
The inventors recognized a need to use advanced matching models to support merging of inconsistent data records. Their solution involves n-field comparison and complex content analysis. In one aspect, the matching solution utilizes a model that incorporates many fields in the merging process via a combinatorial function, thereby improving the probability of a correct match. The matching solution further generates a compatibility index for ranking candidate data record matches so that the highest-ranked match (the one with the highest matching confidence) is selected.
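As a rough illustration of the ranking idea described above, the following sketch scores each candidate record against a source record across several fields and sorts the candidates by the resulting index. The text does not specify the combinatorial function, so a weighted average of per-field string similarities stands in for it here. The field names, the weights, and the function names `field_similarity`, `compatibility_index`, and `rank_candidates` are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical fields and weights; the source does not define them.
FIELD_WEIGHTS = {"name": 0.5, "city": 0.2, "postal_code": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two field values, ignoring case."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def compatibility_index(record: dict, candidate: dict) -> float:
    """Weighted average of per-field similarities over fields present in both.

    This is a stand-in for the combinatorial function, not the actual one.
    """
    total = 0.0
    weight_sum = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if record.get(field) and candidate.get(field):
            total += weight * field_similarity(record[field], candidate[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def rank_candidates(record: dict, candidates: list) -> list:
    """Return (index, candidate) pairs sorted from highest to lowest index."""
    scored = [(compatibility_index(record, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative records only; the values are invented for the example.
record = {"name": "YHW Corp", "city": "Austin", "postal_code": "78701"}
candidates = [
    {"name": "Yellow House Wares Corporation", "city": "Austin", "postal_code": "78701"},
    {"name": "Blue Barn Supply", "city": "Denver", "postal_code": "80202"},
]
for score, candidate in rank_candidates(record, candidates):
    print(f"{score:.3f}  {candidate['name']}")
```

Using several fields means that a weak name similarity (an abbreviation against a full name) can be offset by agreement on other fields, which is the motivation the passage gives for n-field comparison.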
In another aspect, the inventors' solution utilizes a Bayesian classifier to apply conditional probability to disparate field information (e.g., inconsistent product description data, inconsistent organization naming conventions, etc.) so that a given candidate pair of field information is classified as more similar to either a match or a non-match. The inventors' combined application of mathematical and machine learning matching algorithms is a significant departure from traditional, more restrictive rules engine methodologies.
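The classification step described above can be sketched with a small naive Bayes classifier. The text names only "a Bayesian classifier", so the naive independence assumption, the three agreement levels, the Laplace smoothing, the class name `NaiveBayesPairClassifier`, and the toy training pairs below are all assumptions chosen for illustration, not the inventors' implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesPairClassifier:
    """Naive Bayes over discrete per-field agreement levels (hypothetical design)."""

    def __init__(self, levels=("agree", "partial", "disagree"), alpha=1.0):
        self.levels = levels
        self.alpha = alpha  # Laplace smoothing constant
        self.class_counts = Counter()
        # counts[label][field_index][level] -> number of training pairs
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def fit(self, feature_rows, labels):
        """Count how often each agreement level occurs per field and class."""
        for row, label in zip(feature_rows, labels):
            self.class_counts[label] += 1
            for i, level in enumerate(row):
                self.counts[label][i][level] += 1
        return self

    def log_scores(self, row):
        """Return the unnormalized log posterior of each class for one pair."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for i, level in enumerate(row):
                num = self.counts[label][i][level] + self.alpha
                den = n + self.alpha * len(self.levels)
                score += math.log(num / den)  # log conditional likelihood
            scores[label] = score
        return scores

    def predict(self, row):
        """Return the class with the highest posterior score."""
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy labeled pairs: (name agreement, description agreement). Invented data.
rows = [
    ("agree", "agree"), ("partial", "agree"), ("agree", "partial"),
    ("disagree", "disagree"), ("disagree", "partial"), ("partial", "disagree"),
]
labels = ["match", "match", "match", "non-match", "non-match", "non-match"]

clf = NaiveBayesPairClassifier().fit(rows, labels)
print(clf.predict(("agree", "agree")))
print(clf.predict(("disagree", "disagree")))
```

In practice the agreement levels would be derived from the field values themselves (for example, by thresholding a string similarity), and the labeled pairs would come from records whose match status is already known.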
In a further aspect, the inventors recognized a need for developing algorithms to test the accuracy and viability of the new matching algorithms.