1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer systems and software. More particularly, but not by way of limitation, one or more embodiments of the invention enable systems and methods for matching similar master data objects using associated behavioral data, for example transactional data.
2. Description of the Related Art
Businesses utilize data repositories to make business decisions. The data repositories house objects such as companies, customers, persons, products and other entities. Business decisions are best made using data that is as accurate as possible. Problems arise when data repositories contain duplicates that exist as slightly different objects but should be merged into one object. Minimizing data duplication across multiple distributed enterprise-wide computer systems is difficult. Within businesses that house data in multiple data repositories, an object stored in one data repository may appear in a slightly different form in another data repository. Hence, businesses attempt to merge data duplicates and utilize unified versions of business objects known as master data. Failing to keep master data objects consistent lowers the ability of an organization to leverage its data, which in turn hurts profits.
Because of the problems associated with maintaining master objects based on similar data, some companies maintain data for each corporate division in independent computational facilities and databases. This architecture may be maintained after a company acquisition, for example. Hence, business decisions are local to a division, and a business that utilizes multiple data repositories cannot leverage combined buying power to obtain lower prices from common vendors. Conversely, many organizations attempt to merge their data repositories to yield unified data. Solutions that attempt to match similar fields within data records utilize word dictionaries, token-based matching, normalization rules and regular expressions. For example, “Avenue” in one record may be abbreviated as “Ave.” in another record. In such a case a regular expression of “Ave.*” will match both fields. Additionally, “Dick” and “Richard” may be treated as matches for the same data object when both appear in a list of synonyms that represents a name rule. This type of de-duplication relies on a myriad of unreliable data and therefore yields false matches and missed matches. In addition to the relatively low accuracy achieved, there is a high cost of creating and maintaining data dictionaries and rules.
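The rule-based matching described above can be illustrated with a minimal Python sketch. It is not taken from any particular product; the record layout, the function names and the one-entry synonym table are assumptions made for illustration only. The “Ave.*” pattern is the one quoted in the text.

```python
import re

# The pattern quoted in the text; applied to one whitespace-separated token at a time.
AVENUE_RULE = re.compile(r"Ave.*")

# Hypothetical name rule: maps a nickname to its canonical form.
NAME_SYNONYMS = {"dick": "richard"}


def canonical_name(name):
    """Lower-case the name and replace it with its synonym, if one is listed."""
    key = name.strip().lower()
    return NAME_SYNONYMS.get(key, key)


def canonical_address(address):
    """Replace every token matching the avenue rule with 'avenue'."""
    tokens = []
    for token in address.split():
        if AVENUE_RULE.fullmatch(token):
            tokens.append("avenue")
        else:
            tokens.append(token.lower())
    return " ".join(tokens)


def records_match(a, b):
    """Two records match when their canonical name and address agree."""
    return (canonical_name(a["first_name"]) == canonical_name(b["first_name"])
            and canonical_address(a["address"]) == canonical_address(b["address"]))


r1 = {"first_name": "Dick", "address": "12 Park Avenue"}
r2 = {"first_name": "Richard", "address": "12 Park Ave."}
r3 = {"first_name": "Richard", "address": "12 Park Avery"}

print(records_match(r1, r2))  # intended match: "Ave." and "Avenue" unify
print(records_match(r1, r3))  # also reported as a match, although "Avery" is not an avenue
```

The second comparison shows how a broad rule such as “Ave.*” can yield a false match, and every additional abbreviation or nickname requires another dictionary entry or rule to be created and maintained.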
Furthermore, known matching strategies tend to place a high score or weight on the physical “address” field of two objects. If two objects include a large number of shared substrings, such as a street number or apartment number, then it is reasonable to conclude that the two objects are duplicates. Hence, most de-duplication efforts to clean and enrich master data rely heavily on the location associated with an object. The problem with this approach is that many objects are migrating to virtual environments. For example, many customers are utilizing web-based interfaces to access bills and statements. As the world becomes less reliant on physical addresses, the strength of master data de-duplication based on physical address weakens. There are no known solutions that augment de-duplication with behavioral data that is independent of physical address.
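The dependence on the address field can be illustrated with a short Python sketch of a weighted match score. The field names, the weights and the use of the standard-library difflib.SequenceMatcher as the string similarity measure are all assumptions chosen for illustration; the text does not specify how known strategies compute their scores.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the address field dominates the overall score.
WEIGHTS = {"address": 0.7, "name": 0.3}


def similarity(a, b):
    """Similarity in [0, 1] based on shared substrings; 0 if either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(x, y):
    """Weighted sum of the per-field similarities."""
    return sum(weight * similarity(x.get(field, ""), y.get(field, ""))
               for field, weight in WEIGHTS.items())


# The same pair of objects, once with and once without a physical address.
with_address = match_score(
    {"name": "Acme Corp", "address": "500 Main St Apt 4"},
    {"name": "Acme Corporation", "address": "500 Main Street Apt 4"},
)
without_address = match_score(
    {"name": "Acme Corp", "address": ""},
    {"name": "Acme Corporation", "address": ""},
)

print(with_address, without_address)
```

Under these assumed weights, the pair that lacks a physical address can never score above 0.3, so likely duplicates that interact only through a web-based interface would fall below any threshold tuned for address-bearing records.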
In summary, existing computer systems and methods lack effective mechanisms for performing data matching in a way that allows the system to utilize actions associated with the data objects, e.g., transactional data, to determine whether or not the data objects are duplicates. Because of the limitations described above, there is a need for a system and method for matching similar master data using associated behavioral data.