1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer systems and software. More particularly, but not by way of limitation, one or more embodiments of the invention enable a method and apparatus for matching non-normalized data values to determine whether two or more data items are related in accordance with configurable criteria, for merging data items, and for learning which match criteria settings are appropriate based on previous user input or results.
2. Description of the Related Art
Matching or searching non-normalized records in a database or between multiple databases is error prone and inefficient. For instance, when a given search string is matched against a non-normalized field of a database, many records that should match fail to match. Entries that represent the same item but have different formatting or irrelevant characters fail to match. Thus the amount of time required to find a match can be excessive. This is particularly the case if all permutations of the match are utilized in the search process. Values such as “123-x” do not match “123:x”, for example, although they may represent the same item. Other matches that fail include, for example, a match on “X.Y.Z. Corp” against “XYZ Inc.” The related art fails to ignore “noise” characters and words when attempting to match items and in general does not match items in databases using fuzzy patterns or relevancy.
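By way of illustration only, the matching failures described above can be reproduced in a short Python sketch. The function names `exact_match` and `strip_noise` and the `NOISE_WORDS` list are hypothetical and are chosen solely for this example. They do not represent any particular related-art product or any embodiment of the invention. The sketch assumes that punctuation and a small set of company suffixes count as noise.

```python
import re

# Hypothetical noise words, assumed only for this illustration.
NOISE_WORDS = {"corp", "inc", "co", "ltd", "llc"}


def exact_match(a: str, b: str) -> bool:
    """Literal comparison of two values with no normalization."""
    return a == b


def strip_noise(value: str) -> str:
    """Lowercase the value, drop punctuation, and drop noise words."""
    tokens = []
    # Split on whitespace first so that "X.Y.Z." remains a single token.
    for raw in value.lower().split():
        token = re.sub(r"[^a-z0-9]", "", raw)
        if token and token not in NOISE_WORDS:
            tokens.append(token)
    return "".join(tokens)


# Literal comparison fails on values that may represent the same item.
print(exact_match("123-x", "123:x"))           # False
print(exact_match("X.Y.Z. Corp", "XYZ Inc."))  # False

# After the assumed noise removal, the same pairs compare equal.
print(strip_noise("123-x") == strip_noise("123:x"))           # True
print(strip_noise("X.Y.Z. Corp") == strip_noise("XYZ Inc."))  # True
```

A fixed noise list of this kind is brittle, since it must be maintained by hand and cannot rank partial matches by relevancy.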
Historically, matching has been performed as part of an inbound cleansing process. Generally, known products do not attempt to de-duplicate data that has already been cleansed during import. Over time, as human error occurs during data entry, duplicates begin to creep into the database. Keeping data consistent across multiple distributed enterprise-wide computer systems is non-trivial. Establishing effective communication links between heterogeneous systems is the first step toward making the data consistent. However, simply allowing all computer systems within an organization to communicate does not solve the problem. Even when data is shared throughout an enterprise, problems still arise since data may exist in different forms in different locations within the enterprise. Since the goal of absolutely accurate data is elusive, it is common for companies to maintain data in independent computer systems. For example, because of the difficulties associated with identifying and matching similar data, some companies maintain data for each corporate division in independent computational zones and only utilize such data within a division to make a business decision associated with that particular division. It is common after one company acquires another company for the computer systems of each company to remain autonomous. Thus, the possibility of identifying and matching common data items within each repository is generally very low.
To solve the problem of having duplicate data, albeit in slightly different form, businesses attempt to identify similar data and integrate the data in a way that ensures the data remains consistent. Performing the integration is difficult and breaks down when new corporate computer systems are added through acquisition or when changes in business systems and software occur. One method that is used by some organizations is to maintain “master data”. Master data may be, for example, an organization's ideal form of a data item. Solutions for keeping the data consistent throughout the organization, i.e., propagating master data throughout the organization, are generally non-robust, brute-force communication schemes that do not allow new data entries to be matched against existing data items to effectuate data consolidation at data entry time.
The inability to keep master data items consistent harms an organization's ability to leverage its assets and lower the cost of doing business. All areas of a business are affected by the inability to keep data as accurate as possible. In summary, existing computer systems and methods lack effective mechanisms for performing data matching in a way that allows the system to learn when data matches are appropriate. For example, existing systems and methods do not have an ability to learn and consolidate two data items that originally were thought to be independent, but which have been matched above a threshold. The ability to learn which patterns in data are actually indicative of a match between two data items is not found in existing enterprise computing solutions.
Because of the limitations described above, there is a need for a method and apparatus for matching non-normalized data values.