Databases are commonly used in businesses and organizations to manage information on employees, clients, products, etc. These databases are often custom databases generated by the business or organization or purchased from a database vendor or designer. These databases may manage similar data; however, the data can be presented in different formats. For example, a database may store a U.S. phone number in any of several formats, such as (123) 555-1234, 1-123-555-1234, or 123-555-1234. Furthermore, the databases may manage data in a similar format but with no overlap in the values. For example, a database for employees on the west coast of the U.S. can have different area codes from a database for employees on the east coast of the U.S. The data in the phone fields looks similar, but there is no intersection or overlap in the values of the data.
This variability in data format becomes an issue when databases with dissimilar data formats for similar data are merged. Automatic matching of data in databases based on format or value can be difficult to achieve. For example, a business with an extensive customer database may acquire another company and wish to merge or integrate the two customer databases. To merge or integrate source databases into a target database, the source databases are analyzed on a field-by-field or table-by-table basis and data matching is performed. The goal of data matching is to determine which field in each of the source databases contains, for example, the name of the customer, the phone number of the customer, the fax number, etc., and to match the tables in the source databases on a field-by-field basis.
Data matching determines whether two input datasets or two sequences of data values are similar and quantifies the similarity. One conventional approach for data matching is schema-based data matching, which uses meta-data. Schema-based data matching examines names of fields and names of tables in databases, attempting to match data in fields through the name of the field. In one source, a field for a client phone number may be named CLIENTPHONE. In another source, a field for a client phone number may be named PNUMCLIENT. Schema-based data matching may use techniques such as linguistic analysis to locate and match these fields.
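By way of illustration only, name-based matching can be sketched as follows. The sketch substitutes plain character-sequence similarity (Python's standard `difflib.SequenceMatcher`) for the linguistic analysis mentioned above, which is an assumption made for brevity. The function names `name_similarity` and `best_name_match` and the candidate field names `FAXNUM` and `ADDRESS` are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for how alike two field names are.

    Assumption: raw character similarity stands in for real
    linguistic analysis of the names.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_name_match(field: str, candidates: list[str]) -> str:
    """Pick the candidate field name most similar to `field`."""
    return max(candidates, key=lambda c: name_similarity(field, c))


if __name__ == "__main__":
    # Hypothetical target schema containing the PNUMCLIENT field.
    target_fields = ["PNUMCLIENT", "FAXNUM", "ADDRESS"]
    print(best_name_match("CLIENTPHONE", target_fields))
```

The two phone fields are paired here only because both names contain the substring CLIENT; a name with no shared substring would not be found this way.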
While schema-based data matching has proven to be useful, it has limitations. Schema-based matching has difficulty matching fields when a database designer uses cryptic field names or table names. Furthermore, schema-based matching typically cannot identify matching fields when the source databases were written by designers who speak different languages. For example, one source database may have field names cryptically derived from the German language while another source database may have field names cryptically derived from the English language.
Another conventional data matching approach is instance-based data matching. Instance-based matching utilizes statistics in the form of a distribution of actual values in a data sequence as a basis for similarity computation. Instance-based data matching examines values in a field independently of the field name. One instance-based data matching approach examines overlap between values in fields of source databases. If, for example, a 100% overlap exists between a field in one source database and a field in another source database, the fields are determined to be identical and are matched. Another instance-based data matching approach examines a statistical distribution of values in a field. Fields in source databases are determined to be similar if their distributions are similar.
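The two instance-based approaches described above can be sketched as follows. The specific measures are assumptions chosen for illustration: overlap is computed as the Jaccard index of the distinct values, and distribution similarity as the intersection of the relative-frequency histograms. The function names and sample values are hypothetical.

```python
from collections import Counter


def value_overlap(a: list[str], b: list[str]) -> float:
    """Share of distinct values common to both fields (Jaccard index)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def distribution_similarity(a: list[str], b: list[str]) -> float:
    """Histogram intersection of the two relative-frequency distributions."""
    if not a or not b:
        return 0.0
    ca, cb = Counter(a), Counter(b)
    # Only values present in both fields can contribute to the score.
    shared = ca.keys() & cb.keys()
    return sum(min(ca[v] / len(a), cb[v] / len(b)) for v in shared)


if __name__ == "__main__":
    states_a = ["CA", "NY", "TX"]
    states_b = ["TX", "NY", "CA"]
    print(value_overlap(states_a, states_b))            # same value set
    print(distribution_similarity(states_a, states_b))
```

Both measures are computed over the actual values, so two fields that share no values score zero under either one, however alike the values look.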
Although instance-based data matching has proven to be useful, it also has limitations. Instance-based data matching cannot match source datasets that have disjoint data with no overlap. An example of such disjoint datasets is employee phone numbers for merging companies in which the phone numbers of each company have different area codes. With no overlap between the area codes, instance-based data matching cannot match the source fields for employee phone number. Similar issues affect matching for social security numbers, vehicle ID numbers, credit card numbers, postal codes, etc.
Conventional data matching approaches identify matching fields through field names or through field values. However, data in fields often follow a pattern that can be discovered and matched by a data matching technique. What is therefore needed is a system, a service, a computer program product, and an associated method for matching pattern-based data. The need for such a solution has heretofore remained unsatisfied.
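To make the notion of pattern-based data concrete, the following is a minimal sketch and not the method of any particular system. It assumes a simple abstraction in which every digit is replaced by `9` and every letter by `A`, with punctuation and spaces kept as-is, and then compares the distributions of the resulting patterns. The abstraction, the function names `to_pattern` and `pattern_similarity`, and the sample phone numbers are all hypothetical.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Abstract a value to its shape: digits -> '9', letters -> 'A'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and spaces unchanged
    return "".join(out)


def pattern_similarity(a: list[str], b: list[str]) -> float:
    """Histogram intersection over patterns rather than raw values."""
    if not a or not b:
        return 0.0
    pa = Counter(to_pattern(v) for v in a)
    pb = Counter(to_pattern(v) for v in b)
    shared = pa.keys() & pb.keys()
    return sum(min(pa[p] / len(a), pb[p] / len(b)) for p in shared)


if __name__ == "__main__":
    # Disjoint values: different area codes, no number in common.
    west = ["415-555-0101", "650-555-0123"]
    east = ["212-555-0147", "617-555-0199"]
    print(to_pattern(west[0]))
    print(pattern_similarity(west, east))
```

Under this assumed abstraction, the two phone fields share no values yet reduce to the same pattern, which is the kind of similarity that neither field names nor field values expose.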