When overlapping or redundant information from multiple sources is integrated, inconsistencies or conflicts in the data may emerge as violations of integrity constraints on the integrated data. For example, enterprises often maintain separate applications for different departments, such as sales, billing, and order- or service-fulfillment, each storing overlapping business data. Conflicts in this data may be introduced for many reasons, including misspellings or different conventions used during data entry (e.g., a person's name may appear as “John Smith” and “J. Smith”) and different processes and time scales for performing updates (e.g., address changes may take a few days to a few months to propagate).
This problem becomes particularly evident with data warehousing or other integration scenarios because combining data makes conflicts visible, while errors in a single database can seldom be detected without inspection of the real world or other manual effort. The consequences of poor enterprise data can be severe. For telecommunication service providers, for example, errors routinely lead to problems such as failure to bill for provisioned services, delay in repairing network problems and unnecessary leasing of equipment. As a result, data sources may be integrated in order to reconcile and correct the source data. For example, revenue recovery applications compare billing and service databases to ensure that all services are billed (and presumably vice-versa).
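By way of illustration only, the revenue recovery comparison mentioned above can be sketched as a two-way set difference between the service and billing data. The record layout, customer identifiers, and the helper name `unmatched` are hypothetical and are not taken from any particular system.

```python
# Hypothetical records of the form (customer_id, service).
# `provisioned` stands in for the service database, `billed` for billing.
provisioned = {("cust-1", "dsl"), ("cust-2", "voice"), ("cust-3", "dsl")}
billed = {("cust-1", "dsl"), ("cust-3", "dsl"), ("cust-4", "voice")}


def unmatched(left, right):
    """Return the records in `left` with no counterpart in `right`, sorted."""
    return sorted(left - right)


# Services that were provisioned but are not being billed.
unbilled_services = unmatched(provisioned, billed)

# Bills with no matching provisioned service (the "vice-versa" direction).
unprovisioned_bills = unmatched(billed, provisioned)
```

Each non-empty result is a conflict that becomes visible only once the two sources are combined; neither database on its own reveals it.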
While substantial previous work has explored query answering and constraint repair in inconsistent databases, the bulk of this work restricts repair actions to inserting and deleting tuples. In these models, however, repairs of inclusion dependencies may lose important information. Recent work has introduced repairs in which attribute values are modified to restore the database to a consistent state, allowing more satisfying resolution of common constraint violations. A related area is record linkage, a broad field also known as “duplicate removal” or “merge-purge,” which refers to the task of linking pairs of records that refer to the same entity in different data sets. Record linkage is commonly applied to household information in census data, mailing lists, and medical records, among many other uses.
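The information loss noted above for inclusion dependencies can be seen in a small sketch. It assumes a hypothetical constraint that every customer name in an order table must appear in a customer table, and it contrasts a deletion-only repair with a value-modification repair. The table contents and function names are illustrative, and the nearest-string matching from Python's `difflib` is merely a stand-in for choosing a replacement value, not a technique drawn from the work cited here.

```python
from difflib import get_close_matches

# Hypothetical tables: each order is (order_id, customer_name, amount).
customers = ["John Smith", "Mary Jones"]
orders = [("o1", "John Smith", 120), ("o2", "Jon Smith", 75)]


def repair_by_deletion(orders, customers):
    """Restore the inclusion dependency by dropping violating orders."""
    return [order for order in orders if order[1] in customers]


def repair_by_modification(orders, customers):
    """Restore the dependency by rewriting the violating attribute value.

    A name with no sufficiently close match is left unchanged, so that
    tuple would still violate the constraint.
    """
    repaired = []
    for order_id, name, amount in orders:
        if name not in customers:
            candidates = get_close_matches(name, customers, n=1)
            if candidates:
                name = candidates[0]
        repaired.append((order_id, name, amount))
    return repaired


deleted = repair_by_deletion(orders, customers)
modified = repair_by_modification(orders, customers)
```

In this sketch the deletion repair discards order `o2` together with its amount, whereas the modification repair keeps the order and changes only the misspelled name.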
A number of techniques have been proposed that modify attribute values to restore constraints. See, for example, Franconi et al., “Census Data Repair: A Challenging Application of Disjunctive Logic Programming,” Proc. Logic for Programming, Artificial Intelligence and Reasoning 2001 (LPAR'01), 561-578 (2001); and J. Wijsen, “Condensed Representation of Database Repairs for Consistent Query Answering,” Int'l Conf. on Database Theory (ICDT) (2003). The applicability of these existing techniques, however, is restricted to specific databases or certain constraints. For example, Franconi et al. consider detecting and resolving conflicts for specific census databases of a fixed schema. J. Wijsen studies consistent answering of conjunctive queries in the presence of universal (full) constraints.
A need therefore exists for a method and apparatus for modifying attribute values to restore constraints using a cost-based notion of minimal repairs. A further need exists for a method and apparatus for modifying attribute values for restoring a plurality of constraints over an arbitrary number of tables. Yet another need exists for an equivalence class-based method and apparatus for modifying attribute values for restoring constraints.