The subject matter described herein generally relates to managing data quality and cleansing data. Certain subject matter presented herein relates to synonym identification and standardization of addresses.
Existing data management and cleansing tools help organizations ensure that their strategic systems, including data warehouses, deliver accurate, complete information to business users across the enterprise. Equipped with trusted information, organizations can make more timely and better informed decisions. Existing tools include, for example, a graphical user interface (GUI) and capabilities that can be customized into specific business rules, and they offer some control over international names, addresses, phone numbers, birth dates, email addresses, and other descriptive fields. Existing tools are designed to discover relationships among database entries in enterprise and Internet environments, both in batch mode and in real time.
Using existing tools, companies hope to gain access to accurate, consistent, consolidated views of any individual or business entity and its relationships. Data from disparate sources can be standardized into fixed fields using business-driven rules that assign the correct semantic meaning to input data in order to facilitate matching. Once the data is standardized, matching capabilities are employed to detect duplication and other relationships in the data despite anomalous, inconsistent, and/or missing data values. A statistical matching engine can, for example, assess the probability that two or more sets of data values refer to the same business entity, which can provide more accurate match results than exact comparison of the raw values.
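By way of illustration only, the standardize-then-match flow described above might be sketched as follows. The sketch is not taken from any particular existing tool. The synonym table, the field names, and the per-field probabilities are hypothetical values chosen for the example, and the scoring is a simple log-likelihood-ratio scheme in the style of Fellegi-Sunter record linkage that assumes the fields agree or disagree independently of one another.

```python
import math
import re

# Hypothetical synonym table: street-type variants mapped to one standard form.
STREET_TYPE_SYNONYMS = {
    "st": "STREET", "str": "STREET", "street": "STREET",
    "ave": "AVENUE", "av": "AVENUE", "avenue": "AVENUE",
    "rd": "ROAD", "road": "ROAD",
}

# Assumed per-field probabilities (illustrative, not estimated from real data):
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PROBS = {
    "number": (0.95, 0.05),
    "name": (0.90, 0.01),
    "type": (0.90, 0.30),
}


def standardize_address(raw):
    """Split a free-form street address into fixed fields using simple rules."""
    tokens = re.sub(r"[.,]", " ", raw).lower().split()
    # Rule 1: a leading all-digit token is the house number.
    number = tokens[0] if tokens and tokens[0].isdigit() else None
    rest = tokens[1:] if number is not None else tokens
    # Rule 2: a trailing token found in the synonym table is the street type.
    street_type = None
    if rest and rest[-1] in STREET_TYPE_SYNONYMS:
        street_type = STREET_TYPE_SYNONYMS[rest[-1]]
        rest = rest[:-1]
    # Rule 3: whatever remains is the street name (None if nothing remains).
    name = " ".join(rest).upper() or None
    return {"number": number, "name": name, "type": street_type}


def match_probability(a, b, prior=0.5):
    """Estimate the probability that two standardized records are the same entity."""
    log_odds = math.log(prior / (1 - prior))
    for field, (m, u) in FIELD_PROBS.items():
        if a[field] is None or b[field] is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            log_odds += math.log(m / u)  # agreement raises the odds
        else:
            log_odds += math.log((1 - m) / (1 - u))  # disagreement lowers them
    return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    first = standardize_address("123 Main St.")
    second = standardize_address("123 main street")
    print(first)
    print(match_probability(first, second))
```

In this sketch, "123 Main St." and "123 main street" standardize to identical fixed fields and therefore receive a high match probability, while a record with a missing house number is still scored on its remaining fields rather than being rejected outright.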