Detecting data quality problems, especially within enterprise data, is important: without quality data to begin with, accurate business intelligence cannot be derived, nor can critical strategies be executed.
One challenge in enterprise data management (EDM) involves comparing data obtained from various (e.g., internal and external) data sources. In many circumstances, these data sources use inconsistent terms and definitions to describe data, making it difficult to compare or exchange data across different sources, to automate business processes, and to provide a uniform data structure for data consumption and analysis by other (e.g., ERP) applications. Difficulties in data mapping and cross-referencing often follow. Normalization of terms and definitions at the data attribute level is referred to as the metadata component of EDM and is an essential prerequisite for effective data management.
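As an illustration of normalization at the data attribute level, the following minimal Python sketch maps source-specific attribute names to a shared vocabulary so that records from different sources become directly comparable. The synonym table, the attribute names, and the `normalize_record` helper are all hypothetical and chosen only for illustration; they are not drawn from any particular EDM product.

```python
# Hypothetical synonym table: each canonical attribute name is paired with
# the source-specific names assumed (for this sketch) to mean the same thing.
CANONICAL_ATTRIBUTES = {
    "phone_number": {"phone", "tel", "telephone_no"},
    "postal_code": {"zip", "zip_code", "postcode"},
}

# Invert the table once so that each alias can be looked up directly.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in CANONICAL_ATTRIBUTES.items()
    for alias in aliases
}


def normalize_record(record):
    """Return a copy of `record` with known attribute names made canonical.

    Names are compared case-insensitively; unknown names are kept
    (lower-cased) so that no data is silently dropped.
    """
    normalized = {}
    for name, value in record.items():
        key = name.strip().lower()
        normalized[ALIAS_TO_CANONICAL.get(key, key)] = value
    return normalized


# Two sources describing the same data with inconsistent attribute names.
internal = {"Tel": "555-0100", "ZIP": "12345"}
external = {"telephone_no": "555-0100", "postcode": "12345"}

# After normalization the two records can be compared field by field.
same = normalize_record(internal) == normalize_record(external)
```

In this sketch only the attribute names are normalized; the values themselves are left untouched, so format problems inside the values still have to be detected separately.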
Difficulties abound, however. One technical problem is that special skills, such as expert knowledge of a large number of data formats and syntaxes, are often required to discern potential data quality problems. For example, a user usually needs to know what a full P.O. Box address in Japan looks like in order to determine whether a given address is likely to be correct. Other examples include the format of landline numbers in Brazil, the syntax of residential postal addresses in China, and the naming conventions for street names in Japan.
Another technical problem is that, even with expert knowledge, examining enterprise data manually is both time- and resource-consuming. For example, it may take weeks (or even months) for a data analyst to go through spreadsheets containing phone numbers collected from 10 cities in South Korea, in order to make sure these data can (or should) be used to form best records.
There is therefore a need for improved techniques for detecting potential data quality problems.