Modern society is characterized by an increasing reliance on the accuracy of a rapidly expanding storehouse of data. The IDC determined that the amount of data produced worldwide in 2007 was 281 exabytes, representing a 56% year-over-year increase from 2006. At the same time, the accuracy of this data is of increasing importance for the functioning of modern enterprises. Recently, the United States Government was embarrassed when a publicly accessible database indicated that several grants of money from a recovery program were distributed to Congressional Districts that did not exist. Aside from causing embarrassment and confusion, poor data quality can cause serious economic harm. Data can become corrupted due to human error as it is entered into a system by hand, or as it is captured by human-designed sensors. Since human error is unavoidable, so is the potential corruption of the data that society relies upon.
Given the increasing amount of data that large organizations are forced to deal with, several companies provide products and services that help to screen large databases for errors and correct them. Such companies are generally called data quality vendors and the service they provide of screening and correcting databases is called data quality enhancement. Data quality enhancement is generally an automated process, wherein a computer screens through all of the data in an electronic storage database and automatically flags or deletes data values that appear to be erroneous.
The critical task in data quality enhancement is the identification of rules that validate, cleanse, and govern poor quality data. To use the example of the government relief program mentioned above, a sufficient rule would be that any entry for a district where money is being spent should also appear in a list of all the congressional districts in the United States. Data quality rules can be identified using either manual or automated development. Manual development involves a data or business analyst leveraging the input of a subject matter expert (SME), or utilizing a data profiling tool.
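The congressional district rule described above can be sketched as a membership check against a reference list. This is a minimal illustration only: the district codes, the record layout, and the function name find_invalid_districts are hypothetical and do not come from any actual government database.

```python
# Hypothetical reference list; a real one would hold every valid district.
VALID_DISTRICTS = {"CA-12", "NY-14", "TX-07"}


def find_invalid_districts(records, valid_districts):
    """Return the records whose district is absent from the reference list."""
    return [r for r in records if r["district"] not in valid_districts]


# Made-up grant records; "CA-99" stands in for a district that does not exist.
grants = [
    {"grant_id": 1, "district": "CA-12"},
    {"grant_id": 2, "district": "CA-99"},
    {"grant_id": 3, "district": "TX-07"},
]

flagged = find_invalid_districts(grants, VALID_DISTRICTS)
print([r["grant_id"] for r in flagged])
```

Only the record naming the nonexistent district is flagged; the rule says nothing about whether the remaining entries are otherwise correct.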
SMEs are persons who understand the characteristics of data sets that encompass information within their field of expertise. For example, a data analyst may leverage an SME in the utilities field to learn that meters have serial numbers that are usually recorded incorrectly, and are connected to transformers with serial numbers that are related to the serial numbers of the meters. The analyst would then be able to take this information and create a data quality rule that screens for serial numbers in a data set that do not fit the pattern described.
Data profiling tools are computer programs that examine data of interest to report statistics such as frequency of a value, percentage of overlap between two columns, and other relationships and values inherent in the data. Examples of data profiling tools include TS Discovery, Informatica IDE/IDQ, and Oracle Data Integrator. The information gleaned from a data profiling tool can indicate potential quality problems. Analysts use the information they obtain from the use of a data profiling tool to manually create rules that can enhance the quality of the examined data.
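Two of the statistics named above, the frequency of a value and the percentage of overlap between two columns, can be sketched as follows. This is an illustrative approximation and not the method of any product listed; the column names and the choice to measure overlap over distinct values of the first column are assumptions.

```python
from collections import Counter


def value_frequencies(column):
    """Count how often each value occurs in a column."""
    return Counter(column)


def overlap_percentage(column_a, column_b):
    """Percentage of distinct values in column_a that also occur in column_b."""
    distinct_a = set(column_a)
    if not distinct_a:
        return 0.0
    return 100.0 * len(distinct_a & set(column_b)) / len(distinct_a)


# Hypothetical columns from a customer table.
billing_states = ["CA", "CA", "NY", "TX"]
shipping_states = ["CA", "NY", "WA"]

freq = value_frequencies(billing_states)
overlap = overlap_percentage(billing_states, shipping_states)
print(freq["CA"], round(overlap, 1))
```

Statistics like these only hint at a problem. An analyst still has to decide, for instance, whether an unexpectedly low overlap reflects bad data or a legitimate difference between the columns.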
Some profilers, such as Informatica Data Explorer, can automatically infer basic data quality rules on their own. For example, they can set a rule for which columns cannot have null values. However, this is a particularly simple data quality rule. Null value entries are the easiest type of error to detect because they are clearly indicative of a data entry oversight and they do not have values equivalent to any possible correct entry. Other profilers, such as TS Discovery and Informatica Data Quality, provide out-of-the-box rules for name and address validation. These rules are also somewhat rudimentary because addresses are characteristically regimented, are a quintessential element of large commercial databases, and follow tight patterns. Available data profilers do not contain rules that target more complex or more client-specific quality problems.
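To show why a null-value rule is simple to infer, the sketch below proposes a "not null" rule for every column that contains no missing value in a sample. It is an assumption about how such inference could work, not a description of how any named profiler implements it; the function name and sample rows are hypothetical.

```python
def infer_not_null_columns(rows):
    """Return, sorted, the columns that hold no None value in any row."""
    if not rows:
        return []
    columns = rows[0].keys()
    return sorted(
        c for c in columns if all(row.get(c) is not None for row in rows)
    )


# Hypothetical sample rows; "phone" is missing in the first row.
rows = [
    {"id": 1, "name": "Ann", "phone": None},
    {"id": 2, "name": "Bob", "phone": "555-0100"},
]

not_null = infer_not_null_columns(rows)
print(not_null)
```

A rule inferred this way is only as reliable as the sample: a column that happens to be fully populated in the sample is not necessarily mandatory in the wider data set.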
Both forms of obtaining information for the manual development of data quality rules have their drawbacks. Modern data profiling tools are extremely powerful and can present an analyst with a mountain of data characteristics and inter-relationships within a dataset. However, the creation of actionable data quality rules will still require the time-consuming and non-trivial process of interpreting and applying the acquired statistics. Acquiring information from an SME can also be time-consuming and difficult, given that the information must often be gleaned through a personal interview, which requires man-hours from both the analyst and the SME. For obvious reasons, it is also time-consuming for an analyst to short-circuit interactions with an SME and attempt to become proficient in the databases of a given area on their own.
Automated rule development methodologies have been described in the academic literature. Such methods include most prominently mining data to form association rules and mining data for conditional functional dependencies (CFDs). There is a general consensus in the field that association rules are inadequate for addressing data quality problems in large databases. The process of mining data for CFDs is emerging as a more promising approach to automated data enhancement.
CFDs are rules that enforce patterns of semantically related constants. FIG. 1 provides an example of a simple CFD. In this case, the input data points 101 and 102 have three attributes, which are a country code (CC), a state (S), and an area code (AC). A data set made up of such data points could be part of a database keeping track of the locations of an enterprise's customers. CFD 100 checks data based on the fact that if a country code is 01 for the United States, and an area code is 408, then the accompanying state should be California. Applying data input 101 to CFD 100 will result in a passing output value 103, whereas applying data input 102 to CFD 100 will result in a failing output value 104.
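The check performed by CFD 100 can be sketched as a condition on some attributes paired with a required value for another. The attribute values assigned to inputs 101 and 102 below are assumed for illustration, since the text does not state them, and treating records outside the condition as passing is likewise an assumption of this sketch.

```python
def check_cfd(record, condition, consequence):
    """Return True if the record satisfies the CFD.

    A record that does not match the condition falls outside the rule's
    scope and is treated here as passing (an assumption of this sketch).
    """
    if all(record.get(k) == v for k, v in condition.items()):
        return all(record.get(k) == v for k, v in consequence.items())
    return True


# CFD 100: if CC is 01 and AC is 408, then S should be California.
CFD_100 = ({"CC": "01", "AC": "408"}, {"S": "CA"})

# Assumed attribute values for the two inputs of FIG. 1.
input_101 = {"CC": "01", "AC": "408", "S": "CA"}
input_102 = {"CC": "01", "AC": "408", "S": "NY"}

output_103 = check_cfd(input_101, *CFD_100)  # passes
output_104 = check_cfd(input_102, *CFD_100)  # fails
print(output_103, output_104)
```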
There are two main drawbacks to the approach of automating the discovery of CFDs. The first is that the number of CFDs that could possibly be applied to a data set increases exponentially with an increase in the number of attributes in the data set. This results in a nearly prohibitive increase in the complexity of such a method. In the example above, with a relatively simple set of three attributes there could still be 12 functional dependencies. The number of possible CFDs would greatly exceed that number multiplied by the more than 270 area codes in service in the United States. The second is that current automated discovery methods are unable to handle noisy data.
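The exponential growth can be made concrete by enumerating candidate dependencies. The sketch below uses one counting convention that reproduces the figure of 12 for three attributes: each attribute may be the dependent attribute, and any subset of the remaining attributes, including the empty set, may determine it. The text does not say how its figure was derived, so this convention is an assumption.

```python
from itertools import combinations


def candidate_dependencies(attributes):
    """Enumerate (lhs, rhs) pairs: rhs is one attribute, lhs any subset of the rest."""
    result = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                result.append((lhs, rhs))
    return result


count_three = len(candidate_dependencies(["CC", "AC", "S"]))
count_ten = len(candidate_dependencies([f"A{i}" for i in range(10)]))
print(count_three, count_ten)
```

Under this convention the count is n times 2 to the power (n - 1) for n attributes, so three attributes give 12 candidates and ten attributes give 5,120, before any constant values such as individual area codes are considered.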