Modern society increasingly relies on a rapidly expanding storehouse of data, and the accuracy of that data is of increasing importance to the functioning of modern enterprises, such as electricity utilities that operate smart grids. Data can become corrupted through human error when it is entered into a system by hand, or when it is inaccurately detected by smart grid sensors. When smart grid data contains errors, an electricity utility may have difficulty accurately identifying theft of electricity, capturing historical load variation and capacity utilization, and determining which premises are connected to equipment identified as an outage source.
There are different conventional techniques for finding and correcting errors within data. For example, automated rule development techniques include mining data to form association rules and mining data for conditional functional dependencies (CFDs). However, there is a general consensus in the field that association rules are inadequate for addressing data quality problems in large databases. The process of mining data for CFDs is emerging as a more promising approach to automated data quality detection and correction.
CFDs are rules that enforce patterns of semantically related constants within the data. FIG. 1 provides an example of a simple CFD. In this case, the input data points 101 and 102 have three attributes: a country code (CC), a state (S), and an area code (AC). A data set made up of such data points could be part of a database keeping track of the locations of an enterprise's customers. CFD 100 checks data based on the fact that if a country code is 01 for the United States, and an area code is 408, then the accompanying state should be California. Applying data input 101 to CFD 100 will result in a passing output value 103, whereas applying data input 102 to CFD 100 will result in a failing output value 104.
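The check performed by a CFD such as CFD 100 can be illustrated with a minimal sketch. The dictionary encoding, the function name `satisfies`, and the two sample records are assumptions made for illustration only; the actual attribute values of data inputs 101 and 102 are not given in the text.

```python
# Hypothetical encoding of the rule described for CFD 100:
# if CC = 01 and AC = 408, then S must be CA.
CFD_100 = {
    "if": {"CC": "01", "AC": "408"},
    "then": {"S": "CA"},
}


def satisfies(record, cfd):
    """Return True if the record passes the CFD, False if it violates it."""
    # The rule only constrains records that match every "if" constant.
    applies = all(record.get(attr) == value for attr, value in cfd["if"].items())
    if not applies:
        return True
    # When the rule applies, every "then" constant must also match.
    return all(record.get(attr) == value for attr, value in cfd["then"].items())


# Illustrative records (assumed values, not taken from FIG. 1).
passing_record = {"CC": "01", "S": "CA", "AC": "408"}
failing_record = {"CC": "01", "S": "NY", "AC": "408"}
unrelated_record = {"CC": "44", "S": "", "AC": "020"}

print(satisfies(passing_record, CFD_100))
print(satisfies(failing_record, CFD_100))
print(satisfies(unrelated_record, CFD_100))
```

In this sketch, a record that does not match the condition (here, a different country code) is treated as passing, because the rule places no constraint on it.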
There are two main drawbacks to the approach of automating the discovery of CFDs. The first is that the number of CFDs that could possibly be applied to a data set increases exponentially with the number of attributes in the data set. This results in a nearly prohibitive increase in the complexity of such a technique. In the example above, with a relatively simple set of three attributes, there could still be 12 functional dependencies. The number of possible CFDs would greatly exceed that number multiplied by the more than 270 area codes in service in the United States. The second is that current automated discovery techniques are unable to handle noisy data.
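The growth in candidate rules can be made concrete with a short sketch. One counting that yields the figure of 12 for three attributes, assumed here because the text does not state how the figure was derived, pairs each attribute as the dependent side with every subset (including the empty set) of the remaining attributes as the determining side. The function name `candidate_fds` is hypothetical.

```python
from itertools import combinations


def candidate_fds(attributes):
    """Enumerate candidate functional dependencies (lhs -> rhs).

    Assumed counting: each attribute may be the right-hand side, and the
    left-hand side is any subset of the other attributes, including the
    empty set. This gives n * 2**(n - 1) candidates for n attributes.
    """
    fds = []
    for rhs in attributes:
        others = [a for a in attributes if a != rhs]
        for size in range(len(others) + 1):
            for lhs in combinations(others, size):
                fds.append((lhs, rhs))
    return fds


print(len(candidate_fds(["CC", "S", "AC"])))  # 3 * 2**2 = 12

# The count grows exponentially as attributes are added.
for n in (3, 5, 10):
    names = [f"A{i}" for i in range(n)]
    print(n, len(candidate_fds(names)))
```

Under this counting, ten attributes already give 5120 candidate dependencies before any constants, such as individual area codes, are bound to them to form CFDs.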