Data quality in data warehouses (DWH) is a common sore point across IT departments. To ensure that a DWH is usable, data quality must be checked and data corruption addressed. Unfortunately, this data integrity checking is highly time consuming, because it is difficult to automate fully: some basic tests can be automated, but they are too rigid, and human expertise is commonly required to perform deeper qualitative checks. It is also a recurring task, since a data stream that is clean at one point in time often becomes corrupted at the source at some later point.
In one example of a data corruption issue, the following time series of values for a given field, [5, 4, 5, 4, 3, 6, 4, 5, 112, 5, 4, 8, 4, 6, 5, 10, 4, 6, ...], will probably trigger statistical monitoring techniques and raise an alert signaling that at least the value 112 is abnormal. This alert then needs to be investigated by a human expert. However, if the value 112 were replaced by 8, whether an alert is triggered would depend entirely on rigid parameterization.
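The dependence on parameterization can be illustrated with a minimal sketch. It assumes one particular statistical check, a modified z-score based on the median and the median absolute deviation (MAD); the disclosure does not name a specific technique, and the function name `flag_outliers` and the threshold values 3.5 and 2.0 are hypothetical choices made only for illustration.

```python
from statistics import median

# Series from the example above, and the same series with 112 replaced by 8.
SERIES_WITH_112 = [5, 4, 5, 4, 3, 6, 4, 5, 112, 5, 4, 8, 4, 6, 5, 10, 4, 6]
SERIES_WITH_8 = [5, 4, 5, 4, 3, 6, 4, 5, 8, 5, 4, 8, 4, 6, 5, 10, 4, 6]


def flag_outliers(values, threshold):
    """Return the values whose modified z-score exceeds `threshold`.

    Assumption: the modified z-score (0.6745 * |x - median| / MAD) stands in
    for "statistical monitoring"; it is not the method of the disclosure.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All deviations collapse to zero; this simple score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


if __name__ == "__main__":
    # The value 112 stands out at a conventional threshold.
    print(flag_outliers(SERIES_WITH_112, threshold=3.5))
    # With 112 replaced by 8, the same threshold raises no alert ...
    print(flag_outliers(SERIES_WITH_8, threshold=3.5))
    # ... while a stricter, equally arbitrary threshold does.
    print(flag_outliers(SERIES_WITH_8, threshold=2.0))
```

In this sketch the gross error is caught regardless of tuning, whereas the subtler value is reported or ignored purely as a function of the chosen threshold, which is the rigidity described above.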
Also, when dealing with non-numerical fields, statistical methods require a distance metric to be defined so that a measure of distance between two values can be computed. Such metrics can be hard to set up and maintain.
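As an illustration of what such a metric can look like for text fields, the following sketch computes the Levenshtein (edit) distance between two strings. This is only one possible choice, assumed here for the example; the disclosure does not prescribe any particular metric, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (assumed example metric)."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # A typo stays close to the intended value.
    print(levenshtein("Germany", "Germnay"))
    # Two encodings of the same country are far apart under this metric.
    print(levenshtein("DE", "Germany"))
```

The second call hints at the maintenance burden: a purely character-based metric treats semantically equivalent encodings as very different, so the metric usually has to be adapted field by field.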
Many systems claim to perform data integrity checks. Each system has its own tests and performance; however, they all have in common that they rely on fixed rule-based checks. The main disadvantages are therefore a low coverage of possible errors and a complete inability to detect errors that do not correspond to the fixed catalogue of possible errors.
A survey of different approaches and techniques was undertaken in 2000; see Erhard Rahm & Hong Hai Do, "Data Cleaning: Problems and Current Approaches", IEEE Data Engineering Bulletin, Volume 23, 2000. It shows that systems are largely organized around coping with types of errors that are listed and documented. The approaches surveyed show very little generic character, and no machine learning techniques were observed among them.
However, it is important to point out that the different techniques are complementary. Spotting documented errors in some rule-based systems allows, for example, the detection of some misspellings of common names, which can probably not be done through machine learning techniques. For global coverage of data integrity, it is therefore relevant to consider combining traditional rule-based systems with machine learning techniques.
To summarize, the issues encountered with current systems are: automatic methods that are not generic enough; the high labor cost of human checks; the low coverage of human checks; the high operational impact resulting from this low coverage; and the high labor cost of containment measures.
This disclosure proposes a system in which all of these issues are lessened by using unsupervised and supervised machine learning techniques. Automatic checks can be performed to a certain degree without any training set, and with a higher level of coverage once labeling has been performed by a human expert. The system thereby bridges the gap between fully automatic anomaly detection methods, which are too shallow, and deep human monitoring methods, which are inapplicable because they are too time consuming.