Corporations base important decisions on results derived from data. Verifying the accuracy of that data is therefore of utmost importance if management is to avoid faulty conclusions. Furthermore, problems with data accuracy should be detected in a timely fashion, because backtracking to fix data glitches and recover accurate analyses can be expensive and time-consuming. In fact, if data corresponding to transactions are overwritten or retrospective access is expensive, recovery of the data might not be possible at all.
Traditionally, the data sets used by statisticians were collected meticulously, either according to a pre-determined design or to answer specific substantive questions such as “Does drug A have a significant effect on reducing the symptoms of disease X?”. The amount of data collected was generally small, the data were measured carefully and repeatedly, and the analyst had a fair idea of what the measured values should be. Anomalies were easy to detect just by examining the raw data or scatter plots. However, the size and complexity of the large data sets commonly encountered today render visual scanning infeasible as a screening method. Some analysts have adopted ad hoc methods borrowed from quality control techniques used in manufacturing. While effective for process control in engineering and manufacturing, these methods do not translate well to immediate detection of inconsistencies in large, complex data sets, especially for multivariate data.
Data glitches are abnormal patterns that are either aberrations from historical behavior, inconsistencies in sections of the data, or departures from acceptable tolerance limits. Earlier work has focused on detecting systematic changes, which usually show up as differences that can be attributed to a substantive reason (e.g., “new subscribers have different usage patterns”) and are accentuated over time. Data glitches, on the other hand, tend to be scattered erratically across the data space and are not persistent over time (i.e., the cause of the data glitch usually disappears).
Data quality control in most instances is still implemented by the use of conventional quality control (“QC”) charts. Well-known examples of quality control charts are Shewhart charts (e.g., X-charts and R-charts), named after W. A. Shewhart, who first proposed them in 1938. There are many variations of Shewhart charts, as well as other charts and tools such as the Cumulative Sum, the Operating Characteristics Curve, Average Run Length, p-chart and others, designed for different QC situations, including adjusting for trends over time.
Quality control charts are aimed at detecting a process that drifts out of control over time. Typically, no action is taken unless there is a run of abnormal process outcomes. Moreover, sampling plays a critical role in the implementation of the charts: both the method of sampling and the sample sizes strongly influence the conclusions drawn from the charts. The assumption of normality is indirectly required.
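To make the idea of a Shewhart X-chart concrete, the following is a minimal sketch, not taken from the original text, of how control limits are conventionally computed from subgroup means and ranges. The function names `xbar_limits` and `out_of_control` are hypothetical, the sketch assumes subgroups of exactly five measurements (for which the standard tabulated chart constant is A2 = 0.577), and the sample values are invented for illustration.

```python
from statistics import mean

# Standard tabulated X-bar chart constant for subgroups of size 5.
# The sketch below assumes every subgroup has exactly 5 measurements.
A2_N5 = 0.577


def xbar_limits(subgroups):
    """Return (lower limit, center line, upper limit) for an X-bar chart."""
    means = [mean(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    center = mean(means)   # grand mean of the subgroup means
    r_bar = mean(ranges)   # average within-subgroup range
    return center - A2_N5 * r_bar, center, center + A2_N5 * r_bar


def out_of_control(baseline_subgroups, new_subgroup):
    """Flag a new subgroup whose mean falls outside the control limits."""
    lcl, _, ucl = xbar_limits(baseline_subgroups)
    m = mean(new_subgroup)
    return m < lcl or m > ucl


# Hypothetical baseline samples collected while the process was in control.
baseline = [
    [10, 11, 9, 10, 10],
    [10, 10, 11, 9, 10],
    [9, 10, 10, 11, 10],
]
lcl, center, ucl = xbar_limits(baseline)
shifted = out_of_control(baseline, [12, 13, 12, 12, 13])
normal = out_of_control(baseline, [10, 10, 11, 10, 9])
```

Note that the limits depend entirely on how the subgroups were sampled and sized, which is the sensitivity to sampling described above.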
A significant portion of data quality research has focused on managing and implementing data quality processes. Recently, there has been an emphasis on data warehouses and on monitoring and measuring the information that resides in them. Most commercial and academic efforts in the database community are focused on merging/purging/deleting duplicates and on issues related to name and address matching. In the statistical community, the focus is on quality control methods borrowed from process control charts. Researchers have also proposed extensions of control charts to multivariate settings.
While previously known QC methods have been used with success in industry, such methods are unsuited for detecting glitches within large, high-dimensional data sets because of:
Heterogeneity: Large data sets tend to be heterogeneous.
Localized changes: Averages tend to be very stable, not easily moved by changes in small subsections of the data. Therefore charts based on overall aggregates will not detect glitches due to localized changes.
Large number of attributes: Multivariate control charts are rare, especially without the normality assumption. Computing simultaneous confidence intervals is hard and visualizing them even harder. Bivariate charts for normally distributed data have been previously discussed. Depth-based control charts have been proposed for multivariate data using ranks to create analogs of the traditional quality control charts.
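The point about localized changes can be illustrated with a small sketch. It is an illustration only, under invented assumptions: the data are hypothetical (10,000 records of constant value 50.0, split into 100 equal segments), the helper names `segment_means` and `relative_shift` are hypothetical, and the "glitch" simply doubles the values in one segment.

```python
from statistics import mean


def segment_means(values, segment_size):
    """Mean of each consecutive block of `segment_size` records."""
    return [
        mean(values[i:i + segment_size])
        for i in range(0, len(values), segment_size)
    ]


def relative_shift(before, after):
    """Relative change of `after` with respect to `before`."""
    return abs(after - before) / abs(before)


# Hypothetical clean data: 100 segments of 100 records each.
clean = [50.0] * 10_000

# Hypothetical glitch: double the values in one segment (records 2000-2099).
glitched = list(clean)
for i in range(2000, 2100):
    glitched[i] *= 2

# The overall average moves by only about 1 percent ...
overall_shift = relative_shift(mean(clean), mean(glitched))

# ... while the affected segment's own average moves by 100 percent.
segment_shifts = [
    relative_shift(b, a)
    for b, a in zip(segment_means(clean, 100), segment_means(glitched, 100))
]
worst_segment = max(range(len(segment_shifts)), key=segment_shifts.__getitem__)
```

In this toy setting a chart built on the overall mean would see a shift of roughly one percent, whereas comparing segment by segment isolates segment 20 immediately.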
In the context of automatic data screening, two other considerations are important:
Immediate detection of glitches, rather than detection over time or over a sequence of samples.
Isolating the areas or sections of the data (such as heavy users, long calls, revenue) that become corrupted because of the glitch.
Data mining methods are especially suited for the automatic detection of glitches in massive data. Such methods:
scale well to large data sets,
isolate abnormal patterns, and
do not (usually) make distributional assumptions.
Thus, a data mining method is needed that automatically and quickly (e.g., in linear time) detects data glitches over large data sets, isolates problematic sections of data, and is widely applicable.