1. Field of the Invention
The present invention relates in general to computer databases. More particularly, the present invention relates to a method and system for detecting data deviations in a large database.
2. Description of the Related Art
The importance of detecting deviations, or exceptions, in the data of a database has long been recognized, especially in the fields of machine learning and of maintaining data integrity in large databases. Deviations are often viewed as outliers, errors, or noise in the data. In the related art, statistical methods have been used for identifying outliers in numerical data. See, for example, “Understanding Robust and Exploratory Data Analysis,” D. C. Hoaglin et al., 1983. Typically, these methods first compute the average and standard deviation of the data values in the data set. The data values that lie more than three standard deviations from the average are considered deviations. However, such detection methods suffer from two major limitations: they assume that a metrical distance function exists, and they are applicable only to numerical values.
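The statistical approach described above can be illustrated with a minimal sketch. The function name `three_sigma_deviations`, the parameter `k`, and the sample data are illustrative assumptions, not part of the cited work.

```python
from statistics import mean, pstdev

def three_sigma_deviations(values, k=3.0):
    """Return the values lying more than k standard deviations from the average."""
    avg = mean(values)
    sd = pstdev(values)  # population standard deviation of the data set
    if sd == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - avg) > k * sd]

# Hypothetical sample: twenty identical ages and one extreme value.
ages = [30] * 20 + [95]
print(three_sigma_deviations(ages))  # -> [95]
```

The sketch relies on subtraction and averaging, which is why this kind of method presupposes numerical values and a metrical distance between them.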
Another approach to detecting deviations in a database is to use explicit information residing outside the data, such as data integrity constraints or predefined error/non-error patterns. Detection methods using integrity constraints are described, for instance, in “Using the New DB2: IBM's Object-Relational Database System,” Morgan Kaufmann, 1996. As an example of constraint-based detection, assuming that one of the attributes of the data is a person's age, an integrity constraint may be that the value of the attribute age must be less than 90. When this constraint is applied to the data, any person having an age of 90 or more will be detected as a deviation.
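A constraint of this kind can be sketched as a simple predicate applied to each record. The record layout, the names, and the helper `violates_age_constraint` are hypothetical and serve only to illustrate the age example above.

```python
def violates_age_constraint(record, max_age=90):
    """True when the record breaks the constraint 'age must be less than max_age'."""
    return not (record["age"] < max_age)

# Hypothetical records; only the "age" attribute is checked.
people = [
    {"name": "A", "age": 34},
    {"name": "B", "age": 90},
    {"name": "C", "age": 89},
    {"name": "D", "age": 102},
]

deviations = [p for p in people if violates_age_constraint(p)]
print([p["name"] for p in deviations])  # -> ['B', 'D']
```

The threshold of 90 is not derived from the data; it must be supplied by someone who already knows what values are plausible.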
For deviation detection methods based on predefined patterns, an example of such a pattern would be a regular expression of the form [a-zA-Z]+, meaning that a valid character string consists only of letters. By applying this pattern to the data, one can isolate those strings which do not follow this pattern as deviations. A drawback of this approach is that its success depends on how much the person specifying the pattern knows about the target data. Such knowledge must be available to that person from a source external to the data, rather than from the data itself.
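Pattern-based detection can be sketched as follows, using the letters-only pattern from the example above. The function name `pattern_deviations` and the sample strings are illustrative assumptions.

```python
import re

# Letters-only pattern: a valid string consists of one or more ASCII letters.
LETTERS_ONLY = re.compile(r"[a-zA-Z]+")

def pattern_deviations(strings):
    """Return the strings that do not match the pattern in their entirety."""
    return [s for s in strings if LETTERS_ONLY.fullmatch(s) is None]

names = ["Smith", "O'Brien", "Lee", "R2D2", ""]
print(pattern_deviations(names))  # -> ["O'Brien", 'R2D2', '']
```

Note that a legitimate name containing an apostrophe is reported as a deviation. Whether that is correct depends on knowledge about the target data that the pattern author must bring from outside the data.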
Alternatively, the problem of detecting deviations can be considered as a special case of clustering of data. In the case of deviation detection, the data is clustered into two groups: deviations and non-deviations. However, the usual clustering methods are biased toward discarding deviations as noise. Consequently, a direct application of these methods will not yield the desired deviations in the database.
As an example of this limitation, consider a popular clustering algorithm in which k clusters are to be found from the data. The algorithm starts by randomly choosing k points (or data items) as cluster centers and then examines the rest of the data points one after another. For each data point, the algorithm finds the cluster center nearest to the examined data point and assigns it to the corresponding data cluster. Once a data point is assigned to a cluster, the center for that data cluster is recomputed. If this algorithm is used to find two clusters, we are likely to end up with two clusters, each having a concentration of a substantial portion of the data points, rather than two clusters where one contains deviations (a small minority) and the other one contains non-deviations (the large majority). In addition to not yielding desired deviations, another limitation of the clustering methods is that they generally require a metrical distance function.
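The clustering procedure described above can be sketched for one-dimensional data. This is a minimal illustration under stated assumptions: the function name `sequential_k_means`, the use of the absolute difference as the distance, the fixed random seed, and the sample data are all hypothetical.

```python
import random

def sequential_k_means(points, k, seed=0):
    """Assign points one at a time to the nearest center, updating that center."""
    rng = random.Random(seed)
    order = list(points)
    rng.shuffle(order)  # random choice of the k initial centers
    centers = [float(p) for p in order[:k]]
    clusters = [[p] for p in order[:k]]
    for p in order[k:]:
        # Nearest center under the assumed distance |p - center|.
        i = min(range(k), key=lambda j: abs(p - centers[j]))
        clusters[i].append(p)
        # Recompute the center of the cluster that received the point.
        centers[i] = sum(clusters[i]) / len(clusters[i])
    return clusters

# Hypothetical data: a large majority spread over 0..99 and one extreme value.
data = list(range(100)) + [1000]
for cluster in sequential_k_means(data, k=2):
    print(len(cluster), min(cluster), max(cluster))
```

Unless the extreme value happens to be picked as an initial center, it is simply assigned to whichever of the two centers is nearer and is absorbed into that cluster, so the deviation is not isolated. The sketch also makes the dependence on a distance function explicit.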
Another approach to the problem of deviation detection is proposed by A. Arning et al., “Method and System for Linearly Detecting Data Deviations in a Large Database”, U.S. Pat. No. 5,813,002. Though this teaching does not require a metrical distance function having certain properties regarding the metrical distance between data elements in the database, it must assume the existence of a so-called “dissimilarity” function. For each itemset, defined as a set of one or more data items associated with attribute values, the dissimilarity function allows one to determine the dissimilarity of the itemset based on a statistical analysis of the attribute values. By tentatively removing all potential subsets from that itemset and determining the influence of each removal on the dissimilarity value, the teaching is able to identify deviating data items. However, as admitted by the authors, selection of the particular dissimilarity function is crucial to the success of the proposed method. A priori knowledge seems to be required to select a dissimilarity function in view of the nature of the particular itemset; the authors do not believe in a universal dissimilarity function that works well for all types of itemsets.
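The general idea of measuring how much a removal reduces the dissimilarity can be sketched as follows. This is not the method of the cited patent: it considers only the removal of single items, and it assumes the population variance as the dissimilarity function. The name `most_deviating_item` and the sample data are hypothetical.

```python
from statistics import pvariance

def most_deviating_item(items, dissimilarity=pvariance):
    """Return (index, drop) for the single item whose removal lowers dissimilarity most."""
    if len(items) < 2:
        raise ValueError("need at least two items")
    base = dissimilarity(items)
    best_index, best_drop = None, 0.0
    for i in range(len(items)):
        rest = items[:i] + items[i + 1:]  # itemset with item i tentatively removed
        drop = base - dissimilarity(rest)
        if drop > best_drop:
            best_index, best_drop = i, drop
    return best_index, best_drop

values = [10, 11, 9, 10, 50]
index, drop = most_deviating_item(values)
print(index, values[index])  # -> 4 50
```

The result depends entirely on the function passed as `dissimilarity`. Variance is only meaningful for numerical values, and a different kind of itemset would call for a different function, chosen with prior knowledge of the data.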
Therefore, there remains a significant need for a method for detecting deviations in a database that does not require any a priori knowledge of the database to be analyzed for determining any type of “distance function”, “dissimilarity function” or the like.