Outlier detection is an important problem for very high dimensional data sets. Much of the recent work has focused on finding outliers in data sets of relatively low dimensionality, for example, up to 10 or 20 dimensions. However, typical applications in which outliers must be detected may involve much higher dimensionality, for example, 100 or 200. For such applications, more effective outlier detection techniques are required.
Many data mining algorithms described in the literature find outliers as a side-product of clustering algorithms. Such techniques typically find outliers based on their nuisance value rather than using techniques which are focused on detecting deviations, see, e.g., A. Arning et al., “A Linear Method for Deviation Detection in Large Databases,” Proceedings of the KDD Conference, 1995. Outliers are, however, quite useful for finding behavior which deviates significantly from the norm. In this invention, we carefully distinguish between the two, and develop algorithms which generate only outliers which are based on their deviation value.
Although the outlier definition described in S. Ramaswamy et al., “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proceedings of the ACM SIGMOD Conference, 2000 has some advantages over that provided in E. Knorr et al., “Algorithms for Mining Distance-based Outliers in Large Data Sets,” Proceedings of the VLDB Conference, September 1998, both suffer from the same inherent disadvantage of treating the entire data set in a uniform way. However, different localities of the data may contain clusters of varying density. Consequently, a new technique was proposed in M. M. Breunig et al., “LOF: Identifying Density-Based Local Outliers,” Proceedings of the ACM SIGMOD Conference, 2000, which finds outliers based on their local neighborhoods, particularly with respect to the densities of these neighborhoods. This technique has some advantages in accounting for local levels of skew and abnormality in data collections. In order to compute the outlier factor of a point, the method in the M. M. Breunig et al. reference computes the local reachability density of a point o by using the average smoothed distances to a certain number of points in the locality of o.
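By way of illustration only, the local reachability density and the resulting outlier factor may be sketched as follows. This is a minimal sketch based on the commonly stated LOF definitions, not the implementation of the M. M. Breunig et al. reference. The function names, the use of Euclidean distance, the use of exactly k neighbors with ties broken by index, and the assumption that all points are distinct are choices made for this example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of points[i].
    Assumption: exactly k neighbors are kept; ties are broken by index."""
    dists = sorted(
        (euclidean(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def local_outlier_factors(points, k):
    """Outlier factor of every point (assumes all points are distinct)."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nbrs[-1][0] for nbrs in neighbors]
    # Local reachability density: inverse of the average "smoothed"
    # (reachability) distance from a point to its neighbors.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # Outlier factor: average neighbor density relative to the point's own.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i])
        for i in range(n)
    ]

# Hypothetical data: a 3 x 3 grid of points plus one distant point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = local_outlier_factors(grid + [(10.0, 10.0)], k=3)
print(scores)
```

In this sketch, points inside the grid receive factors close to 1, while the distant point receives a much larger factor because its neighbors are far denser than it is.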
Thus, the techniques proposed in the above-cited E. Knorr et al. reference, the S. Ramaswamy et al. reference and the M. M. Breunig et al. reference try to define outliers based on the distances in full dimensional space in one way or another. Recent results have also shown that when the distances between pairs of points are measured in the full dimensional space of high dimensional data, all pairs of points are almost equidistant, see, e.g., K. Beyer et al., “When Is Nearest Neighbor Meaningful?” Proceedings of the ICDT, 1999. In such cases, it becomes difficult to use these measures effectively, since it is no longer clear whether or not they are meaningful. In the context of the algorithms proposed in the above-cited E. Knorr et al. reference, a very small variation in the distance threshold d can result in either all points being considered outliers or no point being considered an outlier. The definition in the S. Ramaswamy et al. reference is slightly more stable since it does not rely on the use of such a parameter, which is difficult to pick a priori. However, for high dimensional problems, the meaningfulness of the k-nearest neighbor is itself in doubt; therefore, the quality of outliers picked by such a method may be difficult to estimate. The same problem is relevant for the method discussed in the M. M. Breunig et al. reference in a more subtle way, since the local densities are defined using full dimensional distance measures.
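The near-equidistance effect can be observed with a small experiment. The following is a minimal sketch, assuming uniformly distributed synthetic data and Euclidean distance; the function name relative_contrast, the sample size and the seed are hypothetical choices for this example and do not come from the cited references. It measures how much farther the farthest point is than the nearest point from a query, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one random query
    point to n_points random points drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)

low_dim_contrast = relative_contrast(2)
high_dim_contrast = relative_contrast(200)
print(low_dim_contrast, high_dim_contrast)
```

Under these assumptions the contrast is large in 2 dimensions and small in 200 dimensions. When the contrast is small, all distances fall in a narrow band, which is consistent with the observation that a slight change in a distance threshold d can move every point across the threshold at once.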
For problems such as clustering, it has been shown (e.g., in C. C. Aggarwal et al., “Fast Algorithms for Projected Clustering,” Proceedings of the ACM SIGMOD Conference, 1999 and C. C. Aggarwal et al., “Finding Generalized Projected Clusters in High Dimensional Spaces,” Proceedings of the ACM SIGMOD Conference, 2000) that by examining the behavior of the data in subspaces, it is possible to design more meaningful clusters which are specific to the particular subspace in question. This is because different localities of the data are dense with respect to different subsets of attributes. By defining clusters which are specific to particular projections of the data, it is possible to design more effective techniques for finding clusters. The same insight holds for outliers, because in typical applications such as credit card fraud detection, only the subset of the attributes which is actually affected by the abnormality of the activity is likely to be useful in detecting the behavior.
In order to more fully explain this point, let us consider the example illustrated in FIGS. 1A-1D. In the example, we have shown several 2-dimensional cross-sections of a very high dimensional data set. It is quite likely that for high dimensional data, many of the cross-sections may be structured, whereas others may be more noisy. For example, the points A and B show abnormal behavior in views 1 (FIG. 1A) and 4 (FIG. 1D) of the data. In other views, i.e., views 2 (FIG. 1B) and 3 (FIG. 1C), the points show average behavior. In the context of a credit card fraud application, both the points A and B may correspond to different kinds of fraudulent behavior, yet may show average behavior when distances are measured in all the dimensions. Thus, by using full dimensional distance measures, it would be more difficult to detect points which are outliers, because of the averaging behavior of the noisy and irrelevant dimensions. Furthermore, it is impossible to prune off specific features a priori, since different points (such as A and B) may show different kinds of abnormal patterns, each of which uses different features or views.
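The averaging effect of noisy dimensions can be sketched numerically. The following example is illustrative only and uses synthetic data, not the data of FIGS. 1A-1D. It assumes 100 attributes, of which attributes 0 and 1 form a structured view in which most points are tightly clustered, while the remaining attributes are uniform noise. The point called point_a, the helper names and the use of nearest-neighbor distance as an outlier score are all assumptions made for this example.

```python
import math
import random
import statistics

def nn_distance(points, i, dims):
    """Distance from points[i] to its nearest neighbor, measured using
    only the attributes listed in dims."""
    return min(
        math.sqrt(sum((points[i][d] - points[j][d]) ** 2 for d in dims))
        for j in range(len(points)) if j != i
    )

def outlier_ratio(points, target, dims):
    """Nearest-neighbor distance of the target point divided by the
    median nearest-neighbor distance over all points."""
    scores = [nn_distance(points, i, dims) for i in range(len(points))]
    return scores[target] / statistics.median(scores)

rng = random.Random(0)
DIM = 100
# Attributes 0 and 1: tight cluster. Attributes 2..99: uniform noise.
points = [
    [rng.gauss(0.5, 0.02), rng.gauss(0.5, 0.02)]
    + [rng.random() for _ in range(DIM - 2)]
    for _ in range(100)
]
# point_a is abnormal only in the structured view (attributes 0 and 1).
point_a = [0.9, 0.9] + [rng.random() for _ in range(DIM - 2)]
points.append(point_a)
a = len(points) - 1

view_ratio = outlier_ratio(points, a, dims=[0, 1])
full_ratio = outlier_ratio(points, a, dims=range(DIM))
print(view_ratio, full_ratio)
```

Under these assumptions, the ratio computed in the 2-dimensional view is far above 1, whereas the ratio computed over all 100 attributes stays close to 1, so the same point appears unremarkable when full dimensional distances are used.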
Thus, the problem of outlier detection becomes increasingly difficult for very high dimensional data sets, just as for other problems in the literature such as clustering, indexing, classification, or similarity search. Previous work on outlier detection has not focused on the high dimensionality aspect of the problem, and has used methods which are more applicable for low dimensional problems by using relatively straightforward proximity measures, e.g., the above-mentioned E. Knorr et al. and S. Ramaswamy et al. references. The high dimensional case is very important for practical data mining applications, which are most likely to arise in the context of very large numbers of features. The present invention focuses for the first time on the effects of high dimensionality on the problem of outlier detection. Recent work has discussed some of the concepts of defining the intensional knowledge which characterizes distance-based outliers in terms of subsets of attributes. Unfortunately, this technique was not intended for high dimensional data, and the complexity increases exponentially with dimensionality. As the results in E. Knorr et al., “Finding Intensional Knowledge of Distance-based Outliers,” Proceedings of the VLDB Conference, September, 1999 show, even for relatively small dimensionalities of 8 to 10, the technique is highly computationally intensive. For even slightly higher dimensionalities, the technique is likely to be infeasible from a computational standpoint.