The present invention relates to data sets, and more particularly to a method for identifying particular data points of interest in a large data set.
The ability to identify particular data points in a data set that are dissimilar from the remaining points in the set has useful applications in the scientific and financial fields. For example, identifying such dissimilar points, which are commonly referred to as outliers, can be used to identify abnormal usage patterns for a credit card to detect a stolen card. The points in the abnormal usage pattern associated with the unauthorized use of the stolen card are deemed outliers with respect to the normal usage pattern of the cardholder.
Conventional methods for identifying outliers typically use an algorithm that relies on a distance-based definition of outliers, in which a point p in a data set is an outlier if no more than k points in the data set are at a distance of d or less from the point p. The distance between points can be measured using any conventional metric.
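The conventional distance-based definition can be illustrated with a brief sketch. The function name `distance_based_outliers`, the choice of the Euclidean metric, and the decision not to count the point p as its own neighbor are assumptions made for illustration only; the sample data is likewise hypothetical.

```python
import math


def distance_based_outliers(points, k, d):
    """Return every point that has no more than k other points within distance d.

    Assumptions: Euclidean distance, and a point is not counted as its own neighbor.
    """
    outliers = []
    for i, p in enumerate(points):
        # Count the other points lying at a distance of d or less from p.
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= d
        )
        if neighbors <= k:
            outliers.append(p)
    return outliers


# Hypothetical data: a tight cluster of four points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = distance_based_outliers(points, k=0, d=1.0)
```

Every point is compared against every other point, and every point that satisfies the definition is reported, with no ranking among them.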
Although methods that employ the aforementioned conventional distance-based definition of outliers can be used to identify such points in large data sets, they suffer from a significant drawback. Specifically, they are computationally expensive because they identify all outliers rather than ranking them and identifying only the particular outliers that are of interest. In addition, as the size of a data set increases, conventional methods require increasing amounts of time and hardware to identify the outliers.
A new method is provided for identifying a predetermined number of outliers of interest in a large data set. The method uses a new definition of outliers in which such points are ranked in relation to their neighboring points. The method also employs new partition-based detection algorithms that partition the data points and then compute upper and lower bounds for each partition. These bounds are then used to identify and eliminate those partitions that cannot possibly contain any of the predetermined number of outliers of interest. Outliers are then computed from the remaining points residing in the partitions that were not eliminated. The present method eliminates a significant number of data points from consideration as outliers, thereby resulting in substantial savings in computational expense compared to conventional methods employed to identify such points.
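The partition-and-prune idea can be sketched as follows. This is only one possible reading under stated assumptions, none of which the text above specifies: points are ranked by the distance to their k-th nearest neighbor, partitions are cells of a uniform grid, and the bounds are derived from the minimum bounding rectangles of the partitions. All function and parameter names are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def kth_nn_dist(p, points, k):
    """Exact distance from p to its k-th nearest neighbor (assumed ranking score)."""
    dists = sorted(math.dist(p, q) for q in points)
    return dists[k]  # dists[0] is the zero distance from p to itself


def mbr(points):
    """Minimum bounding rectangle of a partition as (low corner, high corner)."""
    return tuple(map(min, zip(*points))), tuple(map(max, zip(*points)))


def min_dist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(0.0, al - bh, bl - ah) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))


def max_dist(a, b):
    """Largest possible distance between a point in box a and a point in box b."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(ah - bl, bh - al) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))


def bound(i, parts, boxes, k, dist_fn):
    """Bound on the score of every point in partition i (lower bound with
    min_dist, upper bound with max_dist)."""
    # A point is not its own neighbor, so partition i contributes one point fewer.
    ranked = sorted((dist_fn(boxes[i], boxes[j]), len(parts[j]) - (j == i))
                    for j in range(len(parts)))
    seen = 0
    for dist, count in ranked:
        seen += count
        if seen >= k:
            return dist
    raise ValueError("the data set must contain more than k points")


def top_outliers(points, k, n, cell=1.0):
    """Return the n highest-scoring points and the number of points scored exactly."""
    # Assumed partitioning: bucket points into grid cells of side `cell`.
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / cell) for c in p)].append(p)
    parts = list(grid.values())
    boxes = [mbr(part) for part in parts]
    lower = [bound(i, parts, boxes, k, min_dist) for i in range(len(parts))]
    upper = [bound(i, parts, boxes, k, max_dist) for i in range(len(parts))]

    # Threshold: at least n points are known to score at or above this value.
    covered, threshold = 0, 0.0
    for i in sorted(range(len(parts)), key=lambda i: lower[i], reverse=True):
        covered += len(parts[i])
        if covered >= n:
            threshold = lower[i]
            break

    # Eliminate partitions whose upper bound falls below the threshold.
    survivors = [p for i, part in enumerate(parts)
                 if upper[i] >= threshold for p in part]
    # Score only the surviving points (against the full data set, for simplicity).
    scored = [(kth_nn_dist(p, points, k), p) for p in survivors]
    return heapq.nlargest(n, scored), len(survivors)


# Hypothetical data: a dense 5 x 5 cluster plus two isolated points.
cluster = [(0.2 * i, 0.2 * j) for i in range(5) for j in range(5)]
data = cluster + [(10.0, 10.0), (-7.5, 4.5)]
top, examined = top_outliers(data, k=3, n=2)
```

In this example the dense cluster falls into a single partition whose upper bound lies below the threshold, so its 25 points are never scored exactly; only the two isolated points are. For brevity, the sketch scores the surviving points against the whole data set rather than against neighboring partitions only.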