Data analysis, the process of examining data from different perspectives and summarizing it into useful information, helps an organization understand and interpret its data. For instance, data may be analyzed by grouping the objects in a dataset (e.g., clustering), by detecting one or more outliers relative to a group (e.g., anomaly detection), and the like. The techniques used to analyze data come from data mining, pattern recognition, machine learning, and the like. One technique from machine learning is the K-Nearest Neighbor (KNN) algorithm. In the KNN algorithm, the affinity, or closeness, of the objects in the dataset is determined. The affinity is expressed as a distance between objects in a feature space. Based on the determined distances, the objects are clustered and outliers are detected for data analysis.
Specifically, the KNN algorithm is a technique for finding distance-based outliers based on the distance of an object from its kth-nearest neighbor in the feature space. Each object is ranked by its distance to its kth-nearest neighbor. The farthest object is declared the outlier or, in some cases, several of the farthest objects are declared outliers. Alternatively, an object in a dataset is an outlier with respect to parameters, such as a number k of neighbors and a specified distance, if no more than k objects in the dataset are at the specified distance or less from the object. KNN is also a classification technique that uses supervised learning. An item is presented and compared to a training set with two or more classes. The item is assigned to the class that is most common among its k nearest neighbors. That is, the distance from the item to every item in the training set is computed to find the k nearest, and the majority class among those k is assigned to the item. However, finding the nearest neighbors based on distance can be computationally intensive, as it requires calculating the distance from each object under consideration to every other object in the dataset and ordering the results by distance, lowest first. This may affect the performance of a computer system in terms of memory and processing time, especially when the dataset is large.
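By way of illustration only, the brute-force procedure described above may be sketched as follows. The sketch assumes Euclidean distance and objects represented as tuples of numbers; the function names (euclidean, kth_neighbor_distances, top_outliers, knn_classify) and the sample data are hypothetical and are not taken from any particular implementation. It shows both uses mentioned above: ranking objects by the distance to their kth-nearest neighbor, and assigning an item to the majority class among its k nearest training items.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kth_neighbor_distances(points, k):
    """For each point, return the distance to its k-th nearest neighbor.

    Brute force: every point is compared with every other point and the
    distances are sorted, lowest first.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, n=1):
    """Return the indices of the n points farthest from their k-th nearest neighbor."""
    scores = kth_neighbor_distances(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def knn_classify(item, training, k):
    """Assign item the majority label among its k nearest training items.

    training is a list of (features, label) pairs. Ties in the vote are
    resolved in favor of the label encountered first among the nearest items.
    """
    nearest = sorted(training, key=lambda pair: euclidean(item, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical sample data: four clustered points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
outliers = top_outliers(points, k=2, n=1)

# Hypothetical labeled training set with two classes.
training = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((9.0, 9.0), "b"), ((10.0, 10.0), "b"), ((9.0, 10.0), "b"),
]
label = knn_classify((0.5, 0.5), training, k=3)
```

Because kth_neighbor_distances compares every object with every other object and sorts the result, its cost grows roughly quadratically with the number of objects, which is the performance concern noted above for large datasets.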