It is often desirable to be able to detect instances within a set of data that may be said to differ from the rest sufficiently to constitute outliers or anomalies. For example, within the medical domain, one might wish to identify alarming changes in physiological parameters such as heart rate in patients undergoing continuous monitoring, or abnormalities in the appearance of the brain on magnetic resonance or computed tomography scans. Examples in other domains include identifying anomalous movements in time-varying data such as stock markets, or isolating defective items going through a production line based on their video snapshots. Where the data has only one dimension (e.g. each instance of data has a single scalar value) and the distribution of values is simple (e.g. a normal distribution), a criterion of anomaly is easy to derive. However, where the data is high dimensional (e.g. moving or static images, time series, volume data, etc.) and/or the distribution of values is complex (e.g. multimodal), a satisfactory criterion of anomaly is very hard to find.
One promising approach is to take a “standard” dataset known a priori to be free of anomaly and then to define what is normal or anomalous by indexing the relation of the test datum to those instances of data in the standard dataset that are the most similar to it, what might be called its neighbours. By varying the number of neighbours, k, one uses for this comparison, one can manipulate the scale of deviation from the usual or expected by which anomaly is defined.
Thus, one simple measure of the anomaly of a given data point, x, is the average distance to its k-nearest neighbours, a measure known as gamma (γ) (Harmeling, Dornhege, Tax, Meinecke, & Müller, 2006):
$$\gamma_k(x) = \frac{1}{k} \sum_{i=1}^{k} d\bigl(x, \mathrm{nn}_i(x)\bigr)$$
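The γ score defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `gamma_score`, the use of Euclidean distance for d, and the brute-force neighbour search are assumptions made here, not details taken from the cited work. It also assumes that the test point x is not itself a member of the standard dataset; otherwise x would count as its own nearest neighbour at distance zero.

```python
import math


def gamma_score(x, standard, k):
    """Average distance from x to its k nearest neighbours in `standard`.

    Assumptions (illustrative only): d is the Euclidean distance, and the
    neighbours are found by brute force over the whole standard dataset.
    """
    if not 1 <= k <= len(standard):
        raise ValueError("k must be between 1 and the size of the standard dataset")
    # Distance from x to every point of the standard dataset, smallest first.
    distances = sorted(math.dist(x, s) for s in standard)
    # Mean of the k smallest distances, i.e. (1/k) * sum_i d(x, nn_i(x)).
    return sum(distances[:k]) / k


# Hypothetical usage: a tiny anomaly-free standard set and one test point.
standard = [(1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
score = gamma_score((0.0, 0.0), standard, k=2)
```

Increasing k averages over a wider neighbourhood, which corresponds to the change in the scale of deviation described above.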
Unfortunately, γ varies with the natural density of the points: dense regions yield low values and sparse regions high ones, making the labelling of anomalies density dependent (Harmeling et al., 2006). This is illustrated in FIG. 1, where a heterogeneous synthetic two-dimensional dataset is labelled by the γ score of each point: the larger the diameter of a point, the higher its γ score. Members of the smaller, denser cluster will generally have a smaller value of γ than members of the larger, sparser cluster, even though they belong equally strongly to their respective clusters. Thus, if anomaly were to be determined by a fixed threshold of γ, either too many members of the sparse cluster would be labelled as anomalous or too few of the denser one; the score is confounded by local differences in density.
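The density dependence described above can be reproduced with a small synthetic experiment. The sketch below is not the dataset of FIG. 1; the cluster centres, spreads, sizes, the value of k, the random seed, and the choice of the pooled median as the fixed threshold are all assumptions chosen for illustration. Each point is scored against all the other points (leave-one-out), using Euclidean distance.

```python
import math
import random


def gamma_score(x, neighbours, k):
    """Mean Euclidean distance from x to its k nearest points in `neighbours`."""
    distances = sorted(math.dist(x, n) for n in neighbours)
    return sum(distances[:k]) / k


def make_cluster(rng, centre, spread, n):
    """Draw n two-dimensional Gaussian points around `centre` (hypothetical helper)."""
    return [(rng.gauss(centre[0], spread), rng.gauss(centre[1], spread))
            for _ in range(n)]


rng = random.Random(0)                                # fixed seed for repeatability
dense = make_cluster(rng, (0.0, 0.0), 0.2, 100)       # small, dense cluster
sparse = make_cluster(rng, (10.0, 10.0), 2.0, 100)    # large, sparse cluster
points = dense + sparse
k = 5

# Leave-one-out scoring: each point is compared with every other point.
scores = [gamma_score(p, points[:i] + points[i + 1:], k)
          for i, p in enumerate(points)]
dense_scores, sparse_scores = scores[:len(dense)], scores[len(dense):]

mean_dense = sum(dense_scores) / len(dense_scores)
mean_sparse = sum(sparse_scores) / len(sparse_scores)

# One fixed threshold for the whole dataset (here, the pooled median score).
threshold = sorted(scores)[len(scores) // 2]
flagged_dense = sum(s > threshold for s in dense_scores)
flagged_sparse = sum(s > threshold for s in sparse_scores)

print(f"mean gamma: dense={mean_dense:.3f}, sparse={mean_sparse:.3f}")
print(f"flagged above threshold: dense={flagged_dense}, sparse={flagged_sparse}")
```

Because both clusters are drawn from the same distribution shape and differ only in scale, their members are equally typical of their own clusters, yet a single threshold flags far more members of the sparse cluster than of the dense one.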