Outliers are generally regarded as observations that deviate so much from other observations of the same dataset as to arouse suspicion that they were generated by a different mechanism. See, e.g., Edwin M. Knorr and Raymond T. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets”, Proc. 24th VLDB Conf. (New York 1998). The presence of outliers in a dataset can make statistical analyses difficult because it is often unclear whether the outlier should properly be included in any such analysis. For example, one must often ask questions such as:
a. Was the value entered correctly or was there an error in the data entry?
b. Were there any experimental problems associated with the suspect value?
c. Is the outlier caused by natural diversity? If so, the outlier may be a correct value.
After answering such questions, one must decide what to do with the outlier. One possibility is that the outlier was due to chance, in which case the value should probably be kept in any subsequent analyses. Another possibility is that the outlier was due to a mistake and so it should be discarded. Yet another possibility is that the outlier was due to anomalous or exceptional conditions and so it too should be discarded. The problem, of course, is that one can never be sure which of these possibilities is correct.
No mathematical calculation will, with certainty, indicate whether the outlier came from the same population as the other members of the dataset or from a different one. But statistical treatments can help answer this question. Such methods generally first quantify how far the outlier is from the other values in the dataset. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Often, this result is then normalized by dividing it by some measure of scatter, such as the standard deviation of all values, the standard deviation of the remaining values, or the range of the data. The normalized result is then compared with a table of critical values to determine whether the result is statistically significant for the population under test.
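The distance and scatter measures listed above can be sketched in a few lines of Python. This is only an illustration: the function name `distance_measures`, the dictionary keys, and the particular pairing of each distance with a scatter measure are assumptions made for the example, not part of any named test.

```python
import statistics

def distance_measures(values, suspect_index):
    """Return three normalized distances for values[suspect_index].

    Hypothetical helper: `values` must be a list with at least three
    entries, and the scatter measures must be non-zero.
    """
    suspect = values[suspect_index]
    rest = values[:suspect_index] + values[suspect_index + 1:]

    # Three ways to quantify how far the suspect is from the other values.
    diff_all = abs(suspect - statistics.mean(values))   # vs. mean of all points
    diff_rest = abs(suspect - statistics.mean(rest))    # vs. mean of the remaining values
    diff_nearest = min(abs(suspect - v) for v in rest)  # vs. the next closest value

    # Three measures of scatter used for normalization.
    sd_all = statistics.stdev(values)       # sample SD of all values
    sd_rest = statistics.stdev(rest)        # sample SD of the remaining values
    data_range = max(values) - min(values)  # range of the data

    return {
        "mean_all_over_sd_all": diff_all / sd_all,
        "mean_rest_over_sd_rest": diff_rest / sd_rest,
        "nearest_over_range": diff_nearest / data_range,
    }

result = distance_measures([1.0, 2.0, 3.0, 10.0], suspect_index=3)
```

Each normalized value would still have to be compared against a reference table appropriate to the particular statistic before any conclusion is drawn.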
A well-known example of the above-described type of statistical calculation is Grubbs' method for assessing outliers. Note that this test does not indicate whether or not a suspect data point should be kept for further consideration, only whether or not that data point is likely to have come from the same (presumed Gaussian) population as the other values in the group. It remains for the observer to decide what to do next.
The first step in Grubbs' test is to quantify how far the outlier is from the other data points. This is done by calculating a ratio Z, as the difference between the suspected outlier and the mean of all values, divided by the standard deviation of all values (both computed with the suspect outlier included). If Z is large, the value under test is considered to be far from the others.
$$ Z = \frac{|\text{mean} - \text{value}|}{\text{SD}} $$
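As a minimal sketch, the ratio Z can be computed as follows. The function name `grubbs_z` is hypothetical, and the use of the sample standard deviation (N−1 denominator) and of the absolute difference are assumptions of this example.

```python
import statistics

def grubbs_z(values):
    """Return (Z, suspect) for the value farthest from the mean.

    Hypothetical helper. The mean and the sample standard deviation are
    both computed from all values, including the suspect outlier.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (N - 1 denominator), an assumption
    # The suspect is the most extreme value, i.e. the one farthest from the mean.
    suspect = max(values, key=lambda v: abs(v - mean))
    return abs(mean - suspect) / sd, suspect

z, suspect = grubbs_z([1.0, 2.0, 3.0, 10.0])
```

The returned Z has no meaning on its own; it still has to be checked against a critical value for the given number of data points.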
Determining whether or not Z is large requires that the calculated Z value be checked against reference tables. This is necessary because Z can never get truly large in an absolute sense. Because the suspected outlier increases both the calculated standard deviation and the difference between the value and the mean, no matter how the data are distributed, it has been shown that Z cannot get larger than (N−1)/√N, where N is the number of values. For example, if N=3, Z cannot be larger than 2/√3 ≈ 1.155 for any set of values.
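The bound (N−1)/√N can be checked numerically. The sketch below evaluates it for N=3 and compares it with the Z obtained for a three-point dataset in which the two non-suspect values coincide, a configuration chosen here because it is expected to reach the bound. The names `max_possible_z` and `data` are illustrative, and the sample standard deviation (N−1 denominator) is assumed.

```python
import math
import statistics

def max_possible_z(n):
    """Upper bound on Z for a dataset of n values: (n - 1) / sqrt(n)."""
    return (n - 1) / math.sqrt(n)

# Two identical values plus one suspect value.
data = [0.0, 0.0, 1.0]
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample SD, computed with the suspect included
z = abs(1.0 - mean) / sd

bound = max_possible_z(len(data))
print(f"Z = {z:.4f}, bound = {bound:.4f}")
```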
Recognizing this fact, Grubbs and others have tabulated critical values for Z which are used to determine whether the Z calculated for the suspected outlier is statistically significant. Thus, if the calculated value of Z is greater than the critical value tabulated for the 5% significance level, then one may conclude that there is less than a 5% chance of encountering an outlier so far from the other data points (in either direction) by chance alone, if all the data were really sampled from a single Gaussian distribution. In other words, the outlier is statistically significant at the 5% level, which is evidence, though not proof, that it does not belong to the population.
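For reference, critical values of this kind can also be expressed in terms of Student's t distribution. The expression below is the commonly stated form for a two-sided test at significance level α with N values; it is given here as background and is not taken from the tables mentioned above. The symbol Z_crit is notation introduced for this example.

```latex
Z_{\text{crit}} = \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t^{2}_{\alpha/(2N),\,N-2}}{N-2+t^{2}_{\alpha/(2N),\,N-2}}}
```

Here t with subscripts α/(2N) and N−2 denotes the upper critical value of the t distribution with N−2 degrees of freedom at significance level α/(2N). As that t value grows, the square-root factor approaches 1, so Z_crit approaches but never exceeds (N−1)/√N.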
Note that this method only works for testing the most extreme value in a sample. Note also that if the outlier is removed, one cannot simply test the next most extreme value in a similar fashion. Instead, Rosner's test should be used. In any event, once an outlier has been identified, it remains for the observer to choose whether or not to exclude that value from further analyses. Or the observer may choose to keep the outlier, but use robust analysis techniques that do not assume that data are sampled from Gaussian populations.
Other methods for determining outliers include various partitioning algorithms, k-means algorithms, hierarchical algorithms, density-based algorithms, clustering techniques, and so on. What is lacking, however, is a straightforward approach that is not computationally intensive, so that it can be applied automatically and in real time.