Cluster analysis, or clustering, is the task of grouping a set of data points in such a way that points in the same cluster (i.e., the same group) are more similar, in some defined sense, to each other than to those in other clusters. Cluster analysis is frequently employed in exploratory data mining and statistical data analysis, and is useful in many fields.
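As a rough illustration of "more similar to each other than to those in other clusters", the following minimal sketch compares average distances within and between two groups. It assumes Euclidean distance as the similarity measure; the helper `mean_pairwise_distance` and the toy data are hypothetical and chosen only for this example.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points_a, points_b=None):
    """Mean Euclidean distance within one group, or between two groups."""
    if points_b is None:
        # All unordered pairs inside a single group.
        pairs = list(combinations(points_a, 2))
    else:
        # All cross pairs between two groups.
        pairs = [(a, b) for a in points_a for b in points_b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: two well-separated groups in the plane.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

within_a = mean_pairwise_distance(cluster_a)
within_b = mean_pairwise_distance(cluster_b)
between = mean_pairwise_distance(cluster_a, cluster_b)

# For a sensible grouping, both within-group means should be
# smaller than the between-group mean.
print(within_a < between and within_b < between)
```

A different similarity measure (for example, a non-Euclidean distance) could rank the same grouping differently, which is why the definition above leaves the sense of similarity open.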
Cluster analysis can be performed with various algorithms that differ significantly in what they consider a cluster to be and in how they discover such clusters. Typical notions of a cluster include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Cluster analysis can therefore be regarded as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. With currently available approaches, it is generally necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
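The dependence on parameter settings can be sketched with a deliberately simple method. The example below is an assumption made for illustration, not an algorithm named in the text: it links any two points whose Euclidean distance is at most a threshold and reports the resulting connected components (a single-linkage-style grouping). The function `threshold_clusters`, the parameter `max_gap`, and the toy data are all hypothetical.

```python
from math import dist


def threshold_clusters(points, max_gap):
    """Group point indices into connected components, linking any two
    points whose Euclidean distance is at most max_gap."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Unvisited points close enough to join the current component.
            neighbors = [j for j in unvisited
                         if dist(points[current], points[j]) <= max_gap]
            for j in neighbors:
                unvisited.remove(j)
            component.extend(neighbors)
            frontier.extend(neighbors)
        clusters.append(sorted(component))
    # Sort so the output does not depend on set iteration order.
    return sorted(clusters)


# Hypothetical 1-D toy data with gaps of different sizes.
points = [(0.0,), (1.0,), (2.0,), (5.0,), (6.0,), (12.0,)]

# The same data yields a different clustering for each threshold.
for max_gap in (0.5, 1.5, 3.5, 7.0):
    clusters = threshold_clusters(points, max_gap)
    print(max_gap, len(clusters), clusters)
```

On this data the number of clusters shrinks from six to one as the threshold grows, and none of these results is correct in the abstract. Which one is useful depends on the data set and the intended use, which is why the parameters usually have to be adjusted iteratively.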
For at least these reasons, choosing an optimal clustering algorithm for a large data set can be a challenging decision that requires a deep understanding of both the differences between available clustering methods and the data set being analyzed. A user (particularly a non-expert user) who wishes to perform a cluster analysis on a data set can easily choose a sub-optimal clustering method, which can lead to less desirable results.