Clustering aims at organizing or indexing a large dataset in the form of a collection of patterns or clusters based on similarity. Each cluster corresponds to a subset of the original dataset. From a probabilistic point of view, this means finding the unknown distribution of data vectors in a given dataset, where the distribution pattern in each cluster is a local component of the global data distribution. The output of cluster analysis answers two main questions: how many clusters there are in the dataset, and how they are distributed in the entire data space. Since the correct classification of the data vectors is not known a priori, the goal is simply to find, for the given dataset, the cluster patterns that best extremize a predefined objective function. Application areas for clustering include, for example, data mining and pattern recognition, among others well known in the art.
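By way of illustration only, the search for cluster patterns that extremize a predefined objective function can be sketched as follows. The sketch assumes one common choice of objective, the within-cluster sum of squared distances, minimized by Lloyd's k-means iteration; the text above does not prescribe this objective or this algorithm, and the names `squared_distance` and `k_means` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(
    data: List[Vector],
    initial_centroids: List[Vector],
    max_iterations: int = 100,
) -> Tuple[List[List[float]], List[int], float]:
    """Lloyd's iteration (assumed algorithm, not taken from the text).

    Returns the centroids, the cluster label of each data vector, and the
    objective value (within-cluster sum of squared distances).
    """
    centroids = [list(c) for c in initial_centroids]
    labels = [-1] * len(data)
    for _ in range(max_iterations):
        # Assignment step: attach every vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    objective = sum(
        squared_distance(v, centroids[lab]) for v, lab in zip(data, labels)
    )
    return centroids, labels, objective


# Toy dataset with two visually obvious groups (illustrative values only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, objective = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, round(objective, 4))
```

Note that the number of clusters is fixed here by the number of initial centroids supplied, so this sketch answers only the second of the two questions above (how the clusters are distributed), not the first (how many there are).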
The purpose of vector quantization is to represent a multi-dimensional dataset by a reduced number of codebook vectors that approximate the original dataset with as little information loss as possible for a given range of compression factors. Vector quantization is used, for example, in data compression, and in that context, the same pattern searching algorithm used in cluster analysis is also applicable to designing a codebook as a concise representation of a large dataset.
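As a minimal sketch of how a codebook stands in for a dataset, the following example replaces each data vector by the index of its nearest codebook vector and measures the resulting information loss as mean squared error. The codebook values, the choice of squared Euclidean distance, and the function names `encode`, `decode`, and `mean_distortion` are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(data: List[Vector], codebook: List[Vector]) -> List[int]:
    """Replace each data vector by the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda j: squared_distance(v, codebook[j]))
        for v in data
    ]


def decode(indices: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Reconstruct an approximation of the data from the stored indices."""
    return [list(codebook[i]) for i in indices]


def mean_distortion(data: List[Vector], reconstructed: List[Vector]) -> float:
    """Mean squared error between the original and reconstructed vectors."""
    total = sum(squared_distance(v, r) for v, r in zip(data, reconstructed))
    return total / len(data)


# Illustrative values only: four 2-D vectors and a hand-picked 2-entry codebook.
data = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (9.0, 7.0)]
codebook = [(1.0, 2.0), (9.0, 8.0)]

indices = encode(data, codebook)
reconstructed = decode(indices, codebook)
print(indices, mean_distortion(data, reconstructed))
```

Only the indices and the codebook need to be stored or transmitted; the distortion value quantifies what is lost in exchange for that compression.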
Conventional systems that use clustering and vector quantization typically rely on a predefined input parameter, such as a model complexity parameter, to determine the number of cluster patterns or the number of codebook vectors to be used in the output of a clustering or vector quantization task, respectively. However, the output model will not be suitable for its intended purpose if this input parameter is sub-optimal. For example, a vector quantization task given a sub-optimal model complexity parameter will not minimize the amount of information loss for a given dataset during a data compression process.
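The sensitivity to the predefined parameter can be illustrated with a small one-dimensional sketch. It assumes that the model complexity parameter is simply the number of codebook vectors `k`, that the codebook is trained with a basic Lloyd iteration, and that information loss is measured as mean squared error; the dataset and the helper names `train_codebook` and `distortion` are hypothetical.

```python
from typing import List


def train_codebook(data: List[float], k: int, iterations: int = 20) -> List[float]:
    """Train k scalar codebook entries with a basic Lloyd iteration (assumed)."""
    ordered = sorted(data)
    # Deterministic initialisation: k evenly spaced samples of the sorted data.
    codebook = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        cells: List[List[float]] = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - codebook[j]) ** 2)
            cells[nearest].append(x)
        # Move each entry to the mean of its cell; empty cells stay in place.
        codebook = [
            sum(cell) / len(cell) if cell else codebook[j]
            for j, cell in enumerate(cells)
        ]
    return codebook


def distortion(data: List[float], codebook: List[float]) -> float:
    """Mean squared error when each value is replaced by its nearest entry."""
    return sum(min((x - c) ** 2 for c in codebook) for x in data) / len(data)


# Illustrative dataset with three natural groups, near 1, 4 and 9.
samples = [0.8, 1.0, 1.2, 3.8, 4.0, 4.2, 8.8, 9.0, 9.2]

for k in (1, 2, 3):
    book = train_codebook(samples, k)
    print(k, [round(c, 3) for c in book], round(distortion(samples, book), 4))
```

With this toy data, a parameter smaller than the number of natural groups forces distinct groups to share a codebook entry, and the measured loss is correspondingly larger than when the parameter matches the structure of the data.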