The invention relates to clustering techniques that are generally used to classify input data into groups or clusters without prior knowledge of those clusters. More particularly, the invention relates to methods and apparatus for automatically determining cluster centres. One example of such a clustering technique is the Self-Organizing Map (SOM), originally invented by Teuvo Kohonen. The SOM concept is well documented, and a representative example of an SOM application is disclosed in U.S. Pat. No. 6,260,036.
The current framework under investigation for describing and analyzing a context has a critical component based on the clustering of data. This clustering is expected to appear at every stage of context computation, from the processing of raw input signals to the determination of a higher-order context. Clustering has been studied for many years, and many different approaches to the problem exist. One of the main problems is knowing how many clusters exist in the data. Techniques exist to estimate the number of clusters in a data set; however, these methods either require some form of a priori information or assumptions about the data, or they estimate the number of clusters on the basis of an analysis of the data, which may require storing the data and may be computationally demanding. None of these approaches seems entirely suitable for on-line, unsupervised cluster analysis in a system with limited resources, as would be the case for a context-aware mobile terminal.
Clustering is an important part of any data analysis or information processing problem. The idea is to divide a data set into meaningful subsets so that the points in any one subset are closely related to each other and not to points in other subsets. The definition of ‘related’ may be as simple as the distance between the points. Many different approaches and techniques can be applied to achieve this goal, each with its own assumptions, advantages and disadvantages. One of the best-known methods from the partition-based clustering class is the K-means algorithm, which tries to adaptively position K ‘centres’ so as to minimize the distance between each input data vector and its nearest centre. One of its disadvantages is that the number of centres, K, must be specified before the clustering is attempted. In the case of an unknown data set this may not always be possible. The algorithm can be run several times with different values of K, and the optimum K is then chosen on the basis of some criterion. For an on-line system where the data is not stored, this approach is slow and impractical.
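By way of illustration only, the K-means procedure and the repeated runs over different values of K described above can be sketched as follows. This is a minimal sketch and not part of any disclosed method: the function names `squared_distance` and `kmeans`, the toy data set, and the choice of initial centres at evenly spaced indices (made here purely so that the result is reproducible) are all assumptions introduced for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Batch K-means; returns (centres, total squared error).

    Assumption for reproducibility: the initial centres are the points
    found at evenly spaced indices of `points`, not a random draw.
    """
    centres = [points[i * len(points) // k] for i in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centres[i]))
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group.
        # A centre whose group is empty stays where it was.
        new_centres = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
        if new_centres == centres:
            break  # No centre moved, so the assignment can no longer change.
        centres = new_centres
    error = sum(min(squared_distance(p, c) for c in centres) for p in points)
    return centres, error


# Hypothetical toy data: two tight groups around (0, 0) and (5, 5).
OFFSETS = [(-0.4, -0.2), (0.3, 0.1), (0.0, 0.4), (0.2, -0.3), (-0.1, 0.0)]
data = [(cx + dx, cy + dy) for cx, cy in [(0.0, 0.0), (5.0, 5.0)] for dx, dy in OFFSETS]

# K must be supplied up front, so the whole data set is clustered once per K.
errors = {k: kmeans(data, k)[1] for k in range(1, 5)}
centres_2, _ = kmeans(data, 2)

for k, e in errors.items():
    print(f"K={k}: total squared error = {e:.3f}")
```

In this sketch every candidate K requires the complete data set to be held in memory and scanned repeatedly, and the total squared error on its own does not identify the best K, since adding centres tends to reduce it further. A separate selection criterion is therefore still needed, which illustrates why the approach is poorly suited to an on-line system that does not store its data.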
Thus, a problem associated with the known clustering techniques is that, while it is relatively easy for humans to determine the cluster centres, such a determination is difficult for computers.