The Internet generates vast amounts of data every day. For example, blogs, micro-blogs, social networks such as Twitter and Facebook, and transaction platforms continuously produce new data. Data has penetrated every industry and business function. For some businesses, transaction processes, product usage, and human behavior have all been converted into data.
This generated data can appear vast, disorganized, and not subject to any easily discernible rules, but its overall distribution has characteristics that reflect underlying features of the data. The question of how to mine and process this vast and messy data to obtain useful information is an important research topic in the fields of big data and data mining. Data mining is the process of extracting information and knowledge from large quantities of incomplete, noisy, fuzzy, and/or random real-world application data. While the exact type of knowledge contained in such data may be unknown before the data is mined, the knowledge that can be mined from it is potentially very useful.
An important type of processing in big data mining is clustering. In clustering, a large set of data objects is divided into a series of meaningful subsets, i.e., clusters. Cluster analysis consists of dividing a group of data objects into several categories according to their similarities and differences. The goal of cluster analysis is to maximize the similarity among data belonging to the same category and to minimize the similarity between data in different categories. Cluster analysis can be applied to customer classification, customer background analysis, customer purchasing trend forecasting, market segmentation, and other fields.
Cluster analysis generally produces a grouping of data objects in which similar data objects are placed in the same category. A typical clustering method is k-means clustering. The k-means clustering technique receives as input the number of clusters, k, and a database containing N data objects. The technique partitions these N data objects into k clusters that satisfy the least-squares criterion, i.e., that minimize the sum of squared distances between each object and the center of its cluster. Of the N data objects assigned to the k clusters, data objects in the same cluster have greater similarity with each other, and data objects in different clusters have less similarity with each other. Generally, this type of cluster similarity is computed using a "center object" (the centroid, or center of gravity) obtained from the mean of the data objects in each cluster.
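The least-squares objective referred to above can be written out explicitly. The following is a sketch in conventional notation; the symbols x_i (a data object), C_j (the j-th cluster), and \mu_j (its center object) are assumptions introduced here for illustration and do not appear in the original description.

```latex
% Center object of cluster C_j: the mean of the data objects assigned to it
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i , \qquad j = 1, \dots, k

% Least-squares criterion: total squared distance of the N objects
% to the center object of the cluster each one belongs to
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```

Under this reading, a smaller J corresponds to clusters whose members lie closer to their own center object, which matches the stated goal of high similarity within a cluster.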
The process of implementing the k-means clustering technique specifically comprises:
(1) From N data objects, select any k objects as initial cluster centers.
(2) Using the mean of each cluster's objects (the center object), compute the distance from each object to each center object, and reassign each object to the cluster whose center object is nearest.
(3) Re-compute the mean (center object) of each cluster whose membership has changed.
(4) Compute the criterion function. When a termination condition is met, for example when the function converges, the clustering technique ends. If the condition is not met, the clustering technique returns to step (2).
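Steps (1) through (4) can be illustrated with a short sketch. The code below is a minimal, non-authoritative rendering in plain Python, assuming that data objects are tuples of numbers, that similarity is measured by squared Euclidean distance, and that convergence means the sum of squared errors no longer decreases by more than a tolerance. The names k_means, squared_distance, mean_point, max_iter, tol, and seed are hypothetical and chosen only for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(objects, k, max_iter=100, tol=1e-9, seed=0):
    """Illustrative k-means; returns (centers, labels, sse)."""
    rng = random.Random(seed)
    # Step (1): select any k of the N objects as initial cluster centers.
    centers = rng.sample(objects, k)
    labels = []
    sse = prev_sse = float("inf")
    for _ in range(max_iter):
        # Step (2): assign each object to the nearest center object.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in objects
        ]
        # Step (3): re-compute the mean of each non-empty cluster.
        for j in range(k):
            members = [p for p, lab in zip(objects, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
        # Step (4): evaluate the criterion (sum of squared errors) and
        # stop once it no longer decreases by more than tol.
        sse = sum(
            squared_distance(p, centers[lab])
            for p, lab in zip(objects, labels)
        )
        if prev_sse - sse <= tol:
            break
        prev_sse = sse
    return centers, labels, sse


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels, sse = k_means(data, k=2)
    print(centers, labels, sse)
```

Note that every pass of the loop touches all N objects, and that the whole procedure starts from a fresh selection of centers each time it is called.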
However, conventional clustering analyses suffer from at least two drawbacks. One drawback is that the k-means clustering technique requires the number of input objects, N, to be a fixed value. In situations where N varies, each change during processing, e.g., N increasing by 1 because one new data record has been added, requires that steps (1) through (4), as described above, be re-executed. A second drawback is that, when a large quantity of data is processed, a considerable amount of hardware resources (e.g., memory and processor resources) is needed to execute the clustering process as described above.