1. Field of the Invention
The present invention relates to pattern recognition and data analysis in image processing and statistical processing, and more particularly to an apparatus and a method for data clustering.
2. Description of the Related Art
Recently, demand for keeping or transmitting documents in digitized form rather than in paper form has been increasing. Digitization of a document as referred to herein is not limited to reading a paper document with a device such as a scanner to obtain image data. In document digitization processing, for example, the image data is also separated into the distinct areas that constitute the document, such as character areas, graphic areas, photograph areas, and table areas. These areas are then converted into data in respective optimal formats, such as character codes for the character areas, vector data for the graphic areas, bitmap data for the background areas and photograph areas, and structured data for the table areas.
As a technique of conversion into vector data, Japanese Patent Laid-Open No. 2007-158725 discloses an image processing apparatus. This image processing apparatus divides image data into areas, extracts the outline of each area, and converts the extracted outline into vector data. The area division method used in this image processing apparatus will be described below.
First, image data is divided into areas by Nearest Neighbor clustering. Nearest Neighbor clustering searches for the cluster whose representative feature vector is at the shortest distance from the feature vector of a processing-target sample (e.g., the RGB values of a target pixel). If the shortest distance is below a predetermined threshold, the processing-target sample is assigned to that cluster. Otherwise, a new cluster is defined and the processing-target sample is assigned to the newly defined cluster. In the clustering processing for image data, color information (pixel values of R, G, and B) is generally used as a feature vector. The centroid of a cluster, i.e., the average of the feature vectors (color information) of all samples (i.e., all sampled pixels) belonging to the cluster, is generally used as the representative feature vector of the cluster.
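The Nearest Neighbor clustering described above can be illustrated with a minimal sketch. It is not the implementation of the cited publication; the function name `nearest_neighbor_clustering`, the use of Euclidean distance, and the running-sum representation of each cluster are assumptions made for illustration only.

```python
import math


def nearest_neighbor_clustering(samples, threshold):
    """Assign each sample to its nearest cluster, or define a new cluster.

    samples: sequence of feature vectors, e.g. (R, G, B) tuples.
    threshold: distance below which a sample joins an existing cluster.
    Returns (labels, centroids). Names and data layout are hypothetical.
    """
    clusters = []  # each cluster keeps a running sum and a sample count
    labels = []
    for sample in samples:
        best_index, best_distance = None, math.inf
        for index, cluster in enumerate(clusters):
            # Representative feature vector: centroid of the samples so far.
            centroid = [s / cluster["count"] for s in cluster["sum"]]
            distance = math.dist(sample, centroid)  # Euclidean (assumed)
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index is not None and best_distance < threshold:
            # Shortest distance is below the threshold: join that cluster.
            cluster = clusters[best_index]
            cluster["sum"] = [a + b for a, b in zip(cluster["sum"], sample)]
            cluster["count"] += 1
        else:
            # Otherwise define a new cluster containing only this sample.
            clusters.append({"sum": list(sample), "count": 1})
            best_index = len(clusters) - 1
        labels.append(best_index)
    centroids = [[s / c["count"] for s in c["sum"]] for c in clusters]
    return labels, centroids
```

In this sketch every sample is compared against every existing cluster, so the number of distance calculations grows with the number of clusters.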
Integration processing is then performed for the areas divided by Nearest Neighbor clustering. In this processing, a target value for the number of areas (the target number of clusters) is set, and clusters are integrated until the number of clusters is equal to or less than the target value. Specifically, the distances between the feature vectors of the clusters are calculated, and the two clusters with the shortest distance between them are integrated into one cluster.
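The integration processing can likewise be sketched as follows. This is an illustrative sketch under stated assumptions, not the cited publication's method: the function name `integrate_clusters` is hypothetical, Euclidean distance is assumed, and the merged cluster is assumed to take the sample-count-weighted average of the two centroids so that it remains the average of all samples belonging to it.

```python
import math


def integrate_clusters(centroids, counts, target_count):
    """Merge the two closest clusters until at most target_count remain.

    centroids: representative feature vectors, one per cluster.
    counts: number of samples in each cluster.
    Returns (centroids, counts) after integration. Names are hypothetical.
    """
    centroids = [list(c) for c in centroids]
    counts = list(counts)
    while len(centroids) > target_count and len(centroids) > 1:
        # Find the pair of clusters with the shortest centroid distance.
        best_pair, best_distance = None, math.inf
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                distance = math.dist(centroids[i], centroids[j])
                if distance < best_distance:
                    best_pair, best_distance = (i, j), distance
        i, j = best_pair
        total = counts[i] + counts[j]
        # Merged centroid: weighted average of the two centroids (assumed).
        centroids[i] = [
            (a * counts[i] + b * counts[j]) / total
            for a, b in zip(centroids[i], centroids[j])
        ]
        counts[i] = total
        del centroids[j]
        del counts[j]
    return centroids, counts
```

Because this sketch recalculates every pairwise distance before each merge, its cost rises quickly as the number of clusters entering the integration processing grows.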
The clustering processing can also be applied to applications other than image data. For example, the clustering processing can be applied to data mining, such as discovering groups of users or customers having the same tendencies from a database of a Web access history or from a database of a sales history in a POS (Point of Sales) system by sorting data into groups with similar characteristics.
In assigning data to clusters through the clustering processing, the computation time increases with the number of clusters. This is because, in both the Nearest Neighbor clustering processing and the cluster integration processing, the number of times the distance between feature vectors is calculated increases as the number of clusters increases.
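As a rough illustration of this dependence, assume a straightforward implementation in which each of N samples is compared with every existing cluster, and in which the integration processing recalculates all pairwise distances before each merge. The symbols N, K (the number of clusters produced by Nearest Neighbor clustering), and T (the target number of clusters) are introduced here for illustration only. Under these assumptions, the numbers of distance calculations are bounded as follows:

```latex
% Nearest Neighbor clustering: each sample is compared with at most K clusters.
D_{\mathrm{NN}} \le N \cdot K

% Integration from K clusters down to T clusters, with all pairwise
% distances recalculated before each merge.
D_{\mathrm{int}} = \sum_{k=T+1}^{K} \frac{k(k-1)}{2}
```

Under these assumptions, the first count grows linearly with K, and the second grows on the order of the cube of K when T is much smaller than K.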
As a conventional technique for addressing the above problem, Japanese Patent Laid-Open No. 8-30787 discloses an image area dividing method and an image area integrating method. In this conventional technique, image data is divided into rectangles, and each rectangle is divided into areas through the clustering processing. In order to prevent unnatural area division at the boundaries of the rectangles, the clustering processing is performed twice or more around the boundaries of the rectangles, and then integration processing is performed. According to this conventional technique, performing the clustering processing in parallel for each rectangle enables faster area division.
Besides image data, the clustering processing is applied to other forms of data in order to group similar data together for purposes such as data analysis.
However, in the parallel processing according to the conventional technique, the clustering processing is performed for each divided rectangle, and therefore the number of clusters increases with the number of divided rectangles. Further, in the conventional technique, the cluster integration processing is performed only after all of the clustering processing is finished. The more finely the data is divided for the clustering processing, the more clusters are generated that must be subjected to the integration processing. This disadvantageously increases the processing time required for the cluster integration processing.