1. Field of the Invention
The present invention relates to a clustering processing method, a clustering processing apparatus and a computer program for dividing a sample group.
2. Description of the Related Art
Document digitization goes beyond merely obtaining image data by reading a paper document using a scanner or the like. For example, document digitization processing divides image data into the regions of differing nature that constitute a document, such as characters, graphics, photos and tables, and converts each region to data in the format best suited to the intended purpose: character regions to character codes, graphics regions to vector data, background and photo regions to bitmap data, and table regions to structure data.
As for conversion methods to vector data, in Japanese Patent Laid-Open No. 2007-158725, region division is performed using clustering processing, the contours of the regions are extracted, and the extracted contours are converted to vector data. Japanese Patent Laid-Open No. 2008-206073 discloses an image processing method that involves separating an image into background and foreground, converting the foreground to vector data, and performing data compression on the background with a dedicated background method. Japanese Patent Laid-Open No. 2006-344069 discloses a method for removing noise clusters remaining after clustering processing has been performed on an original read with a scanner.
Incidentally, a nearest neighbor clustering method is known as a method for dividing an image into regions using clustering processing. The nearest neighbor clustering method involves searching for the cluster having the nearest feature vector in feature space by comparing the feature vector of a target pixel with the representative feature vector of each cluster. If the distance to that nearest cluster is less than or equal to a prescribed threshold, the target pixel is allocated to that cluster. If not, a new cluster is defined and the target pixel is allocated to the new cluster. Note that color information (RGB pixel values) is generally used here as the elements of feature vectors (feature amounts). The centroid of a cluster, that is, the average value of the feature vectors (color information) of the pixels allocated to the cluster, is generally used as the representative feature vector of the cluster.
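The nearest neighbor clustering method described above can be illustrated with the following minimal sketch. It is not taken from any of the cited documents. The function name nearest_neighbor_clustering, the use of Euclidean distance in RGB space, and the processing of pixels in list order are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple


def nearest_neighbor_clustering(
    pixels: Sequence[Tuple[float, float, float]],
    threshold: float,
) -> Tuple[List[int], List[Tuple[float, ...]]]:
    """Assign each pixel to the nearest cluster, or start a new cluster.

    Hypothetical helper: distance is assumed to be Euclidean in RGB space,
    and the representative feature vector is the running centroid.
    """
    sums: List[List[float]] = []  # per-cluster sum of feature vectors
    counts: List[int] = []        # per-cluster number of allocated pixels
    labels: List[int] = []        # cluster index chosen for each pixel

    for p in pixels:
        best, best_d2 = -1, float("inf")
        # Compare the target pixel with the centroid of every cluster.
        for k, n in enumerate(counts):
            centroid = [s / n for s in sums[k]]
            d2 = sum((a - b) ** 2 for a, b in zip(p, centroid))
            if d2 < best_d2:
                best, best_d2 = k, d2

        if best >= 0 and best_d2 <= threshold ** 2:
            # Near enough: allocate to the existing cluster and update it.
            sums[best] = [s + a for s, a in zip(sums[best], p)]
            counts[best] += 1
        else:
            # Too far from every cluster (or none exist): define a new one.
            sums.append([float(a) for a in p])
            counts.append(1)
            best = len(counts) - 1
        labels.append(best)

    centroids = [tuple(s / n for s in sm) for sm, n in zip(sums, counts)]
    return labels, centroids
```

Note that the inner loop visits every existing cluster for every pixel, so the number of distance calculations grows with both the number of pixels and the number of clusters.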
With the nearest neighbor clustering method, the distances to the representative feature vectors of all of the clusters must be computed for each pixel. To reduce the number of calculations, a color image processing apparatus has been disclosed in Japanese Patent Laid-Open No. 11-288465, for example. With this conventional technique, clustering is performed based on the feature vectors (color information) of the target pixel and neighboring pixels, and cluster grouping is then performed based on the color information and geometric information of the clusters. Here, geometric information refers to coordinate information or the like representing the distance between clusters in real space.
However, with the conventional technique in Japanese Patent Laid-Open No. 11-288465, a new cluster is defined and the target pixel is allocated to that cluster whenever the feature vectors of the target pixel and neighboring pixels are far apart, so a large number of clusters are defined. Thus, there is a problem in that the processing time required for grouping increases.