The present invention relates to a hierarchical clustering technique, and in particular to an information processing apparatus, method and program for determining a weight of each feature which makes it possible to hierarchically cluster content expressed as a combination of physical features so that the degree of subjective similarity is reflected.
There is a demand for clustering multimedia content, such as voices, images, sentences and websites, so that degrees of subjective similarity among pieces of emotional content that a person feels from the content can be reflected. Here, the emotional content means not only feelings a person clearly expresses, such as anger and pleasure, but also content that can be felt by a person but cannot necessarily be classified in words, including subtle mental attitudes. Whether pieces of emotional content are subjectively similar or different depends on the degree of subtlety that a person on the receiving side is ready to distinguish. Therefore, in order to satisfy the above demand, it is desirable to use hierarchical clustering in which the number of clusters is not determined beforehand. In general, multimedia content is expressed by a combination of physical features. However, not all physical features necessarily have equal importance. Therefore, it is necessary to learn the weight of each physical feature so that the degree of subjective similarity is reflected in a result of clustering.
A conventional clustering technique in which the degree of subjective similarity from a viewpoint of a user is reflected is described in an article by Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, Stuart Russell, entitled “Distance metric learning, with application to clustering with side information”, In Advances in Neural Information Processing Systems 15, Vol. 15 (2002), pp. 505-512. In constrained clustering disclosed in the aforementioned article, a pair to be necessarily included in the same cluster (ML: must-link) and a pair to be necessarily included in different clusters (CL: cannot-link) are used as training data. As shown in FIG. 1(a), at the time of learning, training data (ML/CL pairs) 100 is inputted to a supervised clustering section 110, and a supervised clustering algorithm is adjusted to satisfy constraints of the ML/CL pairs. At the time of operation, test data 105 is inputted to the supervised clustering section 110, and a clustering result 115 is acquired with the use of the adjusted algorithm.
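The role of ML/CL pairs as training data can be illustrated with a minimal sketch that counts how many constraints a given flat clustering violates. This is not the algorithm of Xing, et al; the function name `count_violations`, the item names and the cluster labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def count_violations(labels: Dict[str, int],
                     must_link: List[Pair],
                     cannot_link: List[Pair]) -> int:
    """Count the ML/CL constraints that a flat clustering violates."""
    violated = 0
    for a, b in must_link:
        # A must-link pair is violated when its items fall in different clusters.
        if labels[a] != labels[b]:
            violated += 1
    for a, b in cannot_link:
        # A cannot-link pair is violated when its items share a cluster.
        if labels[a] == labels[b]:
            violated += 1
    return violated


# Hypothetical clustering result: item name -> cluster label.
labels = {"v1": 0, "v2": 0, "v3": 1, "v4": 1}
must_link = [("v1", "v2"), ("v2", "v3")]
cannot_link = [("v1", "v4")]

violations = count_violations(labels, must_link, cannot_link)
print(violations)
```

A supervised clustering section of the kind shown in FIG. 1(a) would adjust its algorithm so that this count becomes small; the adjustment itself is outside the scope of the sketch.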
As described above, the constrained clustering technique of Xing, et al requires preparation of ML/CL-type constraints as training data. However, whether or not to classify elements of a certain data pair into the same cluster depends on the number of classification clusters. For example, even if a data pair should be ML in the case of classification into four clusters, it may be appropriate for the same data pair to be CL in the case of more detailed classification into eight clusters. Therefore, training data of the ML/CL type cannot be created unless the number of classification clusters is determined beforehand, and the technique cannot be applied to hierarchical clustering in which the number of clusters is not determined beforehand.
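The dependence of an ML/CL label on the number of clusters can be reproduced with a small sketch. It uses a simple single-linkage agglomerative procedure on one-dimensional points, stopped at a chosen number of clusters; the procedure, the function name `single_link_clusters` and the data are assumptions made for illustration, and two and three clusters stand in for the four and eight clusters of the example above.

```python
from typing import List


def single_link_clusters(points: List[float], k: int) -> List[int]:
    """Agglomerative single-linkage clustering of 1-D points, stopped at k clusters.

    Returns one cluster label per point.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for p in members:
            labels[p] = label
    return labels


points = [0.0, 1.0, 4.0, 4.5]  # hypothetical one-dimensional feature values
coarse = single_link_clusters(points, k=2)
fine = single_link_clusters(points, k=3)

# The first two points share a cluster at the coarse level only.
same_when_coarse = coarse[0] == coarse[1]
same_when_fine = fine[0] == fine[1]
print(same_when_coarse, same_when_fine)
```

The same pair would therefore have to be labelled ML for the coarse classification and CL for the fine one, which is why such labels cannot be fixed before the number of clusters is known.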
Another conventional clustering technique in which the degree of subjective similarity from a viewpoint of a user is reflected is described in an article by Matthew Schultz, Torsten Joachims, entitled “Learning a distance metric from relative comparisons”, In Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, Mass., (2004). In semi-supervised clustering disclosed by Schultz, et al, for each set (X, A and B) of training data, a user specifies which of A and B is closer to X (hereinafter, such training data is referred to as XAB-type similarity data). As shown in FIG. 1(b), at the time of learning, training data (XAB-type similarity data) 120 including the user's specification is inputted to a supervised weight learning section 125, and a weight 130 of each physical feature is determined so that a relationship indicated by the training data 120 is satisfied. At the time of operation, test data 135 is inputted to an unsupervised clustering section 140, unsupervised clustering is performed with the use of the weight 130 of each physical feature, and a clustering result 145 is acquired.
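How a set of feature weights can be checked against XAB-type similarity data may be sketched as follows. The sketch only evaluates candidate weights under an assumed weighted squared Euclidean distance; it does not implement the learning procedure of Schultz, et al, and the names `weighted_sq_dist` and `satisfied_fraction` and all data values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# (X, A, B, closer): the user states that X is closer to "A" or to "B".
Triplet = Tuple[Vector, Vector, Vector, str]


def weighted_sq_dist(u: Vector, v: Vector, w: Vector) -> float:
    """Weighted squared Euclidean distance between two feature vectors."""
    return sum(wi * (ui - vi) ** 2 for ui, vi, wi in zip(u, v, w))


def satisfied_fraction(triplets: List[Triplet], w: Vector) -> float:
    """Fraction of XAB-type relations reproduced by the weights w."""
    ok = 0
    for x, a, b, closer in triplets:
        da = weighted_sq_dist(x, a, w)
        db = weighted_sq_dist(x, b, w)
        predicted = "A" if da < db else "B"
        if predicted == closer:
            ok += 1
    return ok / len(triplets)


# Hypothetical two-feature data in which the user judges by the first feature.
triplets = [
    ((0.0, 0.0), (1.0, 5.0), (3.0, 0.0), "A"),
    ((0.0, 0.0), (0.0, 4.0), (2.0, 1.0), "A"),
]

equal_weights = satisfied_fraction(triplets, (1.0, 1.0))
first_feature_only = satisfied_fraction(triplets, (1.0, 0.0))
print(equal_weights, first_feature_only)
```

In this toy data, equal weights reproduce none of the user's judgments, whereas placing all weight on the first feature reproduces both, which is the kind of relationship a weight learning section would aim to satisfy.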
In contrast to the technique of Xing, et al, in the semi-supervised clustering disclosed by Schultz, et al, it is sufficient to prepare training data indicating which of A and B is closer to X, and, therefore, the training data can be created even if the number of classification clusters is not determined first. The training data, however, has a problem that about ⅓ of the training data is invalid for evaluating a clustering result. For example, assume that, as a result of hierarchical clustering of three pieces of content, X, A and B, A and B are combined first before combination with X as shown in FIG. 3(a). Then, which of A and B is closer to X cannot be judged from the clustering result, and, therefore, it is not possible to evaluate the clustering result using the training data. Although the weights of features could in principle be learned by increasing the amount of training data, the learning of the weights proceeds in the direction of increasing invalid data because doing so results in higher scores. Consequently, the learning algorithm must be specially designed, and complicated processing is required.
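The situation in which XAB-type data cannot be evaluated can be shown with a short sketch. It assumes agglomerative clustering of exactly three items, in which the closest pair is merged first; the function names `pairwise` and `closer_in_dendrogram` and the distance values are hypothetical.

```python
from typing import Dict, FrozenSet, Optional


def pairwise(xa: float, xb: float, ab: float) -> Dict[FrozenSet[str], float]:
    """Build the three pairwise distances among items X, A and B."""
    return {
        frozenset({"X", "A"}): xa,
        frozenset({"X", "B"}): xb,
        frozenset({"A", "B"}): ab,
    }


def closer_in_dendrogram(dist: Dict[FrozenSet[str], float]) -> Optional[str]:
    """Read from the merge order which of A and B is closer to X.

    Returns "A" or "B" when X is merged with that item first, and None
    when A and B are merged first, in which case the hierarchy gives
    no answer.
    """
    first_merge = min(dist, key=dist.get)  # closest pair is merged first
    if first_merge == frozenset({"A", "B"}):
        return None
    return "A" if "A" in first_merge else "B"


informative = closer_in_dendrogram(pairwise(xa=1.0, xb=3.0, ab=2.5))
uninformative = closer_in_dendrogram(pairwise(xa=2.0, xb=3.0, ab=1.0))
print(informative, uninformative)
```

With three items there are three candidate first merges, and one of them, the pair (A, B), yields no answer; this is one plausible reading of why roughly one third of the training data is described above as invalid.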
Other conventional techniques found in a prior-art search for the present invention will be described below.
In JP2007-334388A, problems to be solved are to make it possible to put together documents which a person feels to be similar to one another into the same cluster with high accuracy and to obtain a clustering result in which a user's intention is reflected. JP2007-334388A discloses a clustering method in which common words appearing in documents in multiple clusters specified by a user are acquired; among the common words, those whose frequency of appearance in the clusters specified by the user is relatively high in comparison with the frequency of appearance in clusters which have not been specified by the user are selected; the selected common words are recorded in keyword storage means as keywords; and, at the time of clustering the same or another set of documents, the clustering is performed with the influence of the keywords recorded in the keyword storage means emphasized.
In the technique of JP2007-334388A, when clustering is performed multiple times, feedback is given by a user about correct pairs and wrong pairs in the previous clustering result. However, since this feedback amounts to ML/CL-type training data, the technique of JP2007-334388A cannot be applied to hierarchical clustering in which the number of clusters is not determined first, for the same reason as described with regard to Xing, et al.
JP2006-127446A discloses an image processing apparatus distinguishing image information by a classifier learned on the basis of training data, the image processing apparatus including: feature extracting means extracting features from image information; combined feature calculating means calculating a combined feature, which is a combination of the features extracted by the feature extracting means; learning means learning the classifier by the features calculated by the combined feature calculating means and the features extracted by the feature extracting means; collation means applying training data to the classifier learned by the learning means to collate a discrimination result with an ideal classification result given from the outside; and optimization means changing a method for combination of features by the combined feature calculating means on the basis of a result of the collation means.
In JP2006-127446A, a k-means method and a k-nearest neighbor method are given as clustering methods. That is, the technique of JP2006-127446A is to be applied to a nonhierarchical clustering method, and it is not possible to apply the technique of JP2006-127446A to hierarchical clustering in which the number of clusters is not determined first.
JP07-121709A discloses a pattern identification apparatus including: means for referring to an identification space prepared in advance to perform pattern identification of a sample pattern by a nearest neighbor method; means for determining the confidence of identification on the basis of an identification distance sequence obtained by the pattern identification; and means for judging whether or not the identification space referred to is a good identification space for the identification of the sample pattern. JP07-121709A further discloses means for preparing an identification space for a category which the sample pattern may come under; control means for, when receiving an identification result given by the identification means on the basis of the confidence of identification obtained by referring to an identification space prepared in advance and performing pattern identification of an already-known sample pattern, and the judgment result indicating that the identification space is not a good identification space, controlling the preparing means to cause a new identification space using features different from those of the identification space prepared in advance to be prepared for the category of the already-known sample pattern; and means for accumulating the identification space prepared in advance and the identification space prepared newly, the identification spaces being hierarchically associated with each other.
In the technique of JP07-121709A, by repeatedly continuing a pattern recognition process for a category which cannot be recognized by pattern recognition, a hierarchical structure can be obtained as a result. Such a hierarchical structure, however, does not indicate degrees of similarity among data. In addition, the clustering disclosed in JP07-121709A is clustering in which the number of clusters is determined first. As described above, even if the technique of JP07-121709A is used, it is not possible to satisfy the demand for clustering multimedia content so that degrees of subjective similarity among pieces of emotional content that a person feels from the content can be reflected.
In JP2002-183171A, a problem to be solved is to provide a document clustering system capable of classifying document data into the number of clusters according to clustering targets. JP2002-183171A discloses a document clustering system performing singular value decomposition of a set of feature vectors of documents created by feature vector creating means 103; creating a document similarity vector 108 for calculating degrees of similarity among documents from a singular value decomposition result 106; using the document similarity vector for a target document to calculate a distance between the document and the cluster centroid, by cluster creating means 110; increasing the number of dimensions of the document similarity vector used for the first classification to further perform the second classification of the same target document; comparing results of both classifications and setting clusters with little change as stable clusters; excluding documents of the stable clusters from targets and selecting target documents for the next classification by the cluster creating means, by data selecting means 109; and repeating this trial.
In the technique of JP2002-183171A, each of two kinds of feature vectors is used to perform clustering, and clusters that are obtained from both of the results are adopted as stable clusters. Consequently, training data is not used. Therefore, in the technique of JP2002-183171A, it is not possible to learn the weights of features so that clustering reflecting the degrees of subjective similarity among pieces of emotional content that a person feels can be performed.
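The adoption of clusters found in both results can be sketched as a set intersection. This is a simplification assumed for illustration: it treats a cluster as stable only when exactly the same members appear in both clusterings, whereas JP2002-183171A speaks of clusters with little change. The function name `stable_clusters` and the document identifiers are hypothetical.

```python
from typing import FrozenSet, List, Set

Cluster = FrozenSet[str]


def stable_clusters(first: List[Cluster], second: List[Cluster]) -> Set[Cluster]:
    """Return the clusters that appear unchanged in both clustering results."""
    return set(first) & set(second)


# Hypothetical results of clustering the same documents with two feature vectors.
first = [frozenset({"d1", "d2"}), frozenset({"d3", "d4", "d5"}), frozenset({"d6"})]
second = [frozenset({"d1", "d2"}), frozenset({"d3", "d4"}), frozenset({"d5", "d6"})]

stable = stable_clusters(first, second)
print(stable)
```

No user judgment enters this comparison, which is consistent with the observation above that the technique provides no means for learning feature weights from subjective similarity.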
Mikhail Bilenko, Sugato Basu, Raymond J. Mooney, in an article entitled “Integrating Constraints and Metric Learning in Semi-Supervised Clustering”, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, July, pp. 81-88 disclose a semi-supervised clustering method in which the conventional constraint-based method and distance-based method (distance-function learning method) are integrated.
In the technique of Bilenko, et al, both ML/CL-type constraint data and XAB-type relative-similarity data are used as training data. Therefore, the technique of Bilenko, et al suffers from both the problem described in relation to the technique of Xing, et al and the problem described in relation to the technique of Schultz, et al.