As automatic data classification techniques, a method of recognizing and classifying data using prior learning data and a method of classifying data without using prior learning data are conventionally known. Both methods are realized by extracting feature values of a plurality of dimensions from data and conducting feature value comparison.
As one example of prior learning, there is a method (for instance, Bayesian estimation) of computing a probability distribution from distribution information of learning data for each classification group, in order to determine which classification group input data belongs to. As another example, there is a method (for instance, a Gaussian mixture model) of approximating distribution information of learning data to a mixture of a plurality of Gaussian distributions, in order to determine which classification group input data belongs to. As still another example, there is a method (for instance, a support vector machine) of setting boundaries between classification groups from distribution information of learning data, in order to determine which classification group input data belongs to. In these methods, learning data needs to be manually prepared before implementing automatic classification, which requires complex registration operations.
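For concreteness, the per-class distribution estimation described above can be sketched as follows. The example assumes one-dimensional feature values and class-conditional Gaussian distributions estimated from manually prepared learning data; all names and the toy data are illustrative, not part of any conventional apparatus.

```python
import math

def fit_gaussians(learning_data):
    """Estimate a Gaussian per classification group from labeled learning data.

    learning_data: dict mapping class label -> list of 1-D feature values.
    Returns dict mapping label -> (mean, variance).
    """
    params = {}
    for label, values in learning_data.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        params[label] = (mean, max(var, 1e-9))  # guard against zero variance
    return params

def classify(x, params):
    """Assign x to the class whose estimated Gaussian gives the highest density."""
    def density(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return max(params, key=lambda label: density(x, *params[label]))

# Manually prepared learning data for two classification groups.
data = {"A": [1.0, 1.2, 0.8], "B": [5.0, 5.3, 4.7]}
p = fit_gaussians(data)
print(classify(1.1, p))  # -> A
print(classify(4.9, p))  # -> B
```

The need to prepare the labeled `data` dictionary by hand before any classification can run is exactly the registration burden the text describes.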
On the other hand, as the method of automatically classifying data without using prior learning data, various clustering methods are known. Clustering is a technique of classifying data on the basis of density of distribution of the data itself. Specific examples of the clustering methods include k-means clustering that specifies the number of classes beforehand to perform classification, and a self-organizing map (SOM) which is a neural network that autonomously acquires classification ability according to similarity of an input pattern group.
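A minimal sketch of k-means clustering, which classifies data purely on the basis of the distribution of the data itself, with no prior learning data. The naive initialization (the first k points) is an assumption of the sketch; practical implementations choose starting centroids more carefully.

```python
def kmeans(points, k, iterations=20):
    """Classify 2-D points into k clusters with no prior learning data.

    Alternates an assignment step (each point joins its nearest centroid)
    and an update step (each centroid moves to the mean of its group).
    """
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    centroids = list(points[:k])  # naive initialization, for illustration only
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                     if g else centroids[i] for i, g in enumerate(groups)]
    return centroids, groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, groups = kmeans(points, 2)  # the number of classes is fixed beforehand
```

Note that the number of classes k must be specified beforehand, as stated above.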
Moreover, learning and classification of sequential processing type are often demanded in automatic classification of data. As an example, the Linde-Buzo-Gray (LBG) algorithm based on the k-means method is known. For instance, the LBG algorithm is applied to vector quantization, which adaptively encodes each vector by the representative vector that best represents it, for information compression of an audio signal or an image signal. In practice, however, the LBG algorithm finds the representative vectors by repeatedly processing the whole data set. Accordingly, even though it supports sequential processing, it has a problem of requiring a considerable amount of processing time. In general, classification accuracy is in a tradeoff relation with the classification result updating speed in sequential processing.
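A sketch of LBG codebook construction under the same toy assumptions. The repeated full passes over the data in the refinement loop illustrate why a considerable amount of processing time is required; the split perturbation `eps` and the data are illustrative.

```python
def lbg(points, codebook_size, eps=0.01, iterations=10):
    """Grow a codebook of representative vectors by repeated splitting.

    Starts from the mean of all data, doubles the codebook by perturbing
    each representative vector, then refines with k-means-style passes.
    """
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    codebook = [tuple(sum(c) / len(points) for c in zip(*points))]
    while len(codebook) < codebook_size:
        # Split: perturb every representative vector in both directions.
        codebook = ([tuple(x + eps for x in c) for c in codebook] +
                    [tuple(x - eps for x in c) for c in codebook])
        # Refine: repeated nearest-vector assignment and re-averaging
        # over the whole data set (the costly part).
        for _ in range(iterations):
            groups = [[] for _ in codebook]
            for p in points:
                groups[min(range(len(codebook)),
                           key=lambda i: dist2(p, codebook[i]))].append(p)
            codebook = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cb
                        for g, cb in zip(groups, codebook)]
    return codebook

def quantize(p, codebook):
    """Vector quantization: encode p as the index of its representative vector."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, codebook[i])))

signal = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
codebook = lbg(signal, 2)
```

Each sample is then transmitted as the short index returned by `quantize` instead of the full vector, which is the compression effect mentioned above.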
The following describes an example of a structure and processing when actually employing such an automatic classification technique, with reference to FIGS. 22 and 23. FIG. 22 is a block diagram of a data processing apparatus 1000 that performs automatic classification and records the result of the automatic classification in a temporary storage unit. In detail, the data processing apparatus 1000 shown in FIG. 22 includes a feature extraction unit 1100, an automatic classification processing unit 1200, a cluster-element correspondence table updating and recording unit 1300, and a temporary storage unit 1400.
The feature extraction unit 1100 performs, upon input of newly added element data (hereafter also referred to as “additional element”), feature extraction in order to compute coordinates of the additional element on a feature space. For instance, in the case of face image classification, a Gabor wavelet feature value or the like representing a feature value of a face is used. Information about the additional element and the feature value are recorded and managed in the temporary storage unit 1400 so that their correspondence relation is clear.
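Since Gabor wavelet extraction itself is beyond a short example, the following toy extractor merely illustrates computing coordinates of an additional element on a feature space and recording the element-feature correspondence. The two-component feature and the in-memory dictionary standing in for the temporary storage unit 1400 are assumptions of the sketch.

```python
# Hypothetical stand-in for the temporary storage unit 1400:
# element id -> coordinates on the feature space.
feature_store = {}

def extract_features(element_id, samples):
    """Toy feature extractor (standing in for e.g. a Gabor wavelet front end).

    Maps raw data to coordinates on a 2-D feature space (mean, spread) and
    records the element/feature correspondence so it can be looked up later.
    """
    mean = sum(samples) / len(samples)
    spread = max(samples) - min(samples)
    coords = (mean, spread)
    feature_store[element_id] = coords
    return coords

extract_features("face_001", [3, 5, 7])  # -> (5.0, 4)
```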
The automatic classification processing unit 1200 reads, from the temporary storage unit 1400, a classification boundary condition of each cluster obtained as a result of past classification and coordinate information of all element data belonging to a neighboring cluster on the feature space, when the feature value of the additional element is computed. The automatic classification processing unit 1200 determines which cluster the additional element belongs to. The automatic classification processing unit 1200 then sends information of the additional element (update target element) and information of the cluster (belonging cluster) to which the additional element belongs, to the cluster-element correspondence table updating and recording unit 1300.
After this, the automatic classification processing unit 1200 modifies past classification results according to the addition of the additional element. The automatic classification processing unit 1200 records the modified classification boundary condition of the cluster and the coordinate data of all element data including the coordinates of the additional element, in the temporary storage unit 1400 by one operation. An example of a detailed structure and processing of the automatic classification processing unit 1200 will be described later.
The cluster-element correspondence table updating and recording unit 1300 reads a past cluster-element correspondence table stored in the temporary storage unit 1400, updates the cluster-element correspondence table for the changed part, and records the updated correspondence table in the temporary storage unit 1400.
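The partial update performed by the cluster-element correspondence table updating and recording unit can be sketched with a plain mapping from element to cluster, where only the changed entry is touched; the table layout is an assumption of the sketch.

```python
def update_correspondence(table, element_id, cluster_id):
    """Update only the changed part of the cluster-element correspondence
    table; untouched entries are left as they are."""
    previous = table.get(element_id)
    table[element_id] = cluster_id
    return previous  # the element's former cluster, if any

# Past correspondence table read from storage.
table = {"e1": "c1", "e2": "c1"}
update_correspondence(table, "e3", "c2")  # additional element
update_correspondence(table, "e2", "c2")  # element moved by reclassification
```

The updated `table` would then be recorded back in the temporary storage unit.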
FIG. 23 is a diagram showing an example of a detailed structure and processing of the automatic classification processing unit 1200. The automatic classification processing unit 1200 shown in FIG. 23 includes a belonging cluster determination unit 1210, a neighboring cluster reclassification unit 1220, a classification boundary condition reading unit 1240, and a classification boundary condition updating and recording unit 1230.
The belonging cluster determination unit 1210 reads the past classification boundary condition of each cluster from the temporary storage unit 1400 through the classification boundary condition reading unit 1240, upon input of the additional element. The belonging cluster determination unit 1210 performs matching in order to determine how close the additional element is to each cluster. As one example, the above-mentioned LBG algorithm based on the k-means method that sequentially performs automatic classification of data without using prior learning data is used for matching. As another example, a hierarchical automatic classification technique or a support vector machine (SVM) capable of sequential processing may be used. For instance, in the SVM, the classification boundary condition is a function indicating a classification boundary surface between clusters. In the hierarchical automatic classification technique, the classification boundary condition is a branch condition at each hierarchical level and each node. Alternatively, as in a Gaussian mixture model (GMM) using prior learning data, each cluster may have a probability density function distributed on the feature space. That is, the classification boundary condition may be any information, so long as it shows a condition for determining which cluster new element data belongs to.
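A minimal sketch of the matching performed by the belonging cluster determination unit, with the classification boundary condition modeled as a representative point per cluster; as noted above, a boundary surface, a branch condition, or a probability density function could equally serve as the condition, so this choice is an assumption of the sketch.

```python
def belonging_cluster(additional, boundary_conditions):
    """Match the additional element against each cluster's classification
    boundary condition, modeled here simply as a representative point.

    Returns the closest cluster id and the distance to every cluster,
    i.e. how close the additional element is to each cluster.
    """
    distances = {cid: sum((a - b) ** 2 for a, b in zip(additional, rep)) ** 0.5
                 for cid, rep in boundary_conditions.items()}
    return min(distances, key=distances.get), distances

# Boundary conditions read from the temporary storage unit (illustrative).
conditions = {"c1": (0.0, 0.0), "c2": (10.0, 10.0)}
cid, distances = belonging_cluster((1.0, 1.0), conditions)  # cid -> "c1"
```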
The neighboring cluster reclassification unit 1220 receives the coordinates of the additional element on the feature space and a matching result of the additional element obtained by the belonging cluster determination unit 1210, and extracts the neighboring cluster of the additional element. A cluster is determined as the neighboring cluster when a distance from the additional element to the cluster is smaller than an arbitrary distance index set beforehand. The neighboring cluster reclassification unit 1220 reads all element data belonging to the neighboring cluster from the temporary storage unit 1400, and performs reclassification together with the additional element.
The classification boundary condition updating and recording unit 1230 updates the classification boundary condition of the neighboring cluster and the classification boundary conditions of the existing clusters, on the basis of information of the cluster to which each piece of element data belongs as a result of reclassification and the coordinates of each piece of element data read from the temporary storage unit 1400. The classification boundary condition updating and recording unit 1230 records the updated classification boundary conditions in the temporary storage unit 1400. Moreover, for the element data subject to modification as a result of reclassification, the classification boundary condition updating and recording unit 1230 sends information about the element data and the eventual belonging cluster, to the cluster-element correspondence table updating and recording unit 1300.
Note that, in the case where the neighboring cluster reclassification unit 1220 determines that a distance from the additional element to each cluster is larger than the preset distance index, the neighboring cluster reclassification unit 1220 generates a new cluster to which the element data belongs, and the classification boundary condition updating and recording unit 1230 performs the classification boundary condition update in the same way as above.
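The behavior described in the preceding paragraphs, that is, extraction of neighboring clusters by a preset distance index, reclassification of the neighborhood together with the additional element, and generation of a new cluster when no cluster is close enough, can be sketched as follows. The centroid-based distance and the naive cluster id scheme are assumptions of the sketch.

```python
def add_element(element, clusters, threshold):
    """clusters: cluster id -> list of feature-space coordinates.

    A cluster is "neighboring" when the new element lies within `threshold`
    of its centroid. Pools the neighboring clusters' members with the new
    element and reassigns each point to its nearest neighboring centroid;
    opens a fresh cluster when no existing cluster is close enough.
    """
    def centroid(g):
        return tuple(sum(c) / len(g) for c in zip(*g))

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    cents = {cid: centroid(g) for cid, g in clusters.items()}
    neighbors = [cid for cid in clusters if dist(element, cents[cid]) <= threshold]
    if not neighbors:
        # No cluster within the distance index: generate a new cluster.
        clusters["c%d" % (len(clusters) + 1)] = [element]  # naive id scheme
        return
    # Reclassify the neighborhood together with the additional element.
    pool = [element] + [p for cid in neighbors for p in clusters[cid]]
    for cid in neighbors:
        clusters[cid] = []
    for p in pool:
        clusters[min(neighbors, key=lambda cid: dist(p, cents[cid]))].append(p)

clusters = {"c1": [(0, 0), (0, 1)], "c2": [(10, 10)]}
add_element((0.5, 0.5), clusters, threshold=3.0)  # joins the c1 neighborhood
add_element((20, 20), clusters, threshold=3.0)    # too far: new cluster
```

After such a call, the updated boundary conditions and the correspondence table would be recorded in the temporary storage unit as described above.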
The temporary storage unit 1400 is a hard disk, an optical disc, a semiconductor memory, or the like capable of temporarily storing data.
According to such a structure, even when data is sequentially added, the automatic classification result of the newly added data can be reflected while holding past automatic classification results.
Note that, since such an automatic data classification technique employs a statistical approach, a classification result of 100% accuracy cannot normally be obtained; the result is merely a probabilistic estimate. This raises a need to interpret the obtained result appropriately depending on the application. There is also a system structure based on the premise that the result of automatic classification is manually corrected by the user. In such a system, automatic data classification serves as “assistance when the user manually classifies a large amount of data”.
For example, in the case of face image classification, U.S. Pat. No. 7,274,822 and U.S. Pat. No. 7,403,642 describe automatic classification techniques and user interfaces for accurate, efficient annotation (manual classification correction by the user) of face photographs. FIGS. 24A to 24D show examples of annotation.
In FIGS. 24A to 24D, element data subject to classification is indicated by a black spot, and a classification result is indicated by a line. Hereafter, a unit of classification result is referred to as a cluster. Specific examples of annotation include: a splitting operation of splitting one cluster obtained as a result of classification into two (FIG. 24A); a merging operation of merging two clusters into one (FIG. 24B); a removal operation of removing arbitrary element data from one cluster so as to be independent (FIG. 24C); and a metadata assigning operation of assigning a name or information to an entire cluster (FIG. 24D).
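The four annotation operations of FIGS. 24A to 24D can be sketched as operations on a cluster-to-elements table; the naming schemes used for newly created cluster ids are hypothetical.

```python
metadata = {}

def split_cluster(table, cid, members_a):
    """FIG. 24A: split one cluster into two; members_a stay, the rest move."""
    rest = [e for e in table[cid] if e not in members_a]
    table[cid] = list(members_a)
    new_id = cid + "_b"  # hypothetical id scheme
    table[new_id] = rest
    return new_id

def merge_clusters(table, cid_a, cid_b):
    """FIG. 24B: merge two clusters into one."""
    table[cid_a] = table[cid_a] + table.pop(cid_b)

def remove_element(table, cid, element):
    """FIG. 24C: remove one element from a cluster so as to be independent."""
    table[cid].remove(element)
    table[element + "_solo"] = [element]  # hypothetical id scheme

def assign_metadata(cid, name):
    """FIG. 24D: assign a name or information to an entire cluster."""
    metadata[cid] = name

table = {"c1": ["e1", "e2", "e3"]}
new_id = split_cluster(table, "c1", ["e1"])
merge_clusters(table, "c1", new_id)       # undo the split
remove_element(table, "c1", "e3")
assign_metadata("c1", "family photos")
```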
The following describes an example of a structure and processing of the data processing apparatus 1000 necessary for performing such annotation, with reference to FIG. 25. The data processing apparatus 1000 shown in FIG. 25 includes the cluster-element correspondence table updating and recording unit 1300, the temporary storage unit 1400, and a user alteration operation detection unit 1500. Note that the cluster-element correspondence table updating and recording unit 1300 and the temporary storage unit 1400 have the same specific structures as described above. Components not directly related to annotation processing are not shown in FIG. 25.
The user alteration operation detection unit 1500 notifies, upon detecting that the user starts an annotation operation, the cluster-element correspondence table updating and recording unit 1300 of the annotation operation. Upon receiving the notification, the cluster-element correspondence table updating and recording unit 1300 reads the cluster-element correspondence table obtained as a result of past classification from the temporary storage unit 1400, to enable recognition of which element data has been altered by the user and how.
The user alteration operation detection unit 1500 then sends information showing the contents of alteration actually made by the user, to the cluster-element correspondence table updating and recording unit 1300. The cluster-element correspondence table updating and recording unit 1300 updates the cluster-element correspondence table using the received information that shows the contents of alteration, and records the updated cluster-element correspondence table in the temporary storage unit 1400.
According to such a structure, it is possible to store and search for annotation results.
In a system of automatically classifying a large amount of data, not only the classification technique but also how classification results are managed is important in practical use. That is, it is necessary to manage automatic classification results by some method that facilitates search, thereby promptly presenting the results upon search. In other words, a high search speed is required. Note that the search speed is closely related to the classification result updating speed mentioned above with regard to the classification technique of sequential processing type. This is because, when partially updating the classification results, a procedure of extracting only the corresponding data, updating the data, and recording the updated data is needed.
To increase the classification result updating speed, a data management method that enables partial classification result updates is necessary. As a representative data management method satisfying such a condition, a method using a hierarchical tree structure is typically known. FIG. 26 shows an example of hierarchical classification. Each cluster is classified in a hierarchical structure, where a lower hierarchical level shows a grouping of relatively close (similar) clusters, and a higher hierarchical level shows classification of clusters in a coarser unit.
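Such a hierarchical classification can be sketched as a parent-to-children mapping, so that a partial update or search touches only one branch of the tree rather than the whole table; the node names below are illustrative.

```python
# Hypothetical hierarchical index: higher levels classify clusters in a
# coarser unit, lower levels group relatively close (similar) clusters.
tree = {
    "root": ["people", "scenery"],
    "people": ["c1", "c2"],  # similar clusters grouped together
    "scenery": ["c3"],
}

def leaves(tree, node):
    """Collect the leaf clusters under one node of the hierarchy,
    i.e. the branch that a partial update or search needs to visit."""
    children = tree.get(node)
    if children is None:
        return [node]  # a leaf cluster
    out = []
    for child in children:
        out.extend(leaves(tree, child))
    return out

leaves(tree, "people")  # visits only the "people" branch
```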
The following describes an example of a structure of the data processing apparatus 1000 necessary for performing such search, with reference to FIG. 27. The data processing apparatus 1000 shown in FIG. 27 includes a display cluster determination unit 1600, a cluster-element relation search unit 1700, a display unit 1800, and the temporary storage unit 1400. Note that the temporary storage unit 1400 has the same specific structure as described above. Components not directly related to search processing are not shown in FIG. 27.
The display cluster determination unit 1600 determines a cluster to be displayed according to a user operation or the like, and sends information of the cluster to the cluster-element relation search unit 1700. The cluster-element relation search unit 1700 reads the cluster-element correspondence table obtained as a result of past classification, from the temporary storage unit 1400. The cluster-element relation search unit 1700 performs a query using the received display target cluster, to search for element data belonging to the cluster. After the search, the cluster-element relation search unit 1700 sends display element information showing the target element data, to the display unit 1800. The display unit 1800 displays element-related information about the element data read from the temporary storage unit 1400, on the basis of the display element information.
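The query performed by the cluster-element relation search unit can be sketched as a lookup over the correspondence table; a practical system would keep an index rather than scanning linearly, so the scan below is an assumption of the sketch.

```python
def search_elements(correspondence, display_cluster):
    """Query the cluster-element correspondence table for every element
    belonging to the display-target cluster."""
    return [e for e, c in correspondence.items() if c == display_cluster]

# Correspondence table read from storage as a result of past classification.
correspondence = {"e1": "c1", "e2": "c2", "e3": "c1"}
display_elements = search_elements(correspondence, "c1")  # -> ["e1", "e3"]
```

The returned list corresponds to the display element information sent to the display unit.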
According to such a structure, automatic classification results and annotation results can be used upon search.