1. Field of the Invention
The present invention relates to an apparatus and a method for classifying a given case into a corresponding category. The present invention is used in particular in fields such as automatic document classification and image recognition, where a case is classified into its corresponding categories.
2. Description of the Related Art
In many conventional methods of classifying documents, solution cases (solution document groups) obtained by manually classifying cases (documents) are prepared in advance, and unknown cases (unknown documents) are classified using the solution cases. In these methods, a feature extraction unit, a classification rule generator, and a field judge are frequently used.
FIG. 1 shows a feature extraction unit. A feature extraction unit 101 shown in FIG. 1 receives a case (document 111), and outputs a feature vector 112 (a vector indicating features of a document). In classifying a document, each word (characteristic word) in the document is defined as a dimension wi (i=1, 2, 3, . . . ) of a feature vector space, and the number of occurrences of each word is defined as the corresponding element fi of the feature vector. For example, when a document as shown in FIG. 2 is input, a feature vector as shown in FIG. 3 is output.
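The word-count feature vector described above can be sketched as follows. This is a minimal illustration only: the function name `extract_features`, the tokenization rule (lowercased runs of letters and digits), and the sample sentence are assumptions made for this sketch and are not taken from FIG. 2 or FIG. 3.

```python
import re
from collections import Counter


def extract_features(document):
    """Return a sparse feature vector {word w_i: number of occurrences f_i}.

    Assumption: a "word" is a lowercased run of ASCII letters or digits.
    """
    words = re.findall(r"[a-z0-9]+", document.lower())
    return dict(Counter(words))


# Hypothetical input document, not the one shown in FIG. 2.
vector = extract_features("The software update fixes the software bug")
print(vector)
```

A sparse dictionary is used here because most words of the vocabulary do not occur in any one document, so only the nonzero elements fi are stored.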
FIG. 4 shows a classification rule generator. A classification rule generator 121 shown in FIG. 4 receives a set 131 of solution cases (pairs of feature vectors and categories) and outputs a classification rule 132.
The classification rule 132 is a rule for classifying cases represented by feature vectors into categories, and its format depends on the type of classification rule generator 121.
A vector model, a naïve Bayes classifier, a decision tree, a decision list, a support vector machine (SVM), boosting, etc. are proposed as the classification rule 132.
An input pair of a feature vector and a category is, for example, that shown in FIG. 5. In FIG. 5, a feature vector is associated with the category “software”.
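As one hedged illustration of a classification rule generator, the sketch below builds a very simple vector-model rule by summing the feature vectors of each category into a single profile vector. The name `generate_rule`, the profile representation, and the sample solution cases are hypothetical; the actual format of the classification rule 132 depends on the generator type, as noted above.

```python
from collections import defaultdict


def generate_rule(solution_cases):
    """Build one profile vector per category from (feature vector, category) pairs.

    Assumption: the "rule" is simply the per-category sum of word counts.
    """
    profiles = defaultdict(lambda: defaultdict(float))
    for features, category in solution_cases:
        for word, count in features.items():
            profiles[category][word] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {category: dict(words) for category, words in profiles.items()}


# Hypothetical solution cases (pairs of feature vectors and categories).
solution_cases = [
    ({"program": 2, "install": 1}, "software"),
    ({"program": 1, "compiler": 1}, "software"),
    ({"treaty": 1, "summit": 2}, "international"),
]
rule = generate_rule(solution_cases)
print(rule)
```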
FIG. 6 shows a field judge. A field judge 141 receives a case represented by a feature vector 151, and obtains a list 152 of the certainty pi with which the case belongs to each category ci. For example, when the feature vector as shown in FIG. 7 is input, the certainty list as shown in FIG. 8 is output.
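A field judge of this kind might be sketched as below, assuming that the classification rule is a set of per-category profile vectors and that the certainty pi is the cosine similarity between the input feature vector and each profile. Both assumptions, the function names, and the sample values are illustrative only; other rule types (SVM, decision tree, etc.) would compute the certainty differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def judge_field(features, rule):
    """Return a list of (category c_i, certainty p_i), highest certainty first."""
    scores = {category: cosine(features, profile) for category, profile in rule.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rule and input vector, not the values shown in FIG. 7 or FIG. 8.
rule = {
    "software": {"program": 3, "install": 1},
    "international": {"summit": 2, "treaty": 1},
}
certainties = judge_field({"program": 1, "summit": 1}, rule)
print(certainties)
```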
FIG. 9 shows a method of classifying an unknown case (document) using the above-mentioned feature extraction unit, classification rule generator, and field judge. A classification rule generation unit 161 shown in FIG. 9 includes the feature extraction unit 101 and the classification rule generator 121. A field judgment unit 162 shown in FIG. 9 includes the feature extraction unit 101 and the field judge 141.
First, a set 171 of solution cases (solution documents) classified into categories (α, β, γ) of a category system S is input to the classification rule generation unit 161, and a set of feature vectors is generated by the feature extraction unit 101. The set of feature vectors is input to the classification rule generator 121, and the classification rule 132 is generated.
Then, an unknown case (unknown document 172) is input to the field judgment unit 162 and converted to a feature vector by the feature extraction unit 101. The feature vector is input to the field judge 141, and the certainty 173 with which the unknown document 172 belongs to each of the categories α, β, and γ is obtained.
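The two-stage flow of FIG. 9 can be sketched end to end as follows. This is an illustrative composition under stated assumptions, not the method of any cited reference: word counts serve as features, per-category summed counts serve as the rule, and cosine similarity serves as the certainty. All function names, the category labels "alpha", "beta", and "gamma" (standing in for α, β, and γ), and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def extract_features(document):
    """Feature extraction unit: word -> number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", document.lower()))


def generate_rule(vectors_with_categories):
    """Classification rule generator: one summed profile vector per category."""
    profiles = {}
    for vector, category in vectors_with_categories:
        profiles.setdefault(category, Counter()).update(vector)
    return profiles


def judge_field(vector, rule):
    """Field judge: cosine similarity to each category profile."""
    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values())
        )
        return dot / norm if norm else 0.0

    return {category: cosine(vector, profile) for category, profile in rule.items()}


def classification_rule_generation_unit(solution_documents):
    """Feature extraction followed by rule generation."""
    return generate_rule(
        [(extract_features(text), category) for text, category in solution_documents]
    )


def field_judgment_unit(unknown_document, rule):
    """Feature extraction followed by field judgment."""
    return judge_field(extract_features(unknown_document), rule)


# Hypothetical solution documents for a category system with three categories.
solution_documents = [
    ("install the program and run the compiler", "alpha"),
    ("the summit produced a treaty", "beta"),
    ("the team won the match", "gamma"),
]
rule = classification_rule_generation_unit(solution_documents)
certainty = field_judgment_unit("a new compiler program", rule)
best = max(certainty, key=certainty.get)
print(certainty, best)
```

Note that the same feature extraction step is shared by both units, which is why a word absent from every solution document has no counterpart in the rule.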
The classification methods have a number of fields of application. Relating to document classification, the following are examples of methods that use a word as a feature of a document.
    (1) A document is represented by a feature vector, and is classified by an SVM learning system (for example, refer to the non-patent literature 1).
    (2) In addition to the above-mentioned system, the precision is improved by using adaptive feedback (for example, refer to the patent literature 1).
    (3) For an unknown document, the correlation based on a word extracted from the documents of a specific category is compared with the correlation based on a word extracted from a common document, and it is determined whether or not the unknown document belongs to the specific category (for example, refer to the patent literature 2).
Relating to document classification, the following are examples of methods that use features other than words.
    (1) The precision is improved by using a conjunction (for example, refer to the patent literature 3).
    (2) A portion enclosed by tags is extracted from an SGML (Standard Generalized Markup Language) document, and classification is performed using the extracted portion as a feature (for example, refer to the patent literature 4).
    (3) The precision is improved by adding link information as a feature (for example, refer to the patent literature 5, 6, and 7).
    (4) The precision is improved by using a thesaurus to learn the semantic category of a word appearing in a document (for example, refer to the patent literature 8).
However, the above-mentioned conventional case classification methods have the following problems.
To classify a document into a specified category system, it is necessary to manually generate a sufficient number of solution documents in advance. Classification precision is low when there are a small number of solution documents because features (words, etc.) appearing in an unknown document to be classified do not appear in any solution document. Even when such a feature does appear, its number of occurrences is so small that it has no statistical meaning. As a result, a feature that appears in the unknown document and should be valid cannot be used in judging the field.
For example, assume that an unknown document such as “among the nations . . . the NATO Summit and the attack against Iraq” is classified. When a word is a feature, the word “NATO” normally relates to “International”, “Military”, etc. However, with the conventional classification method, if the word “NATO” does not appear in any solution document, it does not contribute as a feature in classifying the document. When there are a small number of solution documents, a feature appearing in an unknown document often does not appear in any solution document.
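The effect described in this example can be made concrete with a small sketch. It assumes a dot-product score against a single category profile; the profile weights, the word counts, and the function name `word_contributions` are invented for illustration and do not come from any cited method.

```python
def word_contributions(features, profile):
    """Per-word contribution f_i * weight_i to a dot-product score.

    A word that is absent from the profile has weight 0, so it contributes nothing.
    """
    return {word: count * profile.get(word, 0) for word, count in features.items()}


# Hypothetical profile learned from a few solution documents; "nato" never occurred.
profile_international = {"summit": 2, "nations": 1, "treaty": 1}

# Word counts for the unknown document in the example above (illustrative values).
unknown = {"nations": 1, "nato": 1, "summit": 1, "attack": 1, "iraq": 1}

contributions = word_contributions(unknown, profile_international)
unseen = sorted(word for word, value in contributions.items() if value == 0)
print(contributions)
print(unseen)
```

In this toy setting the words “nato”, “attack”, and “iraq” are all ignored, even though a human reader would treat them as strong indicators of the field.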
Additionally, the above-mentioned conventional methods include methods of performing classification by adding a conjunction, link information, etc. as a feature. However, if such a feature appearing in an unknown document does not appear in any solution document because there are a small number of solution documents, the added feature does not improve the precision.
Furthermore, among the above-mentioned conventional methods, the method using a thesaurus excels in that a word appearing in an unknown document but not in any solution document can still contribute to the judgment of a field. However, it is not effective when a word not registered in the thesaurus appears in an unknown document to be classified. To handle this case, it is necessary to prepare a comprehensive and expensive thesaurus.
Thus, in the conventional document classification methods, a feature appearing in an unknown document rarely appears in a solution document when there are a small number of solution documents. Consequently, the classification precision is low, and a dictionary must be generated at a high cost to improve the precision.
Non-Patent Literature 1
Thorsten Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”, [online], In Proceedings of the European Conference on Machine Learning, Springer, 1998, [retrieved on Feb. 17, 2003], Internet <URL: http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf>
Patent Literature 1
    Japanese Patent Laid-open Publication No. Hei 09-026963
Patent Literature 2
    Japanese Patent Laid-open Publication No. 2000-250916
Patent Literature 3
    Japanese Patent Laid-open Publication No. Hei 11-316763
Patent Literature 4
    Japanese Patent Laid-open Publication No. Hei 10-116290
Patent Literature 5
    Japanese Patent Laid-open Publication No. 2000-181936
Patent Literature 6
    International Publication No. 99/14690 pamphlet
Patent Literature 7
    Japanese Patent Laid-open Publication No. Hei 10-254899
Patent Literature 8
    Japanese Patent Laid-open Publication No. Hei 11-039313