1. Field of the Invention
The present invention relates to natural language processing, including the classification of documents. More particularly, the invention makes it possible to extract precisely the differences between document sets, thereby improving processing performance.
2. Description of the Related Art
Document classification is a technology for classifying documents into predetermined groups, and it has become more important as the circulation of information has increased. Various methods of document classification have been studied and developed, including the vector space model, the k-nearest-neighbor method (kNN method), the naive Bayes method, the decision tree method, the support vector machine method, and the boosting method. Recent trends in document classification processing are detailed by M. Nagata and H. Hira in “TEXT CATEGORIZATION—SAMPLE FAIR OF LEARNING THEORIES”, Proceedings of the Information Processing Society of Japan, Vol. 42, No. 1 (January 2001). In every one of these classification methods, information on each document class is described in some form and is collated with an input document. This description is called the “class model” below. The class model is expressed by, for example, the average vector of the documents belonging to each class in the vector space model, the set of vectors of the documents belonging to each class in the kNN method, and a set of simple hypotheses in the boosting method. To achieve precise classification, the class model must describe each class precisely. It may be said that the high-performance classification methods proposed so far are those whose class models describe each class more precisely.
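By way of illustration only, the class model of the vector space model described above (the average vector of the documents belonging to each class) can be sketched as follows. The whitespace tokenizer, the raw term-frequency weighting, the cosine similarity used for collation, and the toy training documents are all assumptions made for this sketch and are not taken from any cited method.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector (assumed whitespace tokenizer)."""
    return Counter(text.lower().split())

def centroid(docs):
    """Class model: average of the term vectors of the documents in a class."""
    total = Counter()
    for doc in docs:
        total.update(term_vector(doc))
    return {term: count / len(docs) for term, count in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(text, class_models):
    """Collate an input document with every class model; return the best match."""
    vec = term_vector(text)
    return max(class_models, key=lambda label: cosine(vec, class_models[label]))

# Hypothetical toy training data, for illustration only.
training = {
    "sports": ["the team won the match", "the player scored a goal"],
    "finance": ["the bank raised the rate", "the market fell on rate news"],
}
class_models = {label: centroid(docs) for label, docs in training.items()}
predicted = classify("the player won a goal", class_models)
```

In the kNN method, by contrast, the class model would keep every document vector of a class rather than their average.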
However, although many of the classification methods aim at a precise description of each class by its class model, they do not take overlap between class models into account. In the vector space model or the kNN method, for example, the class model of a certain class also includes information that matches another class. If the class models overlap, an input document may also match the class model of a class to which it does not belong, which can cause a misclassification. To eliminate this cause of misclassification, the class model needs to be described by finding the distinctive information of each class, so that the overlap between class models is reduced.
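The effect of class-model overlap can be illustrated with a small sketch. The two average-vector class models, the input document, and the cosine similarity below are hypothetical values chosen for this illustration. Dropping the shared terms, as done by the assumed helper `strip_shared`, is only a naive way to make the overlap visible and is not presented as the way distinctive information is to be found.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def strip_shared(model, other):
    """Keep only the terms of `model` that do not occur in `other` (naive)."""
    return {term: w for term, w in model.items() if term not in other}

# Hypothetical class models; the common term "the" is where they overlap.
sports = {"the": 1.5, "team": 0.5, "goal": 0.5}
finance = {"the": 1.5, "rate": 1.0, "bank": 0.5}

# An input document that belongs to "sports" but is dominated by "the".
doc = {"the": 2.0, "goal": 1.0}

# With overlapping models, the wrong class still scores highly.
wrong_with_overlap = cosine(doc, finance)

# With the shared term removed, the wrong class no longer matches at all.
wrong_without_overlap = cosine(doc, strip_shared(finance, sports))
right_without_overlap = cosine(doc, strip_shared(sports, finance))
```

In this example the similarity between the input document and the wrong class comes entirely from the shared term, which is the kind of match that can lead to misclassification.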