1. Field of the Invention
The present invention relates to a document classification apparatus that classifies large documents into categories, each of which contains documents with similar contents, and more particularly to a document classification apparatus that uses keywords extracted from the bodies of documents to classify the documents automatically.
2. Description of the Related Art
When an information support service offers abundant information, the information needs to be classified and arranged so that the user can select only what is relevant. For example, several types of information support services provide users with the information they need from the variety of documents made available through the Internet and accessed by personal computers. A variety of other documents are also available, such as those owned by groups and those held by individuals. In a first type of information support service, the user enters conditions describing the necessary information, and documents conforming to those conditions are retrieved. In a second type of information support service, updated information corresponding to conditions set by the user in advance is distributed. When the volume of retrieved and/or distributed documents is large, the user has difficulty reading all of the information. However, if the retrieved and/or distributed documents are classified in advance and then presented to the user, the user can select and read only the necessary information.
There exists a classification system in which a plurality of documents are classified into a plurality of categories. In this system, each document is represented by classification keywords, which may be assigned or extracted manually. Classification is made possible by the correspondence and resemblance of classification keywords between documents. For example, in the "Document information classification method and apparatus" described in Japanese Patent Application Laid-open No. Hei-8-153121 (hereinafter referred to as "preceding reference 1"), a document is divided through, for example, morphological analysis, and keywords are then extracted. Documents with the same keywords are classified into the same category. Furthermore, the resemblance between categories is determined from the resemblance of the documents included in the categories. Several categories are then combined, thus forming a final classification system.
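The keyword-based grouping and category merging described for preceding reference 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the reference itself: whitespace tokenization with a small stop-word list stands in for morphological analysis, and Jaccard overlap of keyword sets with a hypothetical threshold stands in for the resemblance measure between categories. All function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real system would use morphological analysis.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "for", "on"}

def extract_keywords(text):
    """Return the set of lowercase tokens in text that are not stop words."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return frozenset(t for t in tokens if t and t not in STOP_WORDS)

def group_by_keywords(docs):
    """Put documents whose keyword sets are identical into one category."""
    categories = defaultdict(list)
    for doc_id, text in docs.items():
        categories[extract_keywords(text)].append(doc_id)
    return dict(categories)

def jaccard(a, b):
    """Overlap of two keyword sets (assumed resemblance measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def merge_similar(categories, threshold=0.5):
    """Greedily merge categories whose keyword sets overlap enough."""
    merged = []  # list of (keyword set, document ids)
    for keywords, doc_ids in categories.items():
        for i, (kw, ids) in enumerate(merged):
            if jaccard(keywords, kw) >= threshold:
                merged[i] = (kw | keywords, ids + doc_ids)
                break
        else:
            merged.append((keywords, list(doc_ids)))
    return merged

docs = {
    "d1": "The price of oil",
    "d2": "Oil price",
    "d3": "Soccer match results",
    "d4": "Oil price rise",
}
categories = group_by_keywords(docs)
final = merge_similar(categories)
```

In this example, d1 and d2 share an identical keyword set and fall into one category, and the category of d4 is then merged into it because the keyword sets overlap above the assumed threshold.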
In the "Automatic document classification method and apparatus, and classification dictionary generation method and apparatus" described in Japanese Patent Application Laid-open No. Hei-6-282587 (hereinafter referred to as "preceding reference 2"), when keywords are extracted, the parts of speech of the respective keywords, such as subject and object, are also extracted. Thus, even when one keyword is identical to another, the two are determined to be different if their parts of speech differ. In this reference, documents that include the same frequently appearing pair of keywords are put into the same category; thus, a one-dimensional classification is performed.
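The idea of distinguishing identical keywords by their role and grouping documents by a shared keyword pair can be sketched as follows. This is only an illustration under assumptions: the documents are assumed to be already annotated with (keyword, role) items, and "frequently appearing" is interpreted as "present in at least `min_docs` documents". The names `frequent_pairs`, `classify_by_pair`, and `min_docs` are hypothetical and do not come from the reference.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(tagged_docs, min_docs=2):
    """Return the pairs of (keyword, role) items found together in at least min_docs documents."""
    counts = Counter()
    for items in tagged_docs.values():
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_docs}

def classify_by_pair(tagged_docs, min_docs=2):
    """Map each frequent pair to the documents that contain both of its items."""
    pairs = frequent_pairs(tagged_docs, min_docs)
    categories = {}
    for doc_id, items in tagged_docs.items():
        present = set(items)
        for pair in sorted(pairs):
            if set(pair) <= present:
                categories.setdefault(pair, []).append(doc_id)
    return categories

# Hypothetical pre-annotated input: the same words appear in d3 with swapped roles.
tagged_docs = {
    "d1": [("company", "subject"), ("product", "object")],
    "d2": [("company", "subject"), ("product", "object")],
    "d3": [("product", "subject"), ("company", "object")],
}
categories = classify_by_pair(tagged_docs)
```

Here d3 contains the same two words as d1 and d2, but with the roles swapped, so it does not join their category.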
In the "Automatic document classification method, information space visualizing method, and information retrieval system" described in Japanese Patent Application Laid-open No. Hei-8-263514 (hereinafter referred to as "preceding reference 3"), keywords are extracted from each document, the occurrence frequencies of the keywords are represented as a row of weights, and each document is represented by a vector composed of that row of weights. Using these vectors, documents are classified so that documents with similar vectors are placed close to each other in a 2-dimensional matrix. The two axes of the matrix have no specific meaning and are defined equally.
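The vector representation described for preceding reference 3 can be sketched as follows. This is a minimal illustration that assumes raw occurrence counts as the weights and cosine similarity as the measure of vector similarity; the reference may use a different weighting or measure, and the placement of documents in the 2-dimensional matrix is not shown. The names `to_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a document as a row of keyword occurrence counts."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity of two weight vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["oil", "price", "soccer"]
a = to_vector("oil price oil", vocabulary)
b = to_vector("price of oil", vocabulary)
c = to_vector("soccer match", vocabulary)
```

With these inputs, `cosine(a, b)` is larger than `cosine(a, c)`, so a layout step would place the first two documents closer together than the first and third.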
In the "Document retrieval system" described in Japanese Patent Application Laid-open No. Hei-8-320881 (hereinafter referred to as "preceding reference 4"), keywords prepared by a user are used as classification keys, and documents are classified into a 2-dimensional matrix. In other words, a document that includes both a vertical-axis classification key corresponding to a row and a horizontal-axis classification key corresponding to a column of the matrix is classified into the corresponding cell of the matrix.
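The matrix classification described for preceding reference 4 can be sketched as follows. This is a minimal illustration that assumes lowercase keys and simple whitespace tokenization in place of real keyword matching; the function name `classify_into_matrix` and the sample data are hypothetical.

```python
def classify_into_matrix(docs, row_keys, col_keys):
    """Place each document in every cell whose row key and column key it contains."""
    matrix = {(r, c): [] for r in row_keys for c in col_keys}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for r in row_keys:
            for c in col_keys:
                if r in words and c in words:
                    matrix[(r, c)].append(doc_id)
    return matrix

docs = {
    "d1": "japan car exports",
    "d2": "japan steel prices",
    "d3": "germany car sales",
    "d4": "weather report",
}
matrix = classify_into_matrix(docs, ["japan", "germany"], ["car", "steel"])
```

A document that lacks either a row key or a column key, such as d4 here, lands in no cell, and a cell such as ("germany", "steel") stays empty when no document contains both keys.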
A first problem with the conventional document classification apparatus is that the classification axes are not meaningful for the objects to be classified. In the "Document information classification method and apparatus" described in the preceding reference 1 and the "Automatic document classification method and apparatus, and classification dictionary generation method and apparatus" described in the preceding reference 2, similar documents are combined to construct a classification system with a bottom-up approach. Viewed top-down, the resulting system is not always a good classification system with meaningful classification axes. In the "Automatic document classification method, information space visualizing method, and information retrieval system" described in the preceding reference 3, documents are classified into a 2-dimensional matrix whose vertical and horizontal axes have no specific meaning. Therefore, a user has difficulty making use of the classification results. Furthermore, in the "Document retrieval system" described in the preceding reference 4, classification is performed with user-selected keywords; however, the user has difficulty selecting keywords suited to the objects to be classified. If the keywords are not selected appropriately, many documents may be concentrated in the same category.
A second problem with the conventional document classification apparatus is that a plurality of classification axes cannot be combined in accordance with the objects to be classified. In the "Document information classification method and apparatus" described in the preceding reference 1 and the "Automatic document classification method and apparatus, and classification dictionary generation method and apparatus" described in the preceding reference 2, classification is made with a tree-structured system. However, classification cannot be made by combining several classification viewpoints. In the "Automatic document classification method, information space visualizing method, and information retrieval system" described in the preceding reference 3, the classification axes have no specific meaning, and no combination of several axes can exist. Because several classification axes cannot be combined to suit the objects, it is difficult to compare the distribution of documents seen from some viewpoints with that seen from other viewpoints when the classification structures differ between the two sets of viewpoints. When classification is performed into a 2-dimensional matrix using user-selected keywords, as in the "Document retrieval system" described in the preceding reference 4, classification from several viewpoints is allowed. However, not all users are able to prepare horizontal-axis keywords that appropriately correspond to the vertical-axis keywords. This may lead to a situation where few documents contain both a vertical-axis keyword and a horizontal-axis keyword. Therefore, classification suited to the objects is not necessarily achieved.