The present invention relates to a document classification method and an apparatus therefor, and more particularly to a method and an apparatus for automatic classification of Internet homepages, retrieval of literature in an electronic library, retrieval of information about patent applications, automatic classification of electronic newspaper stories, and automatic classification of multimedia information.
In the field of classification and retrieval of information, the development of an apparatus for classifying documents, sentences and text is important. For document classification, a number of categories are set in advance, and each document is classified by determining which of the categories it falls into. The results of the classification are then stored in the system, so that the system can acquire knowledge from the stored information and carry out automatic classification based upon the acquired knowledge.
In the prior art, several types of document classification systems have been proposed. The document classification system proposed by Salton et al. is well known and is disclosed in G. Salton and M. J. McGill, "Introduction to Modern Information Retrieval", New York: McGraw Hill, 1983. In this system, the cosine of the angle between the frequency vector of words in the document and the frequency vector of words in the category is regarded as a distance between the document and the category, and the document is classified into the category for which this distance is minimized.
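The cosine measure described above can be sketched as follows. This is a minimal illustration, not the implementation of Salton et al.: the frequency vectors are hypothetical, and the sketch assumes that "minimizing the distance" means choosing the category whose vector forms the smallest angle with the document vector, that is, the largest cosine.

```python
import math

def cosine_similarity(doc_freq, cat_freq):
    """Cosine of the angle between two word-frequency vectors given as dicts."""
    words = set(doc_freq) | set(cat_freq)
    dot = sum(doc_freq.get(w, 0) * cat_freq.get(w, 0) for w in words)
    norm_doc = math.sqrt(sum(v * v for v in doc_freq.values()))
    norm_cat = math.sqrt(sum(v * v for v in cat_freq.values()))
    if norm_doc == 0 or norm_cat == 0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (norm_doc * norm_cat)

# Hypothetical frequency vectors, for illustration only.
categories = {
    "baseball": {"base": 3, "pitcher": 1, "game": 3},
    "soccer": {"goal": 3, "game": 3},
}
document = {"base": 2, "pitcher": 1, "game": 1}

# Assumption: smallest angle (largest cosine) is taken as the closest category.
scores = {name: cosine_similarity(document, freq) for name, freq in categories.items()}
best = max(scores, key=scores.get)
print(best, scores)
```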
Another document classification system proposed by Guthrie et al. is also attractive. The words are sorted into clusters. This system is, for example, disclosed in Guthrie et al., "Document Classification by Machine: Theory and Practice," Proceedings of the 15th International Conference on Computational Linguistics (COLING'94), pages 1059-1063, 1994.
FIG. 1 is a block diagram illustrative of the configuration of the document classification system proposed by Guthrie et al. The document classification system comprises a document input section 505, a document classification section 503, a word cluster distribution memory section 502, a category storage section 501, and a learning section 504. Words or key-words and terms in the document are classified into word clusters so that the document is classified based upon a distribution of appearance of the word clusters. This document classification system proposed by Guthrie et al. can implement the word classification into the clusters at a higher accuracy than that of the document classification system proposed by Salton et al.
A brief description of the document classification system proposed by Guthrie et al. will be made by way of example. Suppose that two categories, "baseball" and "soccer", have been set in advance by a user. For each document, the category into which the document falls is determined, the document is classified into one of the two categories, and the resulting information is stored in the category storage section 501. In this case, one example of the frequency of appearance of the words in the documents classified into the two categories "baseball" and "soccer" is shown in the following Table 1.
TABLE 1
______________________________________
Category   base   pitcher   goal   game   spectator
______________________________________
Baseball   3      1         0      3      2
Soccer     0      0         3      3      2
______________________________________
The learning section 504 prepares a word cluster "baseball" for the category "baseball" and also prepares a word cluster "soccer" for the category "soccer". If a word did not appear in the document classified into the category "soccer" but did appear one time or more in the document classified into the category "baseball", then this word is classified into the word cluster "baseball". If, however, another word did not appear in the document classified into the category "baseball" but did appear one time or more in the document classified into the category "soccer", then this other word is classified into the word cluster "soccer". The remaining words other than the above are classified into a word cluster "the others".
As a result of the above classification, three word clusters could be obtained as shown in the following Table 2.
TABLE 2
______________________________________
Cluster "Baseball":     base, pitcher
Cluster "Soccer":       goal
Cluster "The Others":   game, spectator
______________________________________
Based upon the information about the frequency of appearance of words in individual categories, the words "base" and "pitcher" are classified into the cluster "baseball". The word "goal" is classified into the cluster "soccer". The words "game" and "spectator" are classified into the cluster "the others".
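The clustering rule described above can be sketched in a few lines. This is an illustrative reading of the rule as stated in the text, not code from Guthrie et al.; the function name `build_clusters` and the `threshold` parameter are assumptions introduced here, with `threshold=1` matching the "one time or more" condition.

```python
# Word counts per category, copied from Table 1.
counts = {
    "baseball": {"base": 3, "pitcher": 1, "goal": 0, "game": 3, "spectator": 2},
    "soccer": {"base": 0, "pitcher": 0, "goal": 3, "game": 3, "spectator": 2},
}

def build_clusters(counts, threshold=1):
    """Map each word to a category cluster or to "the others".

    A word joins a category's cluster only if it appears in that category
    at least `threshold` times and in no other category.
    """
    vocabulary = {word for freq in counts.values() for word in freq}
    clusters = {}
    for word in vocabulary:
        owners = [cat for cat, freq in counts.items() if freq.get(word, 0) > 0]
        if len(owners) == 1 and counts[owners[0]][word] >= threshold:
            clusters[word] = owners[0]
        else:
            clusters[word] = "the others"
    return clusters

clusters = build_clusters(counts)
for word in sorted(clusters):
    print(word, "->", clusters[word])
```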
Further, the frequency distribution of word cluster appearance in the documents classified into the two categories can also be obtained as shown in the following Table 3.
TABLE 3
______________________________________
Category   cluster "baseball"   cluster "soccer"   cluster "the others"
______________________________________
Baseball   4                    0                  5
Soccer     0                    3                  5
______________________________________
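The cluster frequencies of Table 3 follow from summing the word counts of Table 1 within each cluster of Table 2. A minimal sketch of that aggregation, with the helper name `cluster_counts` chosen here for illustration:

```python
from collections import Counter

# Word counts per category (Table 1) and word-to-cluster mapping (Table 2).
word_counts = {
    "baseball": {"base": 3, "pitcher": 1, "goal": 0, "game": 3, "spectator": 2},
    "soccer": {"base": 0, "pitcher": 0, "goal": 3, "game": 3, "spectator": 2},
}
word_to_cluster = {
    "base": "baseball",
    "pitcher": "baseball",
    "goal": "soccer",
    "game": "the others",
    "spectator": "the others",
}

def cluster_counts(word_freq, word_to_cluster):
    """Sum the word frequencies of one category within each word cluster."""
    totals = Counter()
    for word, n in word_freq.items():
        totals[word_to_cluster[word]] += n
    return totals

table3 = {cat: cluster_counts(freq, word_to_cluster) for cat, freq in word_counts.items()}
for cat, totals in table3.items():
    print(cat, dict(totals))
```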
The learning section 504 associates the word cluster distributions with the categories, estimates the probability distribution of the word clusters in each category by use of a Laplace estimation, and stores the estimated word cluster distributions in the word cluster distribution memory section 502.
The probability parameters are estimated by the Laplace estimation according to the following equation:

$$P(X = x) = \frac{f(X = x) + 0.5}{F + 0.5k} \qquad (1)$$

where "P(X=x)" means the probability of appearance of "x", "f(X=x)" means the number of times "x" appears in F observations, and "k" is the number of kinds of values that "X" can take.
Applying equation (1) to the frequencies of Table 3 gives the estimated probability distribution of word cluster appearance in the documents classified into the two categories, as shown in the following Table 4.
TABLE 4
______________________________________
Category   cluster "baseball"   cluster "soccer"   cluster "the others"
______________________________________
Baseball   0.43                 0.05               0.52
Soccer     0.05                 0.37               0.58
______________________________________
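As a check on the arithmetic, the following sketch applies equation (1) to the cluster frequencies of Table 3, with k = 3 word clusters. The function name `laplace_estimate` and the parameter `alpha` (set to the constant 0.5 of equation (1)) are names introduced here for illustration.

```python
def laplace_estimate(freqs, alpha=0.5):
    """Equation (1): P(X=x) = (f(X=x) + alpha) / (F + alpha * k)."""
    total = sum(freqs.values())  # F, the number of observations
    k = len(freqs)               # number of kinds of values of X
    return {x: (f + alpha) / (total + alpha * k) for x, f in freqs.items()}

# Cluster frequencies per category, copied from Table 3.
table3 = {
    "baseball": {"baseball": 4, "soccer": 0, "the others": 5},
    "soccer": {"baseball": 0, "soccer": 3, "the others": 5},
}

table4 = {cat: laplace_estimate(freqs) for cat, freqs in table3.items()}
rounded = {cat: {c: round(p, 2) for c, p in dist.items()} for cat, dist in table4.items()}
for cat, dist in rounded.items():
    print(cat, dist)
```

For the category "baseball", for example, F = 9, so the cluster "baseball" receives (4 + 0.5) / (9 + 1.5), which rounds to 0.43.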
For document classification, the document classification section 503 receives a new document from the document input section 505 and refers to the word cluster distribution of each category stored in the word cluster distribution memory section 502. The inputted document is regarded as data, the probability that this data would be generated from the word cluster distribution of each category is calculated, and the inputted document is classified into the category giving the largest probability. An example of this process is as follows.
The document classification section 503 receives a document consisting of the words "spectator", "pitcher", "base", "base" and "goal". The document classification section 503 replaces each word appearing in the inputted document with the word cluster into which the word falls, forming the data: cluster "the others", cluster "baseball", cluster "baseball", cluster "baseball" and cluster "soccer".
The document classification section 503 then reads the word cluster distributions of the categories "baseball" and "soccer", shown in Table 4, from the word cluster distribution memory section 502. Assuming that the above data is generated from each of these word cluster distributions, the probability of the data under each category is calculated as follows.

$$\log P(\text{data} \mid \text{category "baseball"}) = \log 0.52 + \log 0.43 + \log 0.43 + \log 0.43 + \log 0.05 = -8.92$$

$$\log P(\text{data} \mid \text{category "soccer"}) = \log 0.58 + \log 0.05 + \log 0.05 + \log 0.05 + \log 0.37 = -15.19$$

where the calculation is carried out on the logarithm of the probability (the values shown correspond to logarithms to base 2).
Since the probability under the category "baseball" is larger than the probability under the category "soccer", the inputted document is classified into the category "baseball".
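The classification step above can be reproduced with a short sketch. It is an illustration rather than the original system's code: the helper name `log_likelihood` is introduced here, and the use of base-2 logarithms is inferred from the printed values -8.92 and -15.19 rather than stated in the source.

```python
import math

# Word-to-cluster mapping (Table 2) and cluster probabilities (Table 4).
word_to_cluster = {
    "base": "baseball",
    "pitcher": "baseball",
    "goal": "soccer",
    "game": "the others",
    "spectator": "the others",
}
cluster_probs = {
    "baseball": {"baseball": 0.43, "soccer": 0.05, "the others": 0.52},
    "soccer": {"baseball": 0.05, "soccer": 0.37, "the others": 0.58},
}

def log_likelihood(words, category):
    """Sum of log2 cluster probabilities of the document's words under a category."""
    probs = cluster_probs[category]
    # Assumption: base 2, chosen because it matches the values printed in the text.
    return sum(math.log2(probs[word_to_cluster[w]]) for w in words)

document = ["spectator", "pitcher", "base", "base", "goal"]
scores = {cat: log_likelihood(document, cat) for cat in cluster_probs}
best = max(scores, key=scores.get)
print(best, {cat: round(s, 2) for cat, s in scores.items()})
```

Working with logarithms turns the product of probabilities into a sum, which avoids numerical underflow for long documents and does not change which category is largest.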
The above document classification system proposed by Guthrie et al. has the following three problems.
The first problem is that the words classified into the same word cluster are equivalently processed.
For example, the words "base" and "pitcher" are classified into the same word cluster "baseball". If either of the words "base" and "pitcher" appears, the word cluster "baseball" is regarded as having appeared. If, however, the word "base" appears more frequently than the word "pitcher" in the documents of the category "baseball", then a new document in which the word "base" appears should ideally be classified into the category "baseball" with higher accuracy and confidence than a new document in which the word "pitcher" appears. In practice, the document classification system proposed by Guthrie et al. cannot make this distinction.
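The first problem can be seen numerically. In the sketch below, which reuses the values of Tables 2 and 4 and assumes base-2 logarithms, a one-word document "base" and a one-word document "pitcher" receive exactly the same evidence for the category "baseball", although Table 1 shows "base" three times and "pitcher" only once. The helper name `margin` is introduced here for illustration.

```python
import math

# Word-to-cluster mapping (Table 2) and cluster probabilities (Table 4).
word_to_cluster = {"base": "baseball", "pitcher": "baseball", "goal": "soccer"}
cluster_probs = {
    "baseball": {"baseball": 0.43, "soccer": 0.05, "the others": 0.52},
    "soccer": {"baseball": 0.05, "soccer": 0.37, "the others": 0.58},
}

def margin(word):
    """Log2 evidence for "baseball" over "soccer" contributed by a single word."""
    cluster = word_to_cluster[word]
    return (math.log2(cluster_probs["baseball"][cluster])
            - math.log2(cluster_probs["soccer"][cluster]))

margin_base = margin("base")
margin_pitcher = margin("pitcher")
# Both words map to the same cluster, so their evidence is identical.
print(margin_base, margin_pitcher, margin_base == margin_pitcher)
```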
The second problem is that it is difficult to set a threshold value of word appearance frequency when the word cluster is prepared.
In the above document classification system proposed by Guthrie et al., if a word does not appear in the document classified into the category "soccer" but does appear N times or more in the document classified into the category "baseball", then the word is classified into the word cluster "baseball". On the other hand, if a word does not appear in the document classified into the category "baseball" but does appear N times or more in the document classified into the category "soccer", then the word is classified into the word cluster "soccer".
In the above case, setting the threshold value "N" is an important and difficult issue. If the threshold value "N" is large, then the numbers of words classified into the word cluster "baseball" and the word cluster "soccer" decrease, whilst the number of words classified into the word cluster "the others" increases. As a result, in many cases, it is difficult to judge which category the inputted documents fall into.
On the other hand, if the threshold value "N" is small, for example N=1, then the number of words classified into the word clusters "baseball" and "soccer" increases. However, a word appearing only once and a word appearing many times are then dealt with equivalently. This means that the accuracy of classification is low.
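The effect of the threshold can be illustrated with the counts of Table 1. The sketch below counts how many words fall into each cluster for several values of N; the function name `cluster_sizes` and the choice of N = 1, 2 and 4 are assumptions made for illustration only.

```python
# Word counts per category, copied from Table 1.
counts = {
    "baseball": {"base": 3, "pitcher": 1, "goal": 0, "game": 3, "spectator": 2},
    "soccer": {"base": 0, "pitcher": 0, "goal": 3, "game": 3, "spectator": 2},
}

def cluster_sizes(counts, threshold):
    """Number of words per cluster when a word needs `threshold` appearances
    in exactly one category to join that category's cluster."""
    sizes = {cat: 0 for cat in counts}
    sizes["the others"] = 0
    vocabulary = {word for freq in counts.values() for word in freq}
    for word in vocabulary:
        owners = [cat for cat, freq in counts.items() if freq.get(word, 0) > 0]
        if len(owners) == 1 and counts[owners[0]][word] >= threshold:
            sizes[owners[0]] += 1
        else:
            sizes["the others"] += 1
    return sizes

for n in (1, 2, 4):
    print("N =", n, cluster_sizes(counts, n))
```

As N grows, words drain out of the category clusters into "the others", so fewer words in a new document carry any category-specific evidence.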
The third problem is that if a word appears in the documents classified into the plural categories but appears to be biased in the document classified into one category, then it is difficult to effectively utilize the word.
Assume that the words appearing in the documents of the categories "baseball" and "soccer" and the frequencies of the appearances thereof are shown in the following Table 5.
TABLE 5
______________________________________
Category   base   pitcher   goal   kick   spectator
______________________________________
Baseball   3      1         1      0      2
Soccer     0      0         3      1      2
______________________________________
With reference to the Table 5, the word "goal" mainly appears in the document of the category "soccer" but also appears in the document of the category "baseball".
In the above case, the document classification system proposed by Guthrie et al. classifies the word "goal" into the word cluster "the others". This means that the word "goal" cannot be used as evidence for classifying a document in which it appears into the category "soccer".
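The third problem can be checked against Table 5 with the same clustering rule. In this sketch, the helper name `cluster_of` is an assumption introduced for illustration. The word "goal" appears in the category "soccer" three times out of four, yet it is assigned to "the others" because it also appears once in the category "baseball".

```python
# Word counts per category, copied from Table 5.
counts = {
    "baseball": {"base": 3, "pitcher": 1, "goal": 1, "kick": 0, "spectator": 2},
    "soccer": {"base": 0, "pitcher": 0, "goal": 3, "kick": 1, "spectator": 2},
}

def cluster_of(word, counts, threshold=1):
    """Cluster of a word: a category only if the word is exclusive to it."""
    owners = [cat for cat, freq in counts.items() if freq.get(word, 0) > 0]
    if len(owners) == 1 and counts[owners[0]][word] >= threshold:
        return owners[0]
    return "the others"

goal_cluster = cluster_of("goal", counts)
kick_cluster = cluster_of("kick", counts)

# Share of all "goal" occurrences that come from the category "soccer".
goal_total = sum(freq["goal"] for freq in counts.values())
goal_share_soccer = counts["soccer"]["goal"] / goal_total

print(goal_cluster, kick_cluster, goal_share_soccer)
```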
In view of the above circumstances, it had been required to develop a novel document classification system free from the above problems and disadvantages.