The present invention relates generally to a technology for categorizing large amounts of text information and the like. More particularly, this invention relates to a technology that improves categorizing accuracy and efficiency by selecting, from among a plurality of categorizing methods, the categorizing method having the highest categorizing accuracy.
Recently, a huge amount of text information has become easily available through the Internet. Hence, a technique is desired that can grasp the contents of this huge amount of text information and efficiently extract the necessary text information from it. If the text information is categorized into predetermined categories, searching becomes convenient when the text information is used at a later stage or when related text information is to be found.
Conventionally, such huge amounts of text information have been categorized manually: a person in charge of categorizing, an originator of the text information, or a person using the text information judges the contents of new text information and assigns it to the optimum category in a categorizing system consisting of a plurality of categories. As another categorizing method, the content of new text information is analyzed by a computer system, and the text information is automatically categorized into the corresponding category based on the analysis result. With the former categorizing method, the cost is extremely high; with the latter, there are problems with the number of categories and the categorizing accuracy required to obtain practical results. Accordingly, means and methods for effectively solving these problems have been earnestly desired.
At present, a large amount of computerized text information is in circulation, and categorizing text information based on its meaning has become an important problem from the standpoint of efficient search and use of the text information. As a means for solving this problem, information categorizing apparatuses that automatically execute the categorizing operation on text information have been used in every field.
Moreover, various methods for deriving a categorizing method from categorizing examples of given text information, and thereafter categorizing new text information based on that categorizing method, have been disclosed in, for example, Japanese Patent Application Laid-Open Nos. 11-328211, 1-296552, 11-167581, 11-161671 and the like. Conventional categorizing methods are listed below:
(1) a statistical categorizing method based on a stochastic model;
(2) a categorizing method for performing automatic categorizing by means of learning; and
(3) a categorizing method for performing automatic categorizing by preparing a rule for categorizing text information into each category, and using this rule.
The categorizing method of (1) can find a general categorizing tendency, but cannot find a fine categorizing tendency. The categorizing method of (2) can obtain high categorizing accuracy when the number of categories is fewer than several tens, but the categorizing accuracy decreases when the number increases to several tens or more. Furthermore, the categorizing method of (3) requires a huge cost for preparing and maintaining the rules. As described above, each of the categorizing methods of (1) to (3) has both merits and demerits.
FIG. 18 is a block diagram showing the construction of a conventional information categorizing apparatus. In this figure, categorizing sample data 2 is category-related correct data comprising a plurality of texts, in which it is predetermined which text is to be categorized into which category. A feature element extraction section 1 extracts, from each text in the categorizing sample data 2, feature elements (words) that represent the feature of each category.
Here, when extracting the feature elements, it is necessary to efficiently extract those feature elements that increase the ability to discriminate each category. Therefore, the feature element extraction section 1 uses a feature element extraction method that increases the discrimination ability based on the frequency of appearance of the feature element. A plurality of such feature element extraction methods have heretofore been proposed. Moreover, as for the attribute of the feature element, a method such as specifying several parts of speech is adopted.
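As an illustration only, frequency-based extraction of feature elements might be sketched as follows. The source does not specify a scoring formula, so the score used here (the squared in-category count divided by the overall count), the function name extract_feature_elements, the parameter top_k, and the sample texts are all assumptions made for this sketch.

```python
from collections import Counter


def extract_feature_elements(samples, top_k=2):
    """Return {category: [feature words]} from (text, category) pairs.

    Hypothetical scoring: a word scores higher the more often it appears in
    a category and the more its appearances are concentrated there.
    """
    per_category = {}
    overall = Counter()
    for text, category in samples:
        words = text.lower().split()
        per_category.setdefault(category, Counter()).update(words)
        overall.update(words)

    features = {}
    for category, counts in per_category.items():
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(
            counts,
            key=lambda w: (-(counts[w] ** 2) / overall[w], w),
        )
        features[category] = ranked[:top_k]
    return features


# Made-up categorizing sample data: each text has a predetermined category.
samples = [
    ("stock market prices rise", "finance"),
    ("market shares and stock funds", "finance"),
    ("team wins the football match", "sports"),
    ("football team loses match", "sports"),
]
features = extract_feature_elements(samples)
```

Other scoring rules, or a restriction to particular parts of speech, could be substituted without changing the overall shape of the sketch.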
A categorizing learning information generation section 3 calculates the feature of each category from the feature elements extracted by the feature element extraction section 1, and generates categorizing learning information 4 as the result. A plurality of methods have heretofore been proposed as the categorizing learning method used in this categorizing learning information generation section 3. The categorizing learning information 4 is information representing the correspondence between the situation of the feature elements and the category. An automatic categorizing section 5 categorizes a new text group 6, consisting of a plurality of texts to be categorized, into categories by means of one categorizing method fixedly set up in advance, based on the categorizing learning information 4, and outputs categorizing result data 7.
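The flow from categorizing sample data to categorizing result data in such a conventional apparatus can be sketched roughly as below. This is a minimal sketch under stated assumptions: the learning information is represented as per-category word counts, and the single fixed categorizing method is a simple word-overlap score. Neither choice comes from the source, and the function names and sample texts are hypothetical.

```python
from collections import Counter


def generate_learning_information(samples):
    """Build {category: Counter of word counts} from (text, category) pairs."""
    learning_info = {}
    for text, category in samples:
        learning_info.setdefault(category, Counter()).update(text.lower().split())
    return learning_info


def categorize(text, learning_info):
    """One fixed method (assumed): choose the category with the largest overlap."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(
        sorted(learning_info),
        key=lambda category: sum(learning_info[category][w] for w in words),
    )


# Made-up categorizing sample data with predetermined categories.
samples = [
    ("stock market prices rise", "finance"),
    ("market shares and stock funds", "finance"),
    ("team wins the football match", "sports"),
    ("football team loses match", "sports"),
]
learning_info = generate_learning_information(samples)

# New text group to be categorized, and the resulting categorizing result data.
new_text_group = ["stock prices fall", "football match tonight"]
result = {text: categorize(text, learning_info) for text in new_text_group}
```

Because categorize is fixed in advance, its accuracy depends entirely on how well that one method suits the new text group.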
As described above, in the conventional information categorizing apparatus (see FIG. 18), a plurality of methods exist as the feature element extraction method used in the feature element extraction section 1. However, since the categorizing accuracy of the categorizing result data 7 changes depending on the content and quantity of the new text group 6 to be categorized, it is difficult to uniquely specify a versatile extraction method that maintains high categorizing accuracy for new text groups 6 of various contents and quantities.
Similarly, a plurality of categorizing learning methods exist for the categorizing learning information generation section 3. However, since the categorizing accuracy of the categorizing result data 7 changes depending on the content and quantity of the new text group 6 to be categorized, it is difficult to uniquely specify a versatile categorizing learning method that maintains high categorizing accuracy. Accordingly, in the conventional information categorizing apparatus, one of the plurality of categorizing methods (feature element extraction method, categorizing learning method) is inevitably used in a fixed manner.
Therefore, in the conventional information categorizing apparatus, the new text group 6 is categorized by one fixed categorizing method. This causes a problem in that the categorizing accuracy varies depending on the content and quantity of the new text group 6, resulting in low categorizing accuracy.
It is an object of the present invention to provide a method and apparatus for categorizing information, which can increase the categorizing accuracy, regardless of the content and quantity of the information to be categorized.
In the method and apparatus for categorizing information according to the present invention, a plurality of categorizing methods are kept in a usable condition. A categorizing method determination unit determines, based on the categorizing sample information, the categorizing method having the highest categorizing accuracy from among the plurality of categorizing methods, and a new text group is then categorized for each category according to this categorizing method. As a result, the categorizing accuracy can be increased compared to the conventional apparatus, regardless of the content and quantity of the information to be categorized.
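The selection step described above might be sketched as follows. This is not the claimed implementation. The two candidate methods (a word-overlap method and a majority-category baseline), the use of held-out sample texts to measure categorizing accuracy, and all names are assumptions introduced for illustration.

```python
from collections import Counter


def train_word_overlap(samples):
    """Candidate method 1 (assumed): per-category word counts, scored by overlap."""
    counts = {}
    for text, category in samples:
        counts.setdefault(category, Counter()).update(text.lower().split())

    def classify(text):
        words = text.lower().split()
        return max(sorted(counts), key=lambda c: sum(counts[c][w] for w in words))

    return classify


def train_majority(samples):
    """Candidate method 2 (assumed baseline): always the most frequent category."""
    majority = Counter(category for _, category in samples).most_common(1)[0][0]
    return lambda text: majority


def accuracy(classify, labelled):
    """Fraction of labelled texts that the classifier categorizes correctly."""
    return sum(classify(text) == category for text, category in labelled) / len(labelled)


def determine_method(methods, train, held_out):
    """Return the name of the most accurate method and all measured scores."""
    scores = {name: accuracy(trainer(train), held_out) for name, trainer in methods.items()}
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Made-up categorizing sample information, split into training and held-out parts.
train = [
    ("stock market prices rise", "finance"),
    ("market shares and stock funds", "finance"),
    ("team wins the football match", "sports"),
    ("football team loses match", "sports"),
]
held_out = [("stock funds rise", "finance"), ("football team wins", "sports")]

methods = {"word_overlap": train_word_overlap, "majority": train_majority}
best, scores = determine_method(methods, train, held_out)

# Categorize a new text group with the method that was determined to be best.
classify = methods[best](train)
result = [classify(text) for text in ["market prices", "match tonight"]]
```

The point of the sketch is the determine_method step: every candidate stays available, and the choice is made from the sample information rather than being fixed in advance.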
The computer readable recording medium according to the present invention records a computer program which, when executed on a computer, realizes each and every step of the method according to the present invention. As a result, the method according to the present invention can be realized very easily and automatically.
Other objects and features of this invention will become apparent from the following description with reference to the accompanying drawings.