1. Field of the Invention
The present invention relates to a text classification apparatus and method which learn text classification knowledge from texts that have been manually assigned categories, and which use that knowledge to classify texts. More particularly, the invention relates to a cross lingual text classification apparatus and method for classifying texts written in a plurality of languages.
2. Description of the Related Art
The proliferation of word processors and personal computers (PCs) has allowed most texts to be created electronically, resulting in an increased amount of electronic text which can be handled on computers. Automatic text classification has been developed as one technology for coping with this situation. Automatic text classification uses text classification knowledge learned from labeled texts (texts which have been manually assigned categories) to assign appropriate categories to unlabeled texts (texts which have not been assigned categories).
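By way of illustration only, the learning and classification steps described above can be sketched as follows. The sketch assumes a multinomial naive Bayes learner with add-one smoothing and whitespace tokenization; neither is specified by the related art, and the function names `learn` and `classify` and the sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def learn(labeled_texts):
    """Learn classification knowledge from (text, category) pairs.

    Assumption: the "knowledge" is per-category word counts plus
    per-category document counts, as used by naive Bayes.
    """
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_texts:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, knowledge):
    """Assign the most probable category to an unlabeled text."""
    word_counts, doc_counts, vocabulary = knowledge
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        # Log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothed log likelihood of the word in this category.
            score += math.log(
                (word_counts[category][word] + 1)
                / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled texts (manually assigned categories).
LABELED = [
    ("stock market prices fall", "economy"),
    ("bank raises interest rates", "economy"),
    ("team wins the final match", "sports"),
    ("player scores in the match", "sports"),
]

knowledge = learn(LABELED)
print(classify("interest rates and stock prices", knowledge))
```

In this sketch, all labeled texts and all unlabeled texts are in one language; a text whose words never occur in the labeled texts provides no evidence for any category.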
Conventionally, text classification has been used to classify texts in a single language, for example, Japanese. However, with the current popularization of the Internet and advancing globalization, there is an increasing need to handle texts in a plurality of languages. To meet this need, JP-A-9-6799, for example, discloses a text classification apparatus and a text search apparatus which classify texts independently of a particular language using a manually created concept dictionary. On the other hand, JP-A-2003-76710 discloses another approach, in a cross lingual information retrieval system which translates a query using a bilingual dictionary. The basic idea of JP-A-2003-76710 can be applied to text classification by translating texts into a single language, for example, English, and then classifying the translated texts.
The conventional approaches have the following problems.
(1) System Based on Manually Created Concept Dictionary:
A system based on a manually created concept dictionary requires the concept dictionary to be created by hand, which is extremely difficult and entails a cost high enough to make construction of the system infeasible. The difficulty is particularly severe when the system is intended to cover a wide range of fields.
(2) System Based on Translation Using Bilingual Dictionary:
The translation-based system fails to translate certain words if the bilingual dictionary does not have sufficient coverage, resulting in degraded classification accuracy. Since increasing the coverage of the bilingual dictionary entails a high cost, as is the case with the concept dictionary, this system is infeasible if it presupposes a bilingual dictionary with high coverage.
The bilingual dictionary also poses the problem of ambiguity, meaning that one entry word may have a plurality of equivalents. Generally, the translation-based system relies on machine translation either to select one of the possible equivalents of a word or to translate the word into all possible equivalents. The former is more susceptible to leakage, in that the appropriate equivalent may be left out, while the latter is more susceptible to noise, in that inappropriate equivalents are included.
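By way of illustration only, the coverage and ambiguity problems described above can be sketched as follows. The Spanish-to-English dictionary, the function names, and the rule of taking the first listed equivalent are all hypothetical simplifications; an actual machine translation component would select an equivalent by other means.

```python
# Hypothetical toy bilingual dictionary (Spanish to English).
# "banco" is ambiguous; "hipoteca" is deliberately missing (coverage gap).
BILINGUAL_DICT = {
    "banco": ["bench", "bank"],
    "mercado": ["market"],
}


def translate_select_one(words, dictionary):
    """Translate each word into a single equivalent.

    Assumption: the first listed equivalent is selected. If that choice
    is wrong, the appropriate equivalent is lost (leakage).
    """
    translated, untranslated = [], []
    for word in words:
        equivalents = dictionary.get(word)
        if not equivalents:
            untranslated.append(word)  # not covered by the dictionary
        else:
            translated.append(equivalents[0])
    return translated, untranslated


def translate_all(words, dictionary):
    """Translate each word into all of its equivalents.

    The appropriate equivalent is kept, but inappropriate ones are
    included as well (noise).
    """
    translated, untranslated = [], []
    for word in words:
        equivalents = dictionary.get(word)
        if not equivalents:
            untranslated.append(word)  # not covered by the dictionary
        else:
            translated.extend(equivalents)
    return translated, untranslated


# A financial text: the intended sense of "banco" is "bank".
words = ["banco", "mercado", "hipoteca"]
print(translate_select_one(words, BILINGUAL_DICT))
print(translate_all(words, BILINGUAL_DICT))
```

In this sketch, selecting one equivalent drops "bank" from the financial text, translating into all equivalents adds the unrelated "bench", and in both cases "hipoteca" remains untranslated because the dictionary does not cover it.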