The invention relates to a method for generating descriptors for the classification of natural language texts.
Classifying a text means assigning it to a specific text class, and it is an important preprocessing step for the further automatic processing of texts. A preceding classification is of considerable importance in particular for the automatic interpretation of texts, because it considerably limits the expenditure for the knowledge base that needs to be maintained (e.g., dictionary memory and syntactic and semantic structure definitions) and can greatly increase the recognition performance.
Text classification can be divided roughly into two steps, namely the extraction of descriptors and, based on these, the assignment to a class. The selection of the descriptors is of essential importance and is a problem especially for natural language texts with a large variety of word forms.
For texts in the English language, which has little morphological variation, the use of complete word forms or phrases is proposed in "Feature Selection and Feature Extraction for Text Categorization" by D. Lewis in Proc. of Speech and Natural Language Workshop 1992. For classification tasks in morphologically richer languages, word segments can be used as descriptors: for example, the text is broken down into n-grams in "N-Gram-Based Text Categorization" by Cavnar/Trenkle in Proc. of Int. Symp. on Document Analysis and Information Retrieval 1994, or word forms are reduced to basic forms in "Using IR Techniques for Text Classification in Document Analysis" by R. Hoch in Proc. of SIGIR, 1994.
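By way of illustration only, the n-gram breakdown mentioned above can be sketched as follows. This is a minimal sketch and not the procedure of the cited publication: the function names `char_ngrams` and `ngram_descriptors`, the use of a single underscore as a word-boundary marker, the splitting on whitespace, and the choice of n = 2 and n = 3 are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return all contiguous character n-grams of a word.

    The word is padded with one '_' on each side to mark its boundaries
    (an assumption of this sketch, not taken from the cited work).
    """
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_descriptors(text, n_values=(2, 3)):
    """Count n-gram descriptors over all whitespace-separated tokens."""
    counts = Counter()
    for token in text.lower().split():
        for n in n_values:
            counts.update(char_ngrams(token, n))
    return counts


if __name__ == "__main__":
    # Related word forms share most of their n-grams.
    print(char_ngrams("text", 3))
    print(ngram_descriptors("text texts", n_values=(3,)))
```

In such a sketch, related word forms such as "text" and "texts" share most of their n-grams, which is why word segments are attractive as descriptors for languages with many word forms; at the same time, every word contributes several n-grams for each value of n.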
While the n-gram breakdown results in a very large number of descriptors, the reduction to basic forms requires an expensive analysis for the preparation of the necessary knowledge base. The known procedures are also susceptible to errors in the examined texts, such as typing errors or recognition errors arising in character recognition or speech recognition.