1. Field of the Invention
The present invention relates generally to a method for classifying text by significant words in the text.
2. Description of the Related Art
From the reference A. Dengel et al., `Office Maid--A System for Office Mail Analysis, Interpretation and Delivery`, Int. Workshop on Document Analysis Systems, a system is known by means of which, for example, business letter documents can be categorized and can then be forwarded, or stored selectively, in electronic form or paper form. For this purpose, the system contains a unit for segmenting the layout of the document, a unit for optical text recognition, a unit for address detection and a unit for contents analysis and categorization. For the segmentation of the document, a mixed bottom-up and top-down approach is used, the individual steps of which are:
Recognition of the contiguous components,
Recognition of the text lines,
Recognition of the letter segments,
Recognition of the word segments, and
Recognition of the paragraph segments.
The optical text recognition is divided into three parts:
Letter recognition in combination with lexicon-based word verification,
Word recognition, with the classification from letters and word-based recognition.
The address recognition is performed by means of a unification-based parser which operates with an attributed context-free grammar for addresses. Accordingly, text parts correctly parsed in the sense of the address grammar are the addresses. The contents of the addresses are determined via character equations of the grammar. The method is described in the reference M. Malburg and A. Dengel, `Address Verification in Structured Documents for Automatic Mail Delivery`.
Information retrieval techniques for the automatic indexing of texts are used for the contents analysis and categorization. In detail, this takes place as follows:
Morphological analysis of the words,
Elimination of stop words,
Generation of word statistics, and
Calculation of the index term weight by means of formulas known from information retrieval such as, for example, inverse document frequency.
The index term weights calculated in this manner are then used to determine, for each category, a three-level list of significant words which characterizes that category. As described in the above-cited reference A. Dengel et al., these lists are then manually revised after the training phase.
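The index term weighting described above can be illustrated by a minimal sketch. It assumes the common information retrieval form of inverse document frequency, log(N/df), which the cited references may or may not use in exactly this form. The stop word list, the function names and the sample documents are hypothetical, and the morphological analysis is assumed to have been performed already, so that the tokens are base forms.

```python
import math
from collections import Counter

# Hypothetical stop word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def index_terms(tokens):
    """Eliminate stop words; tokens are assumed to be base forms already."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def idf_weights(documents):
    """Return log(N / df) per term, df being the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Word statistics: count each term at most once per document.
        doc_freq.update(set(index_terms(tokens)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical training documents, already tokenized.
docs = [
    ["the", "invoice", "is", "due"],
    ["the", "offer", "and", "the", "invoice"],
    ["complaint", "of", "a", "customer"],
]
weights = idf_weights(docs)
```

In this sketch a term occurring in few documents, such as "offer", receives a higher weight than a term occurring in several documents, such as "invoice". How the weights are divided into the three significance levels is not specified here and is therefore not shown.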
A new business letter is then categorized by comparing the index terms of this letter with the lists of the significant words for all categories. For each category, the weights of the index terms contained in the letter are multiplied by a constant depending on significance and are added together. Dividing this sum by the number of index terms in the letter then results in a probability for each class. The detailed calculations are found in the reference R. Hoch, `Using IR Techniques for Text Classification in Document Analysis`. The result of the contents analysis is then a list of hypotheses sorted according to probabilities.
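The categorization step described above can be sketched as follows. This is an illustration under stated assumptions, not the calculation of the cited reference: the function name, the representation of the three-level lists as a mapping from term to level, and the numeric constants per significance level are all hypothetical.

```python
def score_categories(letter_terms, term_weights, category_lists, level_factors):
    """Score every category for one letter and return hypotheses, best first.

    letter_terms:   index terms of the letter
    term_weights:   index term weight per term
    category_lists: category -> {significant word: significance level}
    level_factors:  significance level -> constant (hypothetical values)
    """
    n_terms = len(letter_terms)
    scores = {}
    for category, levels in category_lists.items():
        total = 0.0
        for term in letter_terms:
            level = levels.get(term)
            if level is not None:
                # Weight multiplied by the constant for the term's significance.
                total += term_weights.get(term, 0.0) * level_factors[level]
        # Sum divided by the number of index terms in the letter.
        scores[category] = total / n_terms if n_terms else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example data.
letter = ["invoice", "due", "customer", "payment"]
term_weights = {"invoice": 0.5, "due": 1.0, "customer": 1.0}
category_lists = {
    "billing": {"invoice": 1, "due": 2},
    "complaints": {"customer": 1},
}
level_factors = {1: 3.0, 2: 2.0, 3: 1.0}

hypotheses = score_categories(letter, term_weights, category_lists, level_factors)
# hypotheses == [("billing", 0.875), ("complaints", 0.75)]
```

With these illustrative numbers, "billing" scores (0.5 x 3.0 + 1.0 x 2.0) / 4 = 0.875 and "complaints" scores (1.0 x 3.0) / 4 = 0.75, so "billing" heads the sorted list of hypotheses. Note that such a quotient is not guaranteed to lie between 0 and 1 for arbitrary constants; the sketch simply follows the description above in treating it as the value assigned to each class.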