1. Field of Invention
The present invention relates to a method and apparatus for document classification and, more particularly, to a method and apparatus for classifying a document according to language and topic.
2. Description of Related Art
Optical character recognition, and its use to convert scanned image data into text data suitable for processing by a digital computer, is well known. The types of errors that such conversion methods generate are also well known. However, the selection of a proper method for error correction is highly dependent upon the language of the document.
Conventional methods for optical character recognition, and for error correction in optical character recognition systems, assume either that the language of the document is known in advance or that the document is in the language of the country in which the system is being used. For example, in the United States, conventional optical character recognition systems assume that the document is in English. Alternatively, an optical character recognition system can be implemented with character recognition and error resolution methods for a plurality of languages.
An optical character recognition system has been developed that automatically determines the language of a document. The system generates word shape tokens from an image of the text and determines the frequency of appearance of those word shape tokens that correspond to a set of predetermined word shape tokens. The system then converts the frequency of appearance rates to a point in a new coordinate space, determines which predetermined language region of the new coordinate space the point is closest to, and thereby determines the language of the text. However, this system has not been able to achieve high accuracy because it does not take into account the quality of the document image.
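The language determination described above can be illustrated with a minimal sketch. The sketch rests on several assumptions that are not taken from the system described: the character-to-shape-class coding in `shape_token` is hypothetical, the tokens are derived from already-decoded characters rather than from an image, the frequency vector is used directly as the point instead of being converted into a new coordinate space, and each language region is represented by a single hypothetical centroid compared by Euclidean distance.

```python
import math
from collections import Counter

# Hypothetical shape classes; the coding used by the described system is not specified here.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")

def shape_token(word):
    """Map each character of a word to a coarse shape class (assumed coding)."""
    out = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")   # tall character
        elif ch in DESCENDERS:
            out.append("g")   # character with a descender
        elif ch == "i":
            out.append("i")
        elif ch == "j":
            out.append("j")
        elif ch.isalpha():
            out.append("x")   # x-height character
        # punctuation and other symbols are ignored in this sketch
    return "".join(out)

def frequency_point(text, vocabulary):
    """Return the appearance rate of each predetermined token as a point."""
    tokens = [tok for tok in (shape_token(w) for w in text.split()) if tok]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[tok] / total for tok in vocabulary]

def nearest_language(point, centroids):
    """Pick the language whose (assumed) region centroid is closest to the point."""
    return min(centroids, key=lambda lang: math.dist(point, centroids[lang]))
```

In use, `frequency_point` would be called with a predetermined token vocabulary, and the resulting point would be passed to `nearest_language` together with one centroid per candidate language. The centroids would have to be derived from training documents; none are supplied here.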
Another system has been developed that categorizes documents into topic categories. This system generates word shape tokens representing words appearing in the document, eliminates certain unimportant word shape tokens, and ranks the remaining word shape tokens according to their frequency of appearance. These frequencies are then used to categorize the document as being written on a specific topic. However, this system also has not been able to achieve high accuracy because it likewise does not take into account the quality of the document image.
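The topic categorization described above can likewise be sketched in a few lines. This is an illustration under stated assumptions rather than the method of the system described: the stop-token set, the per-topic token profiles, and the overlap score used to select a topic are all hypothetical, and the word shape tokens are assumed to have been generated already.

```python
from collections import Counter

def rank_tokens(tokens, stop_tokens, top_n=3):
    """Drop unimportant tokens, then rank the rest by frequency of appearance."""
    counts = Counter(tok for tok in tokens if tok not in stop_tokens)
    return [tok for tok, _ in counts.most_common(top_n)]

def categorize(ranked, topic_profiles):
    """Choose the topic whose profile shares the most tokens with the ranking.

    The overlap score is an assumption made for this sketch; the described
    system only states that the frequencies are used to categorize.
    """
    scores = {
        topic: len(set(ranked) & set(profile))
        for topic, profile in topic_profiles.items()
    }
    return max(scores, key=scores.get)
```

Here `stop_tokens` plays the role of the eliminated unimportant word shape tokens, and `topic_profiles` stands in for whatever per-topic reference data the system maintains.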
Therefore, it has not been possible to achieve high accuracy in topic or language categorization of documents, because document image quality varies widely and the systems described above do not take that quality into account.