1. Field of the Invention
The present invention relates to a method and apparatus for automatic document recognition and, more particularly, to a method for automatically converting character cells of a document to abstract character codes and word tokens.
2. Description of Related Art
Optical character recognition, and its use to convert scanned image data into text data suitable for use in a digital computer, is well known. In addition, methods for converting scanned image data into text data, and the types of errors such methods generate, are well known. However, the selection of a proper method for error correction is highly dependent upon the language of the document. Conventionally, the methods for optical character recognition and for error correction in optical character recognition systems have been provided on the assumption that the language used in the document is known in advance. An optical character recognition system can be implemented with the character recognition and error resolution methods for a plurality of languages.
However, it has heretofore not been possible to have the optical character recognition system automatically determine the language of the document. Rather, as each document is provided to the optical character recognition system, some indication of the particular language of the document must also be provided to the system. This has been accomplished either by having the operator input data concerning the language and script of the document to the optical character recognition system, or by providing the document with special markings which indicate the language of the document.