1. Field of the Invention
The present invention relates to a method and apparatus for automatic document recognition and, more particularly, to a method for automatically determining the language(s) of a document.
2. Description of Related Art
Optical character recognition, and its use to convert scanned image data into text data suitable for use in a digital computer, is well known. In addition, methods for converting scanned image data into text data, and the types of errors such methods generate, are well known. However, the selection of a proper method for error correction is highly dependent upon the language of the document. Conventionally, the methods for optical character recognition and for error correction in optical character recognition systems have been provided on the assumption that the language used in the document either is known in advance or is the language of the country in which the system is being used. That is, in the United States, conventional optical character recognition systems would assume that the document is in English. Alternatively, an optical character recognition system can be implemented with the character recognition and error resolution methods for a plurality of languages.
However, it has heretofore not been possible to have the optical character recognition system automatically determine the language of the document. Rather, as each document is provided to the optical character recognition system, some indication of the particular language of the document must also be provided to the system. This has been accomplished either by having the operator input data concerning the language of the document to the optical character recognition system, or by providing the document with special markings that indicate the language of the document.