1. Field of the Invention
The present invention relates to the field of electronic text processing of text derived from optical character recognition (OCR) or intelligent character recognition (ICR) devices. More specifically, this invention relates to a method and apparatus for generating reasonably possible electronic text from a text containing possible errors derived from images by optical character recognition or intelligent character recognition, and to selecting the correct text from the set of possible texts.
2. Discussion of Related Art
Document processing systems employing optical character recognition (OCR) and intelligent character recognition (ICR) devices for scanning and storing the contents of documents are well known in the art. In a typical document processing system of this nature, documents are fed into a transport scanning device which serially scans each document, stores the data and passes the document to other devices for further processing. The scanned image of each document is converted into a bit-map, i.e., digitized image data, of the entire document. The bit-mapped image data is then transmitted to a character recognition engine where the image data is analyzed in an attempt to convert desired portions of the image data into discrete electronic text characters through character recognition. If the data is successfully recognized as one or more alphanumeric characters, it is transformed into discrete alphanumeric characters for storage and future processing. For example, data thus converted into the alphanumeric characters can be stored in a conventional computer database for future access and/or electronic processing without the need to further physically handle the original documents.
Document processors employing OCR and ICR devices have been utilized to facilitate processing of pre-formatted business forms with some degree of success. For example, such processors are currently used to read information printed on checks. Automated scanning and processing of checks is advantageous because the information on checks is contained within one or more discrete fields and all of the data to be scanned is of the same type, i.e., all numerals.
However, while the use of such document processors has long offered the potential for significantly reducing costly manual information processing, in practice, OCR and ICR document processors have only enjoyed limited application because they are prone to yield inaccurate results. Restated, the full benefits of wholly automated information processing have heretofore been significantly limited by the ability of OCR and ICR based document processors to accurately recognize the data contained on the above-mentioned forms.
In particular, the OCR and ICR art has continued to struggle with the problem of automated recognition of handwritten data and data of mixed alphanumeric characters.
Accurate recognition of handwriting has proven to be a particularly elusive goal due to the unconstrained nature of handwriting and the large variety of handwriting styles. Thus, character recognition errors continue to severely limit the utility of document processors employing optical character recognition devices where the information to be processed has been handwritten on documents. The main errors that occur in processing are substitution errors, which occur when a given character being analyzed is incorrectly identified as one or more other characters. Substitution errors include (1) incorrect identification of a single character as a different character; (2) incorrect identification of a single character as multiple characters; and (3) incorrect identification of multiple characters as a single character. Because the recognition device always yields some data when a substitution error occurs, substitution errors can be difficult to detect.
Methods for attempting to correct such errors are known, but these methods are extremely limited and require excessive amounts of human intervention. First, errors are typically checked for only on a one-to-one character replacement basis. Substitutions such as one-to-many, many-to-one and one-to-none character substitutions are not checked, which severely limits the ability of the method to determine the correction.
Further, the correction methods typically involve querying a user for correction of the error, often presenting the user with an image of the error along with a set of possible corrections derived from a standard dictionary database. See, for example, U.S. Pat. No. 6,005,973, describing a method in which the process gathers the most likely character sequences associated with the error, and presents the results of the method to the user for selection of the correct character sequence.
Sometimes the dictionary database is able to correct the error to the correct text based upon a high level of confidence that it could be the only correction possible. This is only the case, however, when the converted text contains only alphabetical text that would be found in the dictionary. Where the text contains mixed alphabetical and numerical text, for example as might be found with part numbers, product codes, etc., the query to the dictionary always fails, and this prior art methodology is thus inadequate to deal with such text without frequent human interaction. Presenting the error to a human operator for rectification, however, makes the process extremely expensive and time consuming.
U.S. Pat. No. 5,850,480 describes methods of correcting optical character recognition errors occurring during recognition of character sequences contained within one or more predetermined types of character fields. The methods may be practiced with a document processing system having (1) an optical character recognition device for scanning documents and outputting bit-map image data; (2) a recognition engine for converting the bit-map image data into possibly correct alphanumeric characters with associated confidence values; and (3) at least one lexicon of character sequences consisting of a list of at least a portion of all of the possible character sequence values for each of the fields being processed. OCR errors are corrected by performing a contextual comparison analysis between the alphanumeric characters outputted from the recognition engine and the lexicon of character sequences. However, this method is designed to work only with specific types of text entered into specific fields of a form, for example address fields; it looks at letters and numbers separately instead of mixed alphanumeric text; and it requires assignment of confidence levels to order possible text for selection by a user.
Thus, there exists a need in the art for OCR error correction methods and apparatus capable of enhancing the accuracy of optical character recognition of machine-print and hand-print, particularly print of mixed alphanumeric characters, requiring a reduced level of human intervention for correction.
It is therefore an object of the present invention to provide an improved method, and apparatus for conducting the method, for generating versions of reasonably possible text given the text version with errors from ICR/OCR devices, particularly of text that may be of mixed alphanumeric type. It is a still further object of the present invention to conduct the method so as to reduce the amount of human intervention required in correcting converted text with errors to correct text.
It is another object of the present invention to provide a method of deriving a set of possible correct texts from converted text with errors, and apparatus for conducting the method, in which the character substitutions examined by the method include not only one-to-one character substitutions but also, for example, one-to-many, many-to-many, many-to-one and one-to-none character substitutions, so that the set of possible correct texts includes a larger number of possible texts of varying lengths, and thus is more likely to include the correct text within the generated set of possible texts.
These and other objects of the invention are achieved by the methods of the present invention, which provide a systematic way of generating versions of reasonably possible electronic text given the ICR/OCR version with errors. Each of these new possible versions of the image converted electronic text can then be used for matching to a database or another text source, with an error only being declared to an operator after the list of possible texts has been exhausted without a match, or with multiple matches. The method may not completely remove human intervention, but the need for such intervention is greatly reduced. A great advantage of the method of generating the possible forms of the text is that it extends the obvious substitution method (e.g., one-to-one, such as the number zero for the letter O or vice versa) by using one-to-many, many-to-one, many-to-many and one-to-none substitutions based on commonly occurring errors.
These and other objects of the invention are thus achieved by a method of, and apparatus for, deriving a set of possible text sequences for character sequences of converted text, comprising receiving as input a converted text sequence from a character recognition device; comparing a character sequence comprised of one or more in-sequence characters of the converted text sequence to a first table containing either unidirectional or bi-directional substitution sequences to obtain a set of substitution sequences associated with the character sequence; and subsequently comparing the character sequence to a second table containing either unidirectional or bi-directional substitution sequences, wherein if the first table is a unidirectional table then the second table is a bi-directional table and if the first table is a bi-directional table then the second table is a unidirectional table, to obtain any additional possible substitution sequences associated with the character sequence; the obtained character sequence and associated substitution sequences representing the set of possible text sequences for the character sequence of the converted text.
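By way of illustration only, the two-table lookup described above can be sketched in Python as follows. The table contents, the names UNIDIRECTIONAL, BIDIRECTIONAL_PAIRS and possible_sequences, and the reading of a unidirectional entry as applying in one direction only and a bi-directional entry as applying in both directions are assumptions made for this sketch; none of them is taken from the disclosure.

```python
# Hypothetical unidirectional table: the recognized sequence (key) may stand
# for any listed sequence, but the reverse substitution is not implied.
UNIDIRECTIONAL = {
    "rn": ["m"],   # many-to-one
    "cl": ["d"],   # many-to-one
    "|": [""],     # one-to-none (stray mark)
}

# Hypothetical bi-directional pairs: either member may replace the other.
BIDIRECTIONAL_PAIRS = [("0", "O"), ("1", "I"), ("5", "S")]


def build_bidirectional(pairs):
    """Expand each pair into lookups that work in both directions."""
    table = {}
    for a, b in pairs:
        table.setdefault(a, []).append(b)
        table.setdefault(b, []).append(a)
    return table


BIDIRECTIONAL = build_bidirectional(BIDIRECTIONAL_PAIRS)


def possible_sequences(seq, first=UNIDIRECTIONAL, second=BIDIRECTIONAL):
    """Return seq followed by its substitutions from the first, then the second table."""
    result = [seq]
    for table in (first, second):
        for sub in table.get(seq, []):
            if sub not in result:
                result.append(sub)
    return result


print(possible_sequences("0"))
print(possible_sequences("rn"))
```

In this sketch the order of the two tables can be swapped by passing them in the opposite order, which mirrors the alternative in which the bi-directional table is consulted first.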
These and other objects are thus also achieved by a method of, and apparatus for, matching converted text from a character recognition device to correct text, comprising receiving as input a length of converted text sequence from a character recognition device; evaluating the length of converted text sequence and determining possible erroneous character sequences comprised of one or more in-sequence character sequences; comparing character sequences comprised of one or more in-sequence character sequences of the converted text sequence to at least one table containing substitution sequences to obtain a set of substitution sequences associated with each such character sequence evaluated, thereby obtaining a master group of possible text sequences for the length of the converted text sequence; and comparing the master group of possible text sequences for the length of the converted text sequence to an external database for a match. This method can also be carried out only upon those character sequences determined to have a likelihood of being erroneous.
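The matching method above can likewise be sketched, again only as an illustration under stated assumptions. The single merged table SUBSTITUTIONS, its entries, the function names candidate_texts and match, and the use of a plain Python set as the external database are all hypothetical. The sketch rewrites each position of the converted text at most once and returns a match only when exactly one candidate is found in the database.

```python
# Hypothetical substitution table; entries are illustrative only.
SUBSTITUTIONS = {
    "0": ["O"], "O": ["0"],
    "1": ["I"], "I": ["1"],
    "rn": ["m"],   # many-to-one
    "|": [""],     # one-to-none
}
MAX_KEY_LEN = max(len(key) for key in SUBSTITUTIONS)


def candidate_texts(text):
    """Build the master group: every text obtained by optionally replacing
    in-sequence character runs of the converted text with table entries."""
    results = set()

    def expand(pos, prefix):
        if pos == len(text):
            results.add(prefix)
            return
        # Option 1: keep the current character unchanged.
        expand(pos + 1, prefix + text[pos])
        # Option 2: replace a run of 1..MAX_KEY_LEN characters starting here.
        for n in range(1, MAX_KEY_LEN + 1):
            chunk = text[pos:pos + n]
            if len(chunk) < n:
                break
            for sub in SUBSTITUTIONS.get(chunk, []):
                expand(pos + n, prefix + sub)

    expand(0, "")
    return results


def match(text, database):
    """Return the single database entry found in the master group, or None
    when there is no match or more than one (operator review needed)."""
    hits = sorted(candidate_texts(text) & set(database))
    return hits[0] if len(hits) == 1 else None


print(match("P0rn1", {"POm1", "XYZ"}))
```

The number of candidates grows quickly with the length of the text, so restricting the expansion to those character sequences judged likely to be erroneous, as noted above, keeps the master group small.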