Existing OCR engines may suffer from low accuracy rates. For many applications where OCR may be highly desirable, the error rates of commonly available OCR engines may be unacceptably high, even for relatively simple documents. Additionally, when dealing with documents having complex layouts and contents, the best presently available OCR engines may still have a relatively low recognition rate. Therefore, an improved method of post-processing OCR output data is desirable. Post-processing systems may be used to attempt to correct these errors, thereby improving the quality of the text.
Frequent OCR errors include too many erroneous characters within a word and word segmentation errors, but other errors, such as systematic misrecognition of particular strings, may exist as well.
One common post-processing method involves comparing the OCR data to a dictionary. The dictionary may contain commonly occurring character strings as terms in addition to words, depending on the application. Various methods may be used to determine appropriate corrections for OCR data not matching terms in the dictionary, but often more than one equivalent correction may exist for a given character string. Dictionary methods may also have particular difficulties dealing with numeric data, acronyms, and proper names.
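The ambiguity of dictionary-based correction can be illustrated with a minimal sketch. The toy dictionary, the function name `suggest_corrections`, and the similarity cutoff are assumptions chosen for illustration, and the matching uses Python's standard `difflib` module rather than any particular OCR post-processing method.

```python
import difflib

# Hypothetical toy dictionary; a real system would load a far larger lexicon.
DICTIONARY = ["form", "farm", "from", "firm", "text", "word"]

def suggest_corrections(token, dictionary=DICTIONARY, max_candidates=5, cutoff=0.6):
    """Return [token] if it is a dictionary term, else a list of close dictionary terms."""
    lowered = token.lower()
    if lowered in dictionary:
        return [token]
    # get_close_matches ranks dictionary terms by a similarity ratio in [0, 1].
    return difflib.get_close_matches(lowered, dictionary, n=max_candidates, cutoff=cutoff)

candidates = suggest_corrections("fcrm")
print(candidates)
```

For the misrecognized string "fcrm", several dictionary terms score equally well, so the dictionary alone may not determine which correction is intended.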
Many existing post-processing systems assume machine-recognized text to have a high recognition rate. For example, in most systems, a numeric string recognizer may be used such that numeric strings are simply bypassed without any further processing. However, in practice, not all numeric characters will be recognized correctly (e.g., “3000” may be recognized as “300o”). In such cases, it is desirable for a post-processing scheme to correctly recognize a string such as “300o” as numeric and to provide a correction.
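One possible way to treat a string such as “300o” as numeric is sketched below. The confusion table `DIGIT_LOOKALIKES`, the function name `correct_numeric`, and the 50% digit threshold are assumptions made for illustration; they are not taken from any particular system.

```python
# Hypothetical confusion table: letters an OCR engine might emit in place of digits.
DIGIT_LOOKALIKES = {"o": "0", "O": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
DIGITS = "0123456789"

def correct_numeric(token, min_digit_ratio=0.5):
    """Return a corrected digit string if the token looks numeric, else None."""
    if not token:
        return None
    digit_count = sum(ch in DIGITS for ch in token)
    # Require that at least the given share of characters are true digits.
    if digit_count / len(token) < min_digit_ratio:
        return None
    corrected = []
    for ch in token:
        if ch in DIGITS:
            corrected.append(ch)
        elif ch in DIGIT_LOOKALIKES:
            corrected.append(DIGIT_LOOKALIKES[ch])
        else:
            # A character that is neither a digit nor a known look-alike:
            # do not treat the token as numeric.
            return None
    return "".join(corrected)

print(correct_numeric("300o"))
print(correct_numeric("hello"))
```

Under these assumptions, a token is bypassed only after it has been normalized, rather than being skipped because it failed a strict all-digits test.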
Similarly, acronyms in inaccurate OCR data may not be properly recognized as such by many post-processing systems. Further, errors occurring in a proper noun may be difficult to detect. In many post-processing systems, only the morphology of acronyms and proper nouns is used to detect these character strings. For example, an acronym may be defined as “a string of three to six uppercase letters, bounded by non-uppercase letters.” This definition is often useful for acronyms, but it is generally too limited to detect proper nouns. A surname or given name, for example, frequently does not exist in a lexicon, but it is desirable for these names to be recognized as proper nouns.
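The morphological acronym definition quoted above may be expressed as a regular expression. The following is a minimal sketch; the names `ACRONYM_PATTERN` and `find_acronyms` are assumptions introduced for illustration.

```python
import re

# Three to six uppercase letters, not preceded or followed by another uppercase letter.
ACRONYM_PATTERN = re.compile(r"(?<![A-Z])[A-Z]{3,6}(?![A-Z])")

def find_acronyms(text):
    """Return all substrings matching the morphological acronym definition."""
    return ACRONYM_PATTERN.findall(text)

print(find_acronyms("The OCR engine at IBM read Smith and ABCDEFGH."))
print(find_acronyms("The 0CR engine"))
```

The second call illustrates the limitation: once a single character of the acronym is misrecognized (here “O” as “0”), the purely morphological rule no longer detects it, and a proper noun such as “Smith” is never matched at all.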
Stochastic n-gram models have been proposed as powerful and flexible methods to parse the text. In “Adaptive Post-Processing of OCR Text via Knowledge Acquisition,” ACM 1991 Computer Science Conference, Liu et al. used a tri-gram method to detect possible error characters in a word. If OCR output data has a high accuracy level, this method is reasonably efficient, but the method is less efficient for less accurate data sets.
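The general idea of using character tri-grams to locate possible error characters may be sketched as follows. This is a simplified, count-based illustration and is not the method published by Liu et al.; the training word list, the padding symbols, and the function names are all assumptions.

```python
from collections import Counter

def char_trigrams(word):
    """Return the character tri-grams of a word, padded with start and end markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train_trigram_counts(words):
    """Count tri-gram occurrences over a list of known-good words."""
    counts = Counter()
    for word in words:
        counts.update(char_trigrams(word))
    return counts

def suspicious_trigrams(word, counts, min_count=1):
    """Return the tri-grams of a word seen fewer than min_count times in training."""
    return [t for t in char_trigrams(word) if counts[t] < min_count]

# Hypothetical training vocabulary.
counts = train_trigram_counts(["recognition", "recognize", "character", "correct", "text"])
print(suspicious_trigrams("rec0gnition", counts))
print(suspicious_trigrams("recognition", counts))
```

In this sketch, the unseen tri-grams cluster around the misrecognized character. When many characters in a word are wrong, most tri-grams become unseen and the error positions may be harder to isolate, which is consistent with the weaker performance on less accurate data noted above.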