The identification of products using computer readable bar codes, wherein digital data is recorded directly on paper, provides for item identification given a fixed set of values using simple numeric encoding and scanning technologies. Identification of computer generated and stored documents is another technology which has been developed using binary encoding to identify and provide for retrieval of stored documents. Most document-generating software programs provide not only identification and/or retrieval information for the document, but also include encoded information for provision to an associated printer specifying, for example, such details as spacing, margins and related layout information. Once the document has been printed on paper, however, that information no longer accompanies the document, other than as discerned by the user. If it is desired to reproduce the document using an optical character recognition (OCR) system, there is no automatic means by which to communicate the layout information through the scanner and to the receiving computer. A desirable extension of the identification technology would be, therefore, the provision of a means for generating a paper version of a document which can be recognized, reproduced and proof-read by a computer by optically scanning a marker incorporated in or on the paper document in conjunction with the OCR text scanning of the document.
Document or product identification systems which have been employed in the past include bar code markers and scanners, which have found use in a wide range of arenas. With respect to paper documents, special marks or patterns in the paper have been used to provide information to a related piece of equipment, for example the job control sheet for image processing as taught by Hikawa in U.S. Pat. No. 5,051,779. Similarly, identifying marks have been incorporated into forms as described in U.S. Pat. No. 5,060,980 of Johnson, et al. The Johnson, et al. system provides for the editing of forms which are already resident in the computer. A paper copy of the form is edited by the user and then scanned to provide insertions to the fields of the duplicate form that is stored electronically in the computer. Still another recently patented system is described in U.S. Pat. No. 5,091,966 of Bloomberg, et al., which teaches the decoding of glyph shape codes, which codes are digitally encoded data on paper. The identifying codes can be read by the computer and thereby facilitate computer handling of the document, such as identifying, retrieving and transmitting the document. The systems described in the art do not incorporate text error detection or correction schemes. Further, the systems require that the associated computer have a copy of the document of interest in its memory prior to the input of information via the scanning. The systems cannot be applied to documents which are being created in the scanning computer by OCR.
Optical character recognition systems, as illustrated schematically in FIG. 1, generally include a digitizing scanner, 16, and associated "scanning" computer, 18, for scanning a printed page, 14, which was generated by an originating computer, 12, and output by a printer, 13. The scanner, 16, extracts the text to be saved, as electronic document 15, in a standard electronic format, such as ASCII. What is desirable is to additionally incorporate information about the text for error detection and about the layout thereof, which information can be optically scanned or otherwise automatically input.
Due to the inherent limitations in both the scanning process and the ability of an optical character recognition system to effect accurate character recognition, errors are introduced into the output, including not only character misinterpretation errors but also layout-dependent errors. Post-processing, specifically error detection, must then be performed, primarily by human proof-reading of the reproduced document. Errors in layout are ordinarily not automatically rectifiable by the computer; but, rather, require extensive, user-intensive editing or possibly re-creation of the document. The human post-processing is expensive not only in terms of actual costs but also in the time needed to complete the processed document. Optimally, solutions will provide not only a means for detecting errors but also a means for correcting the errors. Further, an ideal solution should facilitate identification of the document and define the appropriate layout structure for the document.
Error detection systems which have been employed in the computer document creation technology (e.g., word processing) include techniques based on dictionary lookup and/or attempts to use semantic, or context, information extracted from the document in order to identify and correct errors. Many of these systems require that entries in the document which do not correlate to an entry in the lexicon be reviewed by a "human post-processor". The automated error correction version of a dictionary-based system will, upon identifying such an entry, spontaneously correct it. One can readily envision a circumstance wherein automatic correction is not desirable, such as in the case of a proper name, an intentional misspelling or a newly coined term. The use of dictionary-comparison versions of such systems presumes that each entry in the entire document is compared to a data-base dictionary of terms. The cost of comparison of each entry of a document to a given lexicon is quite high. Streamlined error detection and correction, without the need for entry-by-entry comparison, is desirable.
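By way of illustration only, the entry-by-entry dictionary lookup described above can be sketched as follows. The lexicon, the sample sentence and the function name `flag_unknown` are hypothetical and are not taken from any of the cited systems; the sketch merely shows that every entry must be compared against the lexicon and that a proper name is flagged alongside a genuine recognition error.

```python
import re

# Hypothetical, deliberately tiny lexicon; a real one would hold many thousands of terms.
LEXICON = {"the", "scanner", "read", "as", "a", "name"}

def flag_unknown(text, lexicon):
    """Return every entry of the text that has no match in the lexicon.

    Each entry is looked up individually, which is the per-entry cost
    the passage above describes.
    """
    entries = re.findall(r"[A-Za-z0-9]+", text)
    return [entry for entry in entries if entry.lower() not in lexicon]

# "narne" is a recognition error for "name"; "Hikawa" is a correct proper name.
flagged = flag_unknown("The scanner read Hikawa as a narne", LEXICON)
```

Both "Hikawa" and "narne" are returned, so a human post-processor, or an automatic corrector, cannot tell from the lookup alone which of the two is actually wrong.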
The use of semantic information extracted from the document is further proposed in the art in order to facilitate the identification and automatic correction of errors that have been detected but which cannot be readily identified as misspellings of available dictionary terms or which "resemble" more than one available dictionary entry. Such a system will recognize and correct the term "ofthe" to "of the" when a dictionary lookup would simply reject the term. Similarly, a bank of commonly-occurring errors for the hardware or software being used, and for the font or fonts being scanned, has been proposed for use with the context, or semantic, information in order to identify and automatically correct common errors, such as "rn" being incorrectly identified as "m", or the letter "o" being incorrectly identified as the number "0".
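A minimal sketch of the common-error approach described above is given below, under stated assumptions. The lexicon, the confusion table and the function names are hypothetical; the sketch tries one substitution or one word split at a time and corrects an entry only when exactly one dictionary term results, leaving ambiguous entries for human review.

```python
# Hypothetical lexicon and confusion table, for illustration only.
LEXICON = {"of", "the", "modern", "book", "corner"}

# Each pair maps what the OCR system emitted to what was probably printed.
CONFUSIONS = [("m", "rn"), ("0", "o")]

def substitution_candidates(token):
    """Yield variants of the token with a single confusion reversed."""
    for seen, printed in CONFUSIONS:
        start = token.find(seen)
        while start != -1:
            yield token[:start] + printed + token[start + len(seen):]
            start = token.find(seen, start + 1)

def split_candidates(token):
    """Yield two-word readings of the token, e.g. "ofthe" -> "of the"."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in LEXICON and right in LEXICON:
            yield left + " " + right

def correct(token):
    """Return the corrected entry, or None when no unique correction exists."""
    if token in LEXICON:
        return token
    options = {c for c in substitution_candidates(token) if c in LEXICON}
    options.update(split_candidates(token))
    if len(options) == 1:
        return options.pop()
    return None  # unknown or ambiguous: defer to a human post-processor
```

With these assumed tables, `correct("ofthe")` yields "of the", `correct("modem")` yields "modern" and `correct("b0ok")` yields "book", while an entry with no dictionary match under any single substitution returns `None`.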
To detect errors without requiring an entry-by-entry lookup, particularly for documents which are transmitted over extended networks, systems have made use of parity bits transmitted with the data. Once the transmission has been effected, a bit count is done on the "new" document. If the parity bit calculated from the received data matches the transmitted parity bit, then an error-free transmission is assumed. Such systems, and extensions of the parity and check bit concept, as taught in U.S. Pat. No. 5,068,854 of Chandran, et al., are useful for detecting errors in digitally encoded information. Further extensions of the parity bit concept, such as balanced weight error correcting codes, to detect and provide correction of more than a one-bit error are also found in the art, such as in U.S. Pat. No. 4,965,883 of Kirby. Parity and check bit systems developed for use with binary coded information are capable of ascertaining the presence of errors with reasonable accuracy, given the low probability that the check bit calculated from erroneously-received data will match the check bit of the transmitted material. Since the bits are calculated on binary-encoded data, they are most effective for detecting one-bit errors, except as modified in the weighted balancing and random checking instances. Generally speaking, however, the check and parity bit systems tend to be data-independent methods for assuring error-free transmission of computer-to-computer transfers. The check and parity bit systems are not, therefore, considered thorough checking systems but merely first screening techniques which are limited to digital-to-digital communications and not applicable to analog-to-digital conversions such as optical character recognition.
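The parity check described above can be illustrated with a short sketch. It assumes a single even-parity bit computed over the whole message, which is the simplest form of the technique and is not taken from any of the cited patents; the function name and sample data are hypothetical.

```python
def parity_bit(data: bytes) -> int:
    """Return 1 if the number of set bits in the data is odd, otherwise 0."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

sent = b"of the"
sent_parity = parity_bit(sent)  # transmitted alongside the data

# Simulate a one-bit error and a two-bit error in the first byte.
one_bit_error = bytes([sent[0] ^ 0x01]) + sent[1:]
two_bit_error = bytes([sent[0] ^ 0x03]) + sent[1:]

# The receiver recomputes the parity bit and compares it with the one sent.
single_detected = parity_bit(one_bit_error) != sent_parity
double_detected = parity_bit(two_bit_error) != sent_parity
```

Here `single_detected` is true and `double_detected` is false: the one-bit error changes the parity, whereas the two flipped bits cancel, which is why such a check is most effective for one-bit errors.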
A further prior art system, providing a 16-bit check sequence which is data-dependent and calculated on the contents of the data field, is found in U.S. Pat. No. 4,964,127 of Calvignac, et al. Once again, the system is applied to data which is transmitted along a data path, presumably in digital format.
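A data-dependent 16-bit check sequence of the general kind referred to above can be sketched as follows. The text does not state how the Calvignac, et al. sequence is computed; as a stand-in, the sketch assumes the CRC-CCITT calculation provided by `binascii.crc_hqx` in the Python standard library, and the function name `check_sequence` is hypothetical.

```python
import binascii

def check_sequence(data: bytes) -> int:
    """Return a 16-bit check value calculated on the contents of the data field.

    Assumption: CRC-CCITT with an initial value of 0, used here only as a
    representative data-dependent check, not as the patented method.
    """
    return binascii.crc_hqx(data, 0)

original = b"form"
transposed = b"from"  # same bytes, two adjacent characters exchanged

original_check = check_sequence(original)
transposed_check = check_sequence(transposed)
mismatch = original_check != transposed_check
```

Because the two words contain the same bytes, a single parity bit over the whole field is identical for both, whereas the 16-bit values differ and `mismatch` is true. The check is nonetheless calculated on digitally encoded data, as noted above.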
In the field of optical character recognition (OCR), there is a similar need to provide the means for detecting and correcting errors in data which has been reproduced from optical scanning, bit mapping and computer encoding. Both dictionary lookup and common-error reference have been proposed for use in the OCR context. However, as with the document creation needs of the past, the entry-by-entry checking is both costly and inefficient. Moreover, in addition to the printed words, the document layout is a critical feature in OCR. The use of current parity bit check systems in an optically-scanned, bit-mapped system is only nominally effective for error detection, relatively ineffective for error location and totally ineffective for detection and correction of improper layout.
Apparatus for identifying and correcting "unrecognizable" characters in OCR machines is taught in U.S. Pat. No. 4,974,260 of Rudak. In that system, the characters which are not recognized, in the electronic dictionary lookup operation, are selectively displayed for an operator to effect interpretation and correction. More fully automated OCR error detection and correction is desirable, but not currently available.
It is therefore an objective of the present invention to provide a means and method for automatically incorporating information markers on a paper document, which information is encoded to provide a variety of detail about the document to an associated computer.
It is another objective of the invention to establish the absence or presence of errors on a page reproduced using OCR technology without requiring an entry-by-entry comparison.
It is another objective of the invention to provide an error detection system and method for precisely locating errors on a page reproduced using OCR technology.
It is still another objective of the invention to provide an error detection system which can be used in conjunction with existing error correction systems to screen a document for errors prior to effecting error correction procedures.
Another objective of the invention is to provide an automatic error correction means and method for documents reproduced using OCR technology.
It is yet another objective of the invention to provide an error detection system which can overlook intentional misspellings, abbreviations, etc.
It is a further objective of the invention to provide an error detection system which can be used with any document format, fonts, and related hardware.
It is yet another objective of the invention to provide a means for providing documents with unique markers which can be used to impart various information to computers.
Still another objective of the present invention is to provide a means and method for supplying documents with computer-readable markers which contain information about the document including document structure, error identification, location and correction information, and document identification and retrieval information.