The identification of products using computer readable bar codes, wherein digital data is recorded directly on paper, provides for item identification given a fixed set of values using simple numeric encoding and scanning technologies. Identification of computer generated and stored documents is another technology which has been developed using binary encoding to identify and provide for retrieval of stored documents. Most document-generating software programs provide not only identification and/or retrieval information for the document, but also include encoded information for provision to an associated printer specifying, for example, such details as spacing, margins and related layout information. Once the document has been printed on paper, however, that information no longer accompanies the document, other than as discerned by the user. If it is desired to reproduce the document using an optical character recognition (OCR) system, there is no automatic means by which to communicate the layout information through the scanner and to the receiving computer. A desirable extension of the identification technology would be, therefore, the provision of a means for generating a paper version of a document which can be recognized, reproduced and proofread by a computer by optically scanning a marker incorporated in or on the paper document in conjunction with the OCR text scanning of the document.
Document or product identification systems which have been employed in the past include bar code markers and scanners which have found use in a wide range of arenas. With respect to paper documents, special marks or patterns in the paper have been used to provide information to a related piece of equipment, for example the job control sheet for image processing as taught by Hikawa in U.S. Pat. No. 5,051,779. Similarly, identifying marks have been incorporated into forms as described in U.S. Pat. No. 5,060,980 of Johnson, et al. The Johnson, et al. system provides for the editing of forms which are already resident in the computer. A paper copy of the form is edited by the user and then scanned to provide insertions to the fields of the duplicate form that is stored electronically in the computer. Still another recently patented system is described in U.S. Pat. No. 5,091,966 of Bloomberg, et al., which teaches the decoding of glyph shape codes, which codes are digitally encoded data on paper. The identifying codes can be read by the computer and thereby facilitate computer handling of the document, such as identifying, retrieving and transmitting the document. The systems described in the art do not incorporate text error detection or correction schemes. Further, the systems require that the associated computer have a copy of the document of interest in its memory prior to the input of information via the scanning. The systems cannot be applied to documents which are being created in the scanning computer by OCR.
Optical character recognition systems, as illustrated schematically in FIG. 1, generally include a digitizing scanner, 16, and associated "scanning" computer, 18, for scanning a printed page, 14, which was generated by an originating computer, 12, and output by a printer, 13. The scanner, 16, extracts the text to be saved, as electronic document, 15, in a standard electronic format, such as ASCII. What is desirable is to additionally incorporate information about the text and layout for error detection and correction, which information can be optically scanned or otherwise automatically input.
Due to the inherent limitations in both the scanning process and the ability of an optical character recognition system to effect accurate character recognition, errors are introduced into the output, including not only character misinterpretation errors but also layout-dependent errors. The typical character misinterpretation errors which occur in the OCR reproduction of documents include the following: substitution errors, wherein erroneously-identified characters are substituted for the actual printed characters (e.g., "h" for "b", wherein "the bat" becomes "the hat"); deletion errors, wherein characters or spaces are erroneously omitted from the scanned region (e.g., "the bat" becomes "that"); and insertion errors, wherein characters or spaces are erroneously inserted into the reproduced region (e.g., "the bat" becomes "t, he b at"). In addition, a common error can, in fact, be a combination of these basic error types (e.g., reading "rn" for "m" involves a substitution and an insertion, while reading "H" for "fl" involves a substitution and a deletion). Furthermore, entire lines of text can be inserted or deleted in the course of OCR scanning and reproduction. Traditional error detection/correction schemes generally operate to detect and correct substitution errors but are ineffectual at detecting and correcting deletion and insertion errors of the kind encountered in OCR, as further discussed herein.
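The three basic error types described above can be illustrated with a minimal sketch, which is not part of any system described herein. It assumes a standard character-level edit-distance alignment between the printed text and the OCR output; the function name `classify_errors` is hypothetical and chosen only for illustration.

```python
def classify_errors(printed: str, scanned: str) -> dict:
    """Count the substitutions, deletions and insertions in one minimal
    edit script that turns the printed text into the scanned text."""
    m, n = len(printed), len(scanned)
    # dist[i][j] = minimal number of edits turning printed[:i] into scanned[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if printed[i - 1] == scanned[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the end of both strings, tallying each edit on the path.
    counts = {"substitution": 0, "deletion": 0, "insertion": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if printed[i - 1] == scanned[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitution"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


# The examples from the text above.
print(classify_errors("the bat", "the hat"))
print(classify_errors("the bat", "that"))
print(classify_errors("the bat", "t, he b at"))
print(classify_errors("m", "rn"))
print(classify_errors("fl", "H"))
```

Such an alignment requires the printed text to be known to the scanning computer, which is precisely what is not available in ordinary OCR; the sketch serves only to make the error taxonomy concrete.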
Post-processing, specifically error detection and correction, must then be performed, primarily by human proofreading of the reproduced document. Errors in layout are ordinarily not automatically rectifiable by the computer but, rather, require extensive, user-intensive editing or possibly re-creation of the document. The human post-processing is expensive not only in terms of actual cost but also in the time needed to complete the processed document. Optimally, solutions will provide not only a means for detecting character substitution errors but also a means for detecting and correcting all of the character and line misinterpretation errors. Further, an ideal solution should additionally facilitate identification of the document itself and communicate the appropriate layout structure for the document.
Error detection/correction systems which have been employed in the computer document creation technology (e.g., word processing) include techniques based on dictionary lookup and/or attempts to use semantic, or context, information extracted from the document in order to identify and correct errors. Many of these systems require that entries in the document which do not correlate to an entry in the lexicon will be reviewed by a "human post-processor". The automated error correction version of a dictionary-based system will, upon identification, spontaneously correct entries which do not correlate to dictionary entries. One can readily envision a circumstance wherein automatic spelling correction is not desirable, such as in the case of a proper name, an intentional misspelling or a newly coined term. The presumption in the use of dictionary-comparison versions of such systems is that each entry in the entire document be compared to a data-base dictionary of terms. The cost of comparison of each entry of a document to a given lexicon is quite high. Streamlined error detection and location, without the need for entry-by-entry comparison, is desirable.
The use of semantic information extracted from the document is further proposed in the art in order to facilitate the identification and automatic correction of errors that have been detected but which cannot be readily identified as misspellings of available dictionary terms or which "resemble" more than one available dictionary entry. Such a system will recognize and correct the term "ofthe" to "of the" when a dictionary lookup would simply reject the term or miscorrect it. Similarly, a bank of commonly-occurring errors for the hardware or software being used, and for the font or fonts being scanned, has been proposed for use with the context, or semantic, information in order to identify and automatically correct common errors, such as "rn" being incorrectly identified as "m", or the letter "O" being incorrectly identified as the number "0".
To detect errors without requiring an entry-by-entry lookup, particularly for documents which are transmitted over extended networks, systems have made use of parity bits transmitted with the data. Once the transmission has been effected, a bit count is done on the "new" document. If the calculated bit matches the transmitted parity bit, then an error-free transmission is assumed. Such systems, and extensions of the parity and check bit concept, as taught in U.S. Pat. No. 5,068,854 of Chandran, et al., are useful for detecting errors in digitally encoded information. Further extensions of the parity bit concept, such as balanced weight error correcting codes, to detect and provide correction of more than a one-bit error are also found in the art, such as in U.S. Pat. No. 4,965,883 of Kirby. Parity and check bit systems developed for use with binary coded information are capable of ascertaining the presence of errors with reasonable accuracy given the low probability of the error bit of an erroneously-received quantity of data matching the check bit of the transmitted material. Since the bits are calculated on binary-encoded data, they are most effective for detecting one-bit errors, except as modified in the weighted balancing and random checking instances. Generally speaking, however, the check and parity bit systems tend to be data-independent methods for assuring error-free transmission of computer-to-computer transfers. The check and parity bit systems are not, therefore, considered thorough checking systems but merely first screening techniques which are intended for digital-to-digital communications and not obviously applicable to analog-to-digital conversions such as optical character recognition.
A further prior art system, providing a 16-bit check sequence which is data-dependent and calculated on the contents of the data field, is found in U.S. Pat. No. 4,964,127 of Calvignac, et al. Once again, the system is applied to data which is transmitted along a data path, presumably in digital format.
In the field of optical character recognition (OCR), there is a similar need to provide the means for detecting and correcting errors in data which has been reproduced from optical scanning, bit mapping and computer encoding. Both dictionary lookup and common-error reference have been proposed for use in the OCR context. However, as with the document creation needs of the past, the entry-by-entry checking is inefficient and not guaranteed to produce the correct result. Moreover, in addition to the printed words, the document layout is a critical feature in OCR. The use of current parity bit check systems in an optically-scanned, bit mapped system is only nominally effective for error detection, relatively ineffective for error location and totally ineffective for detection and correction of improper layout.
Apparatus for identifying and correcting "unrecognizable" characters in OCR machines is taught in U.S. Pat. No. 4,974,260 of Rudak. In that system, the characters which are not recognized, in the electronic dictionary lookup operation, are selectively displayed for an operator to effect interpretation and correction. More fully automated OCR error detection and correction is desirable, but not currently available.
U.S. Pat. No. 4,105,997 of McGinn, entitled "Method For Achieving Accurate Optical Character Reading of Printed Text" provides a basic error detection scheme for checking the accuracy of text reproduced using optical character recognition. The McGinn system calculates a check-sum value for each line of data using ASCII text, and prints the check-sum symbol or symbols at the end of each printed line of text in the document. Upon OCR scanning of the printed line, the printed check-sum symbol is also scanned and ". . . processed in a routine manner to produce an ASCII code serial bit stream . . ." Upon reproduction of the printed line, a check-sum value for the reproduced line of text is calculated and compared to the scanned symbol. If the two check-sums do not match, the existence of an error is assumed, the line is rescanned, and the process is repeated until a match is found, if ever. No intra-line error location can be realized by the McGinn system, nor can actual correction of a detected error be conducted short of rescanning and reproducing the line, if even then.
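A per-line check-sum scheme of the general kind described above can be sketched as follows. The sketch is illustrative only: the actual check-sum function of the McGinn system is not given here, so a sum of ASCII codes modulo 256 is assumed, and the names `line_checksum` and `verify_lines` are hypothetical.

```python
def line_checksum(line: str) -> int:
    """Assumed check-sum: sum of the ASCII codes of the line, modulo 256."""
    return sum(line.encode("ascii")) % 256


def verify_lines(scanned_lines, printed_checksums):
    """Return the indices of lines whose recomputed check-sum differs
    from the check-sum that was printed with the line."""
    return [
        index
        for index, (line, expected) in enumerate(zip(scanned_lines, printed_checksums))
        if line_checksum(line) != expected
    ]


# Check-sums computed by the originating computer before printing.
original_lines = ["the bat", "sat still"]
printed_checksums = [line_checksum(line) for line in original_lines]

# OCR output with one substitution error in the first line.
scanned_lines = ["the hat", "sat still"]
print(verify_lines(scanned_lines, printed_checksums))
```

The result identifies which line failed verification but carries no information about where in the line the error lies or what the correct character is, consistent with the limitation noted above.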
Since the McGinn system encodes the check-sum symbol using ASCII text, the symbol is optically scanned and recognized using the same technology as the standard text. Consequently, error-free location and recognition of the check-sum symbol cannot be guaranteed. The recognition system may not be able to distinguish the symbol from the line text. Moreover, the symbol may be erroneously identified. A difference between the scanned symbol and the calculated check-sum for the reproduced text may, therefore, be indicative of misinterpretation of the check-sum symbol even if accurate reproduction of the scanned text has been achieved. Another class of OCR reproduction errors which cannot be accounted for when using the McGinn system is the omission or insertion of entire text lines. Absent a corresponding scanned check-sum, the McGinn system can neither account for nor correct entire line errors. In effect, therefore, the McGinn system simply confirms the accuracy of text reproduced by OCR, as opposed to improving that accuracy.
It is therefore an objective of the present invention to provide a means and method for automatically incorporating information markers on a paper document, which information is encoded to provide a variety of detail about the document to an associated computer.
It is another objective of the invention to establish the absence or presence of errors on a page reproduced using OCR technology without requiring an entry-by-entry comparison.
It is another objective of the invention to provide an error detection system and method for precisely locating errors on a page reproduced using OCR technology.
It is still another objective of the invention to provide an error detection system which can be used in conjunction with existing error correction systems to precisely locate document errors and compensate for deletion and insertion errors before effecting substitution error correction procedures.
Another objective of the invention is to provide an automatic error correction means and method for documents reproduced using OCR technology.
It is yet another objective of the invention to provide an error detection system which can overlook intentional misspellings, abbreviations, etc.
It is a further objective of the invention to provide an error detection system which can be used with any document format, fonts, and related hardware.
It is yet another objective of the invention to provide a means for providing documents with unique markers which can be used to impart various information to computers.
Still another objective of the present invention is to provide a means and method for supplying documents with computer-readable markers which contain information about the document including document structure, error identification, location and correction information, and document identification and retrieval information.