The technology associated with information delivery has expanded and improved in concert with that of computer systems. To fully exploit the latter systems, techniques for the practical development of databases have been required. In the relatively recent past, information systems have been developed which provide on-line access to text materials that were themselves generated from computer-based systems. However, to generate computer-accessible databases representing earlier documents or printed works which are not themselves computer generated, technology is required for carrying out a retrospective conversion of print to computer-readable data.
In general, optical character recognition (OCR) systems are employed for processing printed data to convert it to magnetic media based data. These devices scan a given page of print to generate an ASCII-based string of data. Such printed matter or pages and their physical application to scanning devices, however, represent a classic noise environment. In this regard, a smudge may appear about a region of print; pages may be positioned in a skewed orientation with respect to the scanner; and the scanners themselves may develop device-specific fault characteristics or systematic errors. The print or type itself may provoke scanning errors. For example, kerning occurs in font structures to compress letters into adjacency, and ligatures may tend to produce scan errors. Some software-based procedures have been invoked for the purpose of correcting OCR-generated inaccuracies, but they too are limited in effectiveness. For example, dictionary-based spelling check programs as well as grammar check programs have been utilized, but all are insufficient for a variety of reasons: punctuation is not addressed, and the desirability of correction for a variety of different languages poses problems not readily amenable to such approaches.
One approach to correction of the outputs of OCR devices which may be contemplated provides for the merging of the outputs of more than one OCR machine or device. Intuitively, if different OCR systems are fairly accurate and make mistakes randomly in different places in the text being scanned, then there should be some way of combining three OCR outputs to generate a string which is more accurate than any of the three input strings. In general, the merging of only two strings from two OCR devices would yield a trivial result, since either input string by itself minimizes the summed edit distance to the pair, while the merging of the outputs of four or more OCR devices would be prohibitively expensive computationally. String-to-string correction techniques have been proposed and utilized, for example, with applications to molecular biology, computer science, and the like. See in this regard, the following publication:
"Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison" Edited by Sankoff and Kruskal, 1983, Addison-Wesley Publishing Co., Reading, Mass. PA1 The String-to-String Correction Problem" by R. A. Wagner and N. J. Fischer, Journal of the ACM, vol. 21, Jan. 1974, pp 168-173.
A discourse describing the string-to-string correction problem wherein the distance between two strings as measured by the minimum cost sequence of "edit operations" or edit distance needed to change one string to the other is described in the following publication:
With the string merger approach, a common ancestor of three strings is a string that minimizes the sum of the edit distances between itself and the three strings, i.e., E is a common ancestor of the strings A, B, and C if E minimizes D(A,E)+D(B,E)+D(C,E). The edit distance, D, between two strings is the minimum number of character insertions, deletions, and substitutions needed to convert one string into the other. Once the distance is found, the next task is to "back track" through the computation to recover the actual ancestor string. That string may not be unique.
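The pairwise edit distance D referred to above can be sketched as a standard Wagner-Fischer dynamic program; the following is a minimal illustration in Python, assuming unit costs for insertion, deletion, and substitution (the function name edit_distance is illustrative, not drawn from the source):

```python
def edit_distance(s, t):
    """Minimum number of character insertions, deletions, and
    substitutions needed to convert string s into string t."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitute = dp[i - 1][j - 1] + (s[i - 1] != t[j - 1])
            delete = dp[i - 1][j] + 1
            insert = dp[i][j - 1] + 1
            dp[i][j] = min(substitute, delete, insert)
    return dp[m][n]
```

Backtracking from dp[m][n] to dp[0][0] recovers one minimum-cost edit script; as noted above, the result of such a back track need not be unique.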
A somewhat standard algorithm to compute just the distance requires on the order of n^3 operations and on the order of n^2 bytes of storage, where n is the maximum length of the three strings. Computing an actual common ancestor by that approach requires storage on the order of n^3 bytes. For a typical page, n will have a value between 2,000 and 3,000, and the computational burden imposed upon any computer would involve billions of operations and gigabytes of storage, an unacceptable condition.
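The order-n^3 behavior of such a distance-only computation can be seen in a direct three-dimensional dynamic program; the following is a sketch under the assumption of unit edit costs (the function name ancestor_distance is illustrative). Each cell considers either a character of one input that the ancestor omits, or a character the ancestor emits aligned with a nonempty subset of the three inputs:

```python
from itertools import product

def ancestor_distance(a, b, c):
    """Minimum of D(a,e) + D(b,e) + D(c,e) over all candidate
    ancestor strings e, with D the unit-cost edit distance.
    Runs in O(n^3) time over an (n+1)^3 table."""
    la, lb, lc = len(a), len(b), len(c)
    INF = float("inf")
    dp = [[[INF] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    dp[0][0][0] = 0
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                cur = dp[i][j][k]
                if cur == INF:
                    continue
                # A character of one input absent from the ancestor:
                # one deletion charged to that string's edit distance.
                if i < la: dp[i + 1][j][k] = min(dp[i + 1][j][k], cur + 1)
                if j < lb: dp[i][j + 1][k] = min(dp[i][j + 1][k], cur + 1)
                if k < lc: dp[i][j][k + 1] = min(dp[i][j][k + 1], cur + 1)
                # The ancestor emits a character aligned with a nonempty
                # subset of the inputs: strings outside the subset pay an
                # insertion; strings inside pay 0 (match) or 1 (substitution).
                for sa, sb, sc in product((0, 1), repeat=3):
                    if not (sa or sb or sc):
                        continue
                    if (sa and i == la) or (sb and j == lb) or (sc and k == lc):
                        continue
                    chars = []
                    if sa: chars.append(a[i])
                    if sb: chars.append(b[j])
                    if sc: chars.append(c[k])
                    # Choose the emitted character to match as many
                    # subset members as possible (a majority vote).
                    best = min(sum(ch != e for ch in chars) for e in set(chars))
                    cost = best + (3 - (sa + sb + sc))
                    ni, nj, nk = i + sa, j + sb, k + sc
                    dp[ni][nj][nk] = min(dp[ni][nj][nk], cur + cost)
    return dp[la][lb][lc]
```

For a page of n of about 2,500 characters, the table alone would hold roughly 1.56x10^10 cells, consistent with the billions-of-operations, gigabytes-of-storage figures given above.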