OCR systems may be used to transform images of paper documents into a computer-readable, computer-editable, and searchable form. OCR systems may also be used to extract data from such images. A typical OCR system consists of an imaging device that produces an image of a document and software, running on a computer, that processes the image. As a rule, this software includes an OCR program that recognizes symbols, letters, characters, digits, and other units and, where they are arranged next to each other, combines them into words, which may then be checked against a dictionary. Traditional OCR systems output plain text, which typically has simplified layout and formatting, retaining only paragraphs, fonts, font styles, font sizes, and some other simple properties of the source document.
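The word-assembly step described above can be illustrated with a minimal sketch. It assumes that character recognition has already produced glyphs with horizontal extents on a single text line. The names `Glyph`, `group_into_words`, `check_words`, and the `max_gap` threshold are hypothetical and are not taken from any particular OCR product.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A recognized character with its horizontal extent (hypothetical type)."""
    char: str
    x_left: float
    x_right: float


def group_into_words(glyphs, max_gap=3.0):
    """Join glyphs into words; a gap wider than max_gap starts a new word.

    max_gap is an assumed threshold in the same units as the coordinates.
    """
    words, current, prev_right = [], [], None
    for g in sorted(glyphs, key=lambda g: g.x_left):
        if prev_right is not None and g.x_left - prev_right > max_gap:
            words.append("".join(current))
            current = []
        current.append(g.char)
        prev_right = g.x_right
    if current:
        words.append("".join(current))
    return words


def check_words(words, dictionary):
    """Pair each word with a flag telling whether the dictionary contains it."""
    return [(w, w.lower() in dictionary) for w in words]


# Illustrative input: two groups of adjacent glyphs separated by a wide gap.
glyphs = [
    Glyph("c", 0, 5), Glyph("a", 6, 11), Glyph("t", 12, 17),
    Glyph("s", 30, 35), Glyph("a", 36, 41), Glyph("t", 42, 47),
]
words = group_into_words(glyphs)
checked = check_words(words, {"cat"})
```

A real system would also handle multiple lines, variable spacing, and recognition confidence; the sketch only shows adjacency-based grouping followed by a dictionary lookup.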
However, a document may be regarded not only as text, but as an object with a physical and a logical structure.
The physical structure, or document layout, is in fact what makes text information a document. Physical structure is intended to keep information in an ordered form for clear presentation. It manifests itself as the physical arrangement of form elements such as images, tables, columns, etc. An OCR program may detect the position of form elements in a document and reconstruct them, but it does not understand the purpose or meaning of the form elements. Further, the OCR program does not understand the relations between the various form elements.
The logical structure of the document maps the form elements into one or more logical blocks based on an understanding of the meaning of the form elements and the relations between them. The logical structure is what controls the logical ordering (e.g., viewing and reading order) of the information in a document. The logical structure includes information about the purpose and/or meaning of all form elements and defines the reading order in which the information contained in the document should be perceived. It is tightly linked with the document's physical structure and depends on the relations among the various formatting elements and their reading priorities.
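The distinction between the two structures can be made concrete with a small sketch. It models form elements by position only (physical structure) and logical blocks by role and reading priority (logical structure). The types `FormElement` and `LogicalBlock`, their fields, and the sample page are illustrative assumptions, not a description of any existing OCR system.

```python
from dataclasses import dataclass


@dataclass
class FormElement:
    """Physical structure: what the element is and where it sits on the page."""
    element_id: str
    kind: str      # e.g. "text", "image", "table"
    bbox: tuple    # (left, top, right, bottom), assumed page coordinates


@dataclass
class LogicalBlock:
    """Logical structure: the role of a group of elements and its reading priority."""
    role: str
    element_ids: list
    reading_priority: int


def reading_order(elements, blocks):
    """Return form elements in the order given by the logical blocks."""
    by_id = {e.element_id: e for e in elements}
    ordered = []
    for block in sorted(blocks, key=lambda b: b.reading_priority):
        ordered.extend(by_id[i] for i in block.element_ids)
    return ordered


# Hypothetical page: a title, body text with a figure and caption, and a sidebar.
elements = [
    FormElement("title", "text", (0, 0, 100, 10)),
    FormElement("sidebar", "text", (70, 12, 100, 60)),
    FormElement("body", "text", (0, 12, 68, 90)),
    FormElement("figure", "image", (0, 92, 68, 120)),
    FormElement("caption", "text", (0, 122, 68, 128)),
]
blocks = [
    LogicalBlock("heading", ["title"], 0),
    LogicalBlock("main text", ["body"], 1),
    LogicalBlock("figure with caption", ["figure", "caption"], 2),
    LogicalBlock("sidebar", ["sidebar"], 3),
]

logical_ids = [e.element_id for e in reading_order(elements, blocks)]
# Naive physical order: sort by top edge, then by left edge.
physical_ids = [e.element_id for e in sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))]
```

In this sample the sidebar and the body share the same top edge, so position alone does not say that the sidebar should be read last or that the caption belongs to the figure. That information lives only in the logical blocks.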
From an ordinary human point of view, the logical structure may not stand out as something distinct. In most cases a “human reader” comprehends the logical structure of a document automatically; it is self-evident to the reader and inseparable from the document's physical structure. This kind of perception, however, is not characteristic of computers and, in particular, of OCR and document conversion programs. The logical structure of a document is beyond traditional “machine comprehension” and may become a bottleneck in automated document recognition.