The exemplary embodiment relates to the processing of documents that were once in hardcopy form into a structured format in order to provide access to content within the documents. It finds particular application in connection with the detection of signature marks in documents.
There is considerable interest in the conversion of hardcopy documents, such as books, manuals, and proprietary reference documents, into digital form so that they can be more widely accessible to the public, or to facilitate storage of the documents, reuse or repurposing of parts of the documents, or document uniformity across a database of stored information. Converting an unstructured document to a structured format such as XML entails obtaining meaningful structural information about the unstructured document for use in the structuring. This can be done manually. However, to facilitate automated or semi-automated document conversion, it is advantageous to identify structural features in a document automatically. Generally, hardcopy documents are scanned to provide a set of digital pages. Optical Character Recognition (OCR) processing of the scanned pages allows text and graphical elements of each page to be identified and labeled accordingly. Then, page numbers, titles, and so forth may be appropriately labeled with a markup language such as extensible markup language (XML), standard generalized markup language (SGML), or hypertext markup language (HTML), among others.
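By way of illustration only, the labeling step described above might be sketched in Python as follows. The OCR output format, the element names (`page`, `page-number`, `title`, `paragraph`), and the helper `page_to_xml` are assumptions made for this sketch and do not correspond to any particular OCR engine or document schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output for one scanned page: (label, text) pairs.
# Both the labels and the sample text are invented for illustration.
ocr_elements = [
    ("page-number", "12"),
    ("title", "Chapter II"),
    ("paragraph", "It was a dark night."),
]

def page_to_xml(page_index, elements):
    """Wrap labeled OCR elements of one page in a simple XML structure."""
    page = ET.Element("page", {"index": str(page_index)})
    for label, text in elements:
        child = ET.SubElement(page, label)
        child.text = text
    return page

xml_text = ET.tostring(page_to_xml(1, ocr_elements), encoding="unicode")
print(xml_text)
```

A signature mark that the OCR engine reads as ordinary text would end up in such a structure as a stray paragraph, which is how it comes to interrupt the flow of the converted document.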
One problem with such automated methods is that signature marks can interrupt the flow of the converted document. In the printing domain, signature marks are small textual elements related to imposition, a step which aims at arranging printed pages. On each side of a single sheet (called a forme), several pages are printed, such as from 2 to 32, or more. The way the pages are arranged depends on the folding schema, which specifies how to fold and section the sheet to provide the leaves of the finished book. A folded sheet is called a gathering or a signature. For example, eight pages may be laid down on one side of a single sheet and eight on the other. Due to the intended folding, the pages are not in the same order as they would be in the finished book. For example, page 1 of the set may be positioned next to pages 8 and 16. A book made of sheets folded once, to form two leaves (4 pages), is called a folio; when folded twice (4 leaves, 8 pages), it is called a quarto; when folded three times (8 leaves, 16 pages), an octavo; and so on, up to formats with 64 leaves per sheet.
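The relationship between folds, leaves, and pages can be checked with a short Python sketch. It assumes that each fold doubles the number of leaves and that each leaf carries two pages; the function name `gathering_size` and the lookup table are illustrative only.

```python
# Illustrative sketch: assumes every fold doubles the number of leaves.
FORMAT_NAMES = {1: "folio", 2: "quarto", 3: "octavo"}  # names used in the text

def gathering_size(folds):
    """Return (leaves, pages) for a sheet folded `folds` times."""
    if folds < 1:
        raise ValueError("a gathering needs at least one fold")
    leaves = 2 ** folds
    pages = 2 * leaves  # each leaf has a front and a back
    return leaves, pages

for folds in sorted(FORMAT_NAMES):
    leaves, pages = gathering_size(folds)
    print(f"{FORMAT_NAMES[folds]}: {folds} fold(s), {leaves} leaves, {pages} pages")
```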
A book is composed of several gatherings. Once folding is done, all the gatherings that make up the book are ordered and then bound together. In order to avoid errors in this conventionally manual stage, signature marks are left by the printer on some pages of each gathering to indicate the proper sequence in which to bind the printed sheets. In a simple folding schema, the mark may appear only on the first page of the gathering, although some gatherings may have two (or more) signature marks. In general, however, they are found on only a small proportion of the pages. Signatures commonly run from A-Z, omitting the letters J and U, with letters repeated if the alphabet runs out, e.g., AA-ZZ, AAA-ZZZ, etc. Gatherings are named by the signature mark assigned to them, and leaves can be named by their place within a gathering. Signature marks often occur at regular intervals, but the interval varies from book to book. For example, signature marks could occur every 2, 8, 16, 25, or 32 pages, depending on the sheet size and its folding. Additionally, the first signature mark often does not occur on the first few pages of a book. Accordingly, given a set of scanned pages of a book, it is very difficult to predict on which pages the signature marks will occur.
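The lettering convention described above can be illustrated with a minimal Python sketch. It assumes the 24-letter alphabet stated in the text (A-Z without J and U) and, for the second helper, a perfectly regular book in which every gathering has the same number of pages. The names `signature_marks` and `gathering_start_pages` are hypothetical, and real books often deviate from this regularity.

```python
from itertools import count, islice
import string

# Assumption from the text: A-Z with J and U omitted (24 letters).
SIGNATURE_LETTERS = [c for c in string.ascii_uppercase if c not in "JU"]

def signature_marks():
    """Yield A..Z (no J or U), then AA..ZZ, AAA..ZZZ, and so on."""
    for repeat in count(1):
        for letter in SIGNATURE_LETTERS:
            yield letter * repeat

def gathering_start_pages(first_page, pages_per_gathering, n_gatherings):
    """Pair each mark with the page where its gathering would start.

    Idealized: assumes every gathering has the same length and that the
    first mark sits on `first_page`, which need not be page 1.
    """
    marks = islice(signature_marks(), n_gatherings)
    return [(mark, first_page + i * pages_per_gathering)
            for i, mark in enumerate(marks)]

print(gathering_start_pages(5, 16, 3))
```

Because the offset of the first mark and the gathering length are both unknown for a scanned book, such a schedule cannot simply be assumed in advance.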
As signature marks are small pieces of text that are somewhat isolated from the rest of the text, typically occurring in the bottom margin, OCR engines have difficulty recognizing them correctly. Because they often consist of a single letter or number, they provide little context for the OCR engine. They may be ignored altogether (missed during the zoning step, which identifies the textual zones in a page) or misrecognized. While annotations could be added manually to identify the signature marks in the digital document, this is time-consuming and prone to error.
The exemplary embodiment provides an automated system and method for detecting signature marks in such documents.