The exemplary embodiment relates to the information storage and processing arts. It finds particular application in conjunction with the conversion of documents available in print-ready or image format into a structured format that reflects the logical structure of the document.
Legacy document conversion relates to converting unstructured documents existing in page description language formats such as Adobe's portable document format (PDF), PostScript, PCL-5, PCL-5E, PCL-6, PCL-XL, and the like, into structured documents employing a markup language such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), and the like. Such structure can facilitate storage and access of the document. The particular motivations for converting documents are diverse, typically including intent to reuse or repurpose parts of the documents, desire for document uniformity across a database of stored information, facilitating document searches, and so forth. Technical manuals, user manuals, and other proprietary reference documents are common candidates for such legacy conversions.
In structured documents, content is organized into delineated sections such as document pages with suitable headers/footers and so forth. Such organization typically is implemented using markup tags. In some structured document formats, such as XML, a document type definition (DTD) or similar document portion provides overall information about the document, such as an identification of the sections, and facilitates complex document structures such as nested sections.
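By way of illustration only, the following minimal sketch builds a small structured document with a page, a header and footer, and nested sections, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `page`, `header`, `footer`, `section`, `paragraph`, `number`, `title`) and the sample text are hypothetical choices made for this sketch and are not prescribed by any particular DTD.

```python
import xml.etree.ElementTree as ET

# All tag and attribute names below are illustrative assumptions.
doc = ET.Element("document")

# A page delineated with its own header and footer.
page = ET.SubElement(doc, "page", number="1")
ET.SubElement(page, "header").text = "User Manual"

# Nested sections: a sub-section is simply a child of its parent section.
section = ET.SubElement(page, "section", title="Introduction")
sub = ET.SubElement(section, "section", title="Scope")
ET.SubElement(sub, "paragraph").text = "Body text."

ET.SubElement(page, "footer").text = "1"

# Serialize the tree to markup text.
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

In such a representation the page number is an explicit piece of markup (here the `number` attribute and the footer text) rather than a string that happens to be printed near the page edge.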
Issues arise in reconstructing conventional constructs such as titles, headings, captions, footnotes, and the like. One such issue is the detection of the page numbers of a document. One difficulty with the versatile detection of page numbers resides in the wide variability of their appearance, layout, and numbering scheme within a document, over a collection, and from collection to collection. For example, in the case of appearance, the font type, font size, and font color of page numbers can vary from one document to another and even within the same document. Layout may also vary. In some documents, the page numbers may always appear at the same position on the page, or they may change position on odd/even pages. In other documents, the page numbers may have a different place for the first page of each section, or have various different positions in the various parts of the document, for example, a different position in the preface from that of the table of contents, the body, or the annexes. Sometimes the position of the page number is different for each chapter. In addition, in the case of scanned documents, the position may vary due to translation or skew between scanned pages.
Numbering schemes can also vary. Conventional numbering schemes generally employ Arabic numerals, Roman numerals, or letters. However, there are also page numbering schemes of the form N/M, where N is the page number and M is the total number of pages, or where N is the section number and M the sub-section or page number within the section. There are also composite page numbers of the form TOC-N or INTRO-N, where N is the page number, counted from the beginning of the document or of the section. There are also schemes in which the numbering is representative of the structure of the document, for example, some of the digits represent the section and others the page. In one scheme of this type, the section number occupies 5 digits followed by the page number in the section. However, even in such highly structured schemes, a different convention may be found in some sections. It is quite common to have multiple numbering schemes in use in the same document. For example, the front matter may be partially numbered with Roman numerals, the body numbered in a different manner, and any annex in yet another manner.
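The variety of schemes described above can be made concrete with a minimal sketch that classifies a candidate page label by its textual form. It is an illustration under stated assumptions, not a description of any particular detection method: the function names `classify_page_label` and `roman_to_int`, the scheme labels, and the regular expressions are all hypothetical, and only the Arabic, Roman, single-letter, N/M, and PREFIX-N forms are covered.

```python
import re
from typing import Optional, Tuple

# Strict Roman numerals up to 3999; the lookahead rejects the empty string.
_ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> int:
    """Convert a well-formed Roman numeral to an integer."""
    text = text.upper()
    total = 0
    for i, ch in enumerate(text):
        value = _ROMAN_VALUES[ch]
        # A smaller value placed before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(text) and _ROMAN_VALUES[text[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def classify_page_label(label: str) -> Optional[Tuple[str, int]]:
    """Return (scheme, N) for a candidate label, or None if no form matches."""
    text = label.strip()
    if re.fullmatch(r"\d+", text):
        return ("arabic", int(text))
    match = re.fullmatch(r"(\d+)\s*/\s*(\d+)", text)
    if match:
        # Ambiguous by form alone: N may be a page number or a section number.
        return ("n_of_m", int(match.group(1)))
    match = re.fullmatch(r"([A-Za-z]+)-(\d+)", text)
    if match:
        return ("composite:" + match.group(1).upper(), int(match.group(2)))
    if _ROMAN.match(text):
        # Assumption: letters such as "i", "v", or "c" are read as Roman numerals.
        return ("roman", roman_to_int(text))
    if re.fullmatch(r"[A-Za-z]", text):
        return ("letter", ord(text.upper()) - ord("A") + 1)
    return None


for sample in ["12", "xiv", "3/10", "TOC-4", "b", "hello"]:
    print(sample, classify_page_label(sample))
```

Even this small sketch exposes the ambiguities noted above: a label such as "c" fits both the Roman and the letter scheme, and a label such as "3/10" does not reveal by itself whether the first number counts pages or sections, so the form of a single label is generally not enough to decide which scheme is in use.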
Conventional approaches for applying page numbers work at the page level. For pages missing a number, a human operator validates the output.