The following relates to the information storage and processing arts. It finds application in conjunction with electronic document format conversion and in particular with cataloging of legacy documents in a marked-up format such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), or the like, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
Legacy document conversion relates to converting unstructured documents existing in formats such as ADOBE® portable document format (PDF), various text formats, various word processing formats, and the like into structured documents employing a markup language such as XML, SGML, HTML, and the like. In structured documents, content is organized into delineated sections such as document pages with suitable headers/footers and so forth. As another example, a textual header of a table spanning multiple pages constitutes a pagination construct when the header repeats on each of those pages. Such organization typically is implemented using markup tags. In some structured document formats such as XML, a document type definition (DTD) or similar document portion provides overall information about the document, such as an identification of the sections, and facilitates complex document structures such as nested sections.
There is thus interest in converting unstructured documents to a structured format when such structure can facilitate storage and access of the document as a “legacy document”. The particular motivations for converting documents are diverse, typically including an intent to reuse or repurpose parts of the documents, a desire for document uniformity across a database or information store, facilitation of document searches, and so forth. Technical manuals, user manuals, and other proprietary reference documents are common candidates for such legacy conversions.
A particularly problematic issue that arises during the conversion process is the rebuilding or preserving of structural information. The output structure can be very different from the input structure. For example, page segmentation is often discarded in a logical representation, where the logical units are elements such as chapters and sections. Pages are usually considered physical elements and do not appear in the logical representation. Content elements that are related to this page segmentation and present in the input document, typically headers and footers, therefore need to be processed cautiously. In prior art converters, such as PDF2[XML/HTML], headers and footers are not differentiated from the body elements and can disrupt the flow of the main text. This not only generates an incorrect logical document, but can also introduce noise for further processing, such as natural language processing. Accordingly, existing methods and systems for identifying and extracting pagination constructs in the conversion of structured legacy documents are neither efficient nor robust. Of particular note is Xiaofan Lin, “HEADER AND FOOTER EXTRACTION BY PAGE-ASSOCIATION”, HP® Laboratories Palo Alto, May 6, 2002, 9 Pages. This reference relies upon comparison with neighboring pages to identify a particular relationship indicative of commonly configured headers/footers. Such neighboring page comparison techniques can fail when the header/footer occurs very few times in the document.
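By way of illustration only, the limitation of neighboring page comparison can be sketched as follows. The sketch is not the algorithm of the cited reference. It is a deliberately simplified stand-in that assumes each page is available as a list of text lines, that only the first non-blank line of a page is a header candidate, and that digits are masked so that running page numbers do not prevent a match. The function names first_line_key and neighbor_page_headers and the sample pages are hypothetical.

```python
import re


def first_line_key(page):
    """Return a normalized key for the first non-blank line of a page, or None."""
    for line in page:
        text = line.strip()
        if text:
            # Replace digit runs so that "Page 3" and "Page 4" compare equal.
            return re.sub(r"\d+", "#", text.lower())
    return None


def neighbor_page_headers(pages):
    """Flag each page whose first line matches the first line of an adjacent page."""
    keys = [first_line_key(page) for page in pages]
    flags = []
    for i, key in enumerate(keys):
        # Only the immediately preceding and following pages are consulted.
        neighbors = [keys[j] for j in (i - 1, i + 1) if 0 <= j < len(keys)]
        flags.append(key is not None and key in neighbors)
    return flags


# Hypothetical five-page document (assumed sample data).
pages = [
    ["User Manual - Page 1", "Introduction ..."],
    ["User Manual - Page 2", "Installation ..."],
    ["Appendix A", "Part numbers ..."],
    ["Figure of the assembly", "..."],
    ["Appendix A", "Part numbers, continued ..."],
]
print(neighbor_page_headers(pages))
```

In this assumed input, the running title on the first two pages is flagged, whereas “Appendix A”, which occurs only twice and never on adjacent pages, is not flagged. This is one simple way in which a purely neighbor-based comparison can miss a header/footer that occurs very few times in the document.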
For purposes of this application, “header” is intended to comprise matter, i.e., textual content, that is printed at the top of every page of the document, typically positioned in the top margin of the page. For example, a title, page number, file name, revision dates, the author's name, or any other information about the document that is repeated throughout the document or a portion of the document is considered header matter. Likewise, a “footer” includes similar information content positioned in the bottom margin of the page. As used in the subject application, “header/footer” should be construed to include either a header or a footer individually, or both in combination.
This disclosure provides a light and robust method and system for detecting pagination constructs, such as headers and footers, of a document.