The following relates to the information storage and processing arts. It finds particular application in conjunction with converting and cataloging legacy documents in a marked-up format such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), or the like, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
Legacy document conversion relates to converting unstructured documents existing in formats such as Adobe portable document format (PDF), various text formats, various word processing formats, and so forth into structured documents employing a markup language such as XML, SGML, HTML, or so forth. In structured documents, content is organized into delineated structural nodes each containing a section of text, figures, tables, or so forth. The lowest level, or leaves, of the structure typically corresponds to sentences, text blocks, or the like, while higher levels delineate nested, tree-like, or otherwise-organized groupings of nodes. Document structure typically is implemented using markup tags interspersed through the document. In some structured document formats such as XML, a document type definition (DTD) or similar dedicated document portion provides structural information about the document.
There is interest in converting unstructured documents to a structured format. The motivations for converting documents are diverse, including for example: intent to reuse or repurpose parts of the documents; desire for document uniformity across a database or information store; facilitating document searches; and so forth. Initial conversion usually involves breaking the unstructured document into text fragments or other low-level structures, for example delineated by sentences, physical text lines, or other natural breaks in the document. Such a conversion produces an XML or other “structured” document which, however, does not include logical structure associated with the semantic content of the document.
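By way of illustration only, the initial conversion described above can be sketched as follows. This is a minimal sketch, assuming fragments are delineated by physical text lines; the element names `document` and `fragment`, the `line` attribute, and the function name `lines_to_flat_xml` are hypothetical and are not taken from any particular converter.

```python
import xml.etree.ElementTree as ET


def lines_to_flat_xml(text: str) -> ET.Element:
    """Wrap each non-empty physical text line in a <fragment> element.

    The result is well-formed XML, but it is "flat": every fragment is a
    direct child of the root, so no logical structure is captured.
    """
    root = ET.Element("document")  # hypothetical root element name
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines act only as separators
        fragment = ET.SubElement(root, "fragment", {"line": str(number)})
        fragment.text = stripped
    return root


# Illustrative input: table of contents text followed by body text.
sample = (
    "Contents\n"
    "1 Introduction\n"
    "1.1 Scope\n"
    "\n"
    "1 Introduction\n"
    "This manual describes the product.\n"
)
flat = lines_to_flat_xml(sample)
print(ET.tostring(flat, encoding="unicode"))
```

In such an output the table of contents entries and the section headings are indistinguishable from ordinary body text, which is why a further step is needed to introduce logical structure.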
One way to introduce a logical structure into the converted document is to make use of the table of contents, if one is available. Unstructured documents often contain the text of a table of contents which provides a natural logical organization or framework for the content of the converted document. In “Method and Apparatus for Detecting a Table of Contents and Reference determination” (Xerox ID 20040274-US-NP, Ser. No. 11/032,814 filed Jan. 10, 2005), which is herein incorporated by reference, some suitable techniques are disclosed for extracting a table of contents from a converted document. The document with the extracted table of contents information is organized as a plurality of nodes with corresponding entries of a table of contents. However, the extracted table of contents is “flat”; that is, if the table of contents includes hierarchal levels, such as chapters, sub-chapter sections, or so forth, this hierarchy is not extracted.
Techniques have been developed to reconstruct the table of contents hierarchy, using for example the ordinal numbers associated with the table of contents or section headings, or using other a priori knowledge of the expected hierarchal structure of the table of contents. These techniques are difficult to generalize to a generic hierarchal table of contents that may not include the requisite ordinal numbers or other a priori-known information. Other techniques reconstruct hierarchy based on the physical layout of the document, such as heading fonts, capitalization, or indentation levels. For example, capitalized table of contents entries are likely to be higher up in the hierarchy than entries written in lowercase. Again, these techniques may fail where the table of contents does not employ the requisite formatting. Moreover, the ordinal numbering, physical layout information, and so forth relied upon by these techniques for hierarchal reconstruction are sometimes lost or corrupted in the conversion from the unstructured document to the converted XML or other tagged document.
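By way of illustration only, an ordinal-number approach of the kind mentioned above, and its limitation, can be sketched as follows. This is a minimal sketch, assuming dotted decimal numbering such as “1.2.3” at the start of each entry; the pattern `ORDINAL`, the function name `level_from_ordinal`, and the sample entries are hypothetical and do not reproduce any specific prior technique.

```python
import re
from typing import Optional

# Assumed numbering style: digits separated by dots, optionally followed by
# a trailing dot, then whitespace and the entry text (e.g. "1.2 Scope").
ORDINAL = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def level_from_ordinal(entry: str) -> Optional[int]:
    """Return the hierarchy level implied by a leading ordinal, or None.

    "1" implies level 1, "1.1" implies level 2, and so on. Entries without
    a recognizable ordinal yield None, since the level cannot be inferred.
    """
    match = ORDINAL.match(entry)
    if match is None:
        return None
    return match.group(1).count(".") + 1


entries = [
    "1 Introduction",
    "1.1 Scope",
    "1.1.1 Definitions",
    "2 Installation",
    "Appendix: Glossary",  # no ordinal: the heuristic has nothing to use
]
levels = [(entry, level_from_ordinal(entry)) for entry in entries]
for entry, level in levels:
    print(level, entry)
```

The last sample entry carries no ordinal number, so no level can be assigned to it. The same outcome occurs for every entry when the numbering is absent from the table of contents or is lost during conversion.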