The following relates to the information storage and processing arts. It finds particular application in conjunction with cataloging of legacy documents in a marked-up format such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), or the like, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
There is interest in the information storage and processing arts in converting document databases to a common structured format that reflects document content, so as to facilitate searching, document categorizing, and so forth. Suitable structured document paradigms include XML, SGML, HTML, and the like.
The content of unstructured documents is sometimes arranged by a table of contents that sets forth a document structure employing chapters, sections, or so forth. Thus, there is interest in developing methods and apparatuses for extracting the table of contents from the document, and using the extracted table of contents as a framework for structuring the document.
Some existing methods and apparatuses for extracting tables of contents from unstructured documents rely upon detecting document headings having distinctive font sizes, boldfacing, or so forth that can be detected and associated with table of contents entries. If the unstructured document is paginated, then table of contents extraction may rely upon each section indexed in the table of contents starting on a new page. However, this page-based approach can be problematic if the paginated document includes header information at the top of each page.
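By way of illustration only, the font-based heading detection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular existing system: the `Line` record, the `detect_headings` and `link_toc_entries` functions, the body font size parameter, and the sample text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of an unstructured document (hypothetical record type)."""
    text: str
    font_size: float
    bold: bool = False


def detect_headings(lines, body_font_size):
    """Return indices of lines that are larger than body text or boldfaced."""
    return [i for i, ln in enumerate(lines)
            if ln.font_size > body_font_size or ln.bold]


def link_toc_entries(toc_entries, lines, body_font_size):
    """Map each table of contents entry to the first detected heading
    whose normalized text is identical, or to None if there is none."""
    headings = detect_headings(lines, body_font_size)
    links = {}
    for entry in toc_entries:
        key = entry.strip().lower()
        links[entry] = next(
            (i for i in headings if lines[i].text.strip().lower() == key),
            None,
        )
    return links


# Assumed sample document: two headings set in 14-point bold, body in 10-point.
lines = [
    Line("Introduction", 14.0, bold=True),
    Line("This manual describes the pump.", 10.0),
    Line("Maintenance", 14.0, bold=True),
    Line("Maintenance is performed monthly.", 10.0),
]
links = link_toc_entries(["Introduction", "Maintenance", "Appendix"],
                         lines, body_font_size=10.0)
print(links)  # {'Introduction': 0, 'Maintenance': 2, 'Appendix': None}
```

In this sketch an entry with no matching heading, such as "Appendix", is left unlinked, which corresponds to a missed table of contents entry.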
The reliability of existing table of contents extraction algorithms can be relatively good, but is less than perfect. Algorithms for identifying a table of contents and associated links to chapter headings, section headings, or so forth can generate incorrect linkages, missed table of contents entries, or so forth. For example, the content of a heading may be repeated in the body of the chapter or section, creating ambiguity as to which portion of content should be linked. Complex documents may include multiple copies of the table of contents, for example one copy in each volume of a multi-volume document. In such cases, there is a possibility that the extraction algorithm may incorrectly cross-link between the table of contents entries. If the source document is an optically scanned document processed by optical character recognition (OCR), then the resulting electronic document may include textual errors that can lead to erroneous linkages.
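The ambiguity described above can be illustrated with a short sketch. It is offered only as an assumption-laden example: the `candidate_links` function, the similarity threshold, and the sample lines are hypothetical, and the similarity measure from the Python standard library `difflib` module merely stands in for whatever text matching a given extraction algorithm uses.

```python
from difflib import SequenceMatcher


def candidate_links(toc_entry, text_lines, threshold=0.8):
    """Return indices of lines whose similarity to the table of contents
    entry is at least the threshold (hypothetical matching rule)."""
    key = toc_entry.strip().lower()
    hits = []
    for i, line in enumerate(text_lines):
        ratio = SequenceMatcher(None, key, line.strip().lower()).ratio()
        if ratio >= threshold:
            hits.append(i)
    return hits


# Assumed sample text from a two-volume scanned document.
text_lines = [
    "Safety Precautions",        # table of contents copy, volume 1
    "Safety Precautions",        # actual section heading
    "Follow all instructions.",  # body text
    "Safety Precautions",        # table of contents copy, volume 2
    "Safety Precautlons",        # OCR error in a repeated heading
]

exact = candidate_links("Safety Precautions", text_lines, threshold=1.0)
fuzzy = candidate_links("Safety Precautions", text_lines, threshold=0.8)
print(exact)  # [0, 1, 3]
print(fuzzy)  # [0, 1, 3, 4]
```

Exact matching yields three candidates for a single entry, and only one of them is the intended heading. Loosening the match to tolerate OCR errors admits a fourth candidate, so tolerance for textual errors and the risk of erroneous linkage increase together.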
Accordingly, there is a continuing need in the art for improved methods and apparatuses for enhancing the robustness of table of contents extraction techniques.