The following relates to the information storage and processing arts. It finds particular application in conjunction with cataloging of legacy documents in a marked-up format such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), or the like, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
There is interest in the information storage and processing arts in converting document databases to a common format that is structured according to document content, so as to facilitate searching, document categorization, and so forth. Suitable structured document paradigms include XML, SGML, HTML, and the like. The content of unstructured documents is sometimes organized by a table of contents that identifies chapters, sections, or the like. Thus, there is interest in developing methods and apparatuses for extracting the table of contents from the document, and for using the extracted table of contents as a framework for structuring the document.
Existing techniques for extracting a table of contents typically involve extracting an ordered sequence of text fragments from the document, and looking for pairs of text fragments that are similar with respect to font size, font style, textual content, or so forth. If the position of the table of contents within the document is unknown, this type of processing can lead to N×(N−1)/2 text fragment pair comparisons for a document having N text fragments. Such O(N²) type computations can become prohibitively costly for large documents, e.g., a document including 20,000 to 60,000 text fragments involves approximately 200 million to 1.8 billion distinct text fragment pair comparisons (N² itself being 400 million to 3.6 billion).
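The growth of the pairwise search can be checked numerically. The following is a minimal sketch only; the function names are hypothetical and are not part of any described technique. It prints both the distinct-pair count N×(N−1)/2 and the N² figure, which differ by about a factor of two.

```python
from itertools import combinations


def unordered_pair_count(n: int) -> int:
    """Number of distinct fragment pairs among n fragments: n*(n-1)/2."""
    return n * (n - 1) // 2


def brute_force_pairs(fragments):
    """Enumerate every distinct pair, as an exhaustive search would (illustrative)."""
    return combinations(fragments, 2)


if __name__ == "__main__":
    for n in (20_000, 60_000):
        # Columns: N, distinct pairs N*(N-1)/2, and N squared.
        print(n, unordered_pair_count(n), n * n)
```

For N = 20,000 this gives 199,990,000 distinct pairs (N² = 400,000,000), and for N = 60,000 it gives 1,799,970,000 distinct pairs (N² = 3,600,000,000).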
On the other hand, if the position of the table of contents is known a priori such that the document can be divided into T table of contents text fragments and N body text fragments, then the number of text fragment pair comparisons is reduced to N×T. For the example document of between 20,000 and 60,000 text fragments indexed by a table of contents containing between 100 and 300 indexing text fragments, between 2 million and 18 million text fragment pair comparisons are involved. This large number of pair comparisons, while reduced compared with the O(N²) type computation, can still be problematic.
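The N×T figures can be reproduced with a short calculation. This is a minimal sketch under the assumptions of the example above (N taken as 20,000 or 60,000 fragments, T as 100 or 300 table of contents fragments); the function names are hypothetical.

```python
def toc_body_pair_count(n_body: int, t_toc: int) -> int:
    """Comparisons when each of t_toc table of contents fragments is
    compared against each of n_body body fragments: N*T."""
    return n_body * t_toc


def exhaustive_pair_count(n: int) -> int:
    """Distinct pairs when the table of contents position is unknown: N*(N-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n, t in ((20_000, 100), (60_000, 300)):
        known = toc_body_pair_count(n, t)
        unknown = exhaustive_pair_count(n)
        # Ratio shows how much work knowing the table of contents position saves.
        print(n, t, known, round(unknown / known, 1))
```

Under these assumptions the counts are 2,000,000 and 18,000,000, roughly a hundredfold fewer than the exhaustive pairwise search, yet still in the millions.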
Accordingly, there is a continuing need in the art for improved techniques for table of contents extraction.