The following relates to the information processing arts. It is described with illustrative reference to extraction of a table of contents, but will be useful in numerous other applications, such as extraction of other organizational tables (for example, a table of figures or a table of tables).
The ability to identify a table of contents in a document has numerous uses. For example, table of contents identification can be used in conversion of an unstructured or shallowly structured document to a structured format stored in XML or another document formalism. In such applications, the identified table of contents provides a logical framework for structuring the document in accordance with chapters, sections, or other elements indexed in the table of contents. Identification of other organizational tables such as a table of figures or a table of tables (i.e., an organizational table listing informational tables of a document) can similarly be used in conversion to a structured format, or can be used to provide pinpoint citations to figures or tables in the document. Identification of a table of contents can also be used to identify major content categories for use in indexing or creation of a knowledge base.
In the process of analyzing an unstructured or shallowly structured document, it is known to convert the document to a text-based format (if it is not already in such a format) using optical character recognition (OCR), and to break the document in the text-based format into text fragments corresponding to sentences, physical lines of text, or other small textual groupings. The organizational table is expected to comprise organizational table entries in the form of a substantially contiguous group of text fragments, each of which is expected to be associable with a target text fragment somewhere in the document that exhibits some similarity with the corresponding organizational table entry.
In some formal approaches, the identification of target text fragments is based on formal considerations. For example, one may expect target text fragments such as chapter headings or section headings to be written in boldface, italics, or another distinctive font style, and/or in a larger font size than the surrounding text, or with a distinctive font effect such as underscoring, or otherwise highlighted using suitably distinctive text formatting. The particular distinctive text formatting used to highlight target text fragments generally differs from document to document. For example, one document may boldface chapter headings, another may use all capital letters with no boldface, and yet another may underscore chapter headings. Moreover, if the document contains more than one type of organizational table, the distinctive text formatting used for target text fragments associated with each organizational table may also differ. For example, the target text fragments for a table of figures typically correspond to the figure captions, which may be highlighted using text formatting that is different from the text formatting used for chapter headings. As an example, the chapter headings may be boldfaced and underscored, while the figure captions may be italicized.
This demonstrates a significant problem with formal approaches, namely that the distinctive text formatting used to highlight target text fragments generally differs between documents for the same type of organizational table, and may differ within a document for different organizational tables.
A textual similarity based approach for identifying target text fragments has been developed, as disclosed for example in Dejean et al., U.S. Publ. Appl. No. 2006/0155703 A1 which published Jul. 13, 2006. In this approach the organizational table is selected as a contiguous sub-sequence satisfying the criteria that organizational table entries each have a link to a target text fragment having textual similarity with the organizational table entry, and in which no target text fragment lies within the organizational table and the target text fragments have an ascending ordering corresponding to an ascending ordering of the organizational table entries in the organizational table. Textual similarity relates to the content similarity of two text fragments, rather than the text formatting. Thus, for example, the text fragments:
Chapter 1—Introduction to Document Analysis
and
1. Introduction to Document Analysis.
have a high degree of textual similarity because both text fragments include the textual content “introduction to document analysis” and differ only in the early portion (“Chapter 1—” in the former case as compared with “1.” in the latter case). However, these two text fragments have substantial text formatting dissimilarity, since the former text fragment is italicized with no special capitalization while the latter fragment is not italicized but is underscored, written in all-caps, and indented relative to the left-hand margin of the page.
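The notion of textual similarity and the selection criteria described above can be illustrated with a minimal sketch. The token-overlap (Jaccard) measure, the function names, and the use of integer fragment indices are assumptions made for illustration only; they are not taken from the cited publication, which may use a different similarity measure.

```python
import re


def tokens(fragment: str) -> set[str]:
    """Lowercase the fragment and keep only alphanumeric word tokens,
    so that capitalization and punctuation do not affect the result."""
    return set(re.findall(r"[a-z0-9]+", fragment.lower()))


def textual_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets (an assumed, illustrative measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def satisfies_criteria(table_start: int, table_end: int, targets: list[int]) -> bool:
    """Check a candidate organizational table against the stated criteria.

    The candidate occupies fragment indices table_start..table_end (inclusive),
    and targets[i] is the index of the fragment linked to the i-th entry.
    """
    # Every entry of the contiguous candidate has a linked target.
    one_link_per_entry = len(targets) == table_end - table_start + 1
    # No target text fragment lies within the table itself.
    none_inside_table = all(not (table_start <= t <= table_end) for t in targets)
    # Targets appear in the same ascending order as the entries.
    ascending = all(x < y for x, y in zip(targets, targets[1:]))
    return one_link_per_entry and none_inside_table and ascending


entry = "Chapter 1—Introduction to Document Analysis"
heading = "1. INTRODUCTION TO DOCUMENT ANALYSIS."  # all-caps, as described above
score = textual_similarity(entry, heading)  # 5 shared tokens out of 6 distinct
```

Because the comparison operates on normalized tokens, the italics, underscoring, capitalization, and indentation of the two fragments play no role in the score.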
The approach of U.S. Publ. Appl. No. 2006/0155703 A1 employs textual similarity analysis, and hence is not affected by the variability of the distinctive text formatting employed in different documents and/or for different organizational table types of the same document. Moreover, the approach of U.S. Publ. Appl. No. 2006/0155703 A1 has been found to provide success rates of around 90% per document (that is, about 90% of the identified target text fragments are correct, and about 90% of the actual target text fragments are found).
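The two figures quoted in the parenthetical correspond to what are commonly called precision and recall. The following sketch shows how they might be computed for one document; the function name, the representation of fragments as integer indices, and the sample data are hypothetical and are not results from the cited publication.

```python
def precision_recall(identified: set[int], actual: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for one document.

    identified: indices of fragments the method marked as targets.
    actual:     indices of the true target fragments.
    """
    correct = identified & actual
    # Fraction of identified target text fragments that are correct.
    precision = len(correct) / len(identified) if identified else 0.0
    # Fraction of actual target text fragments that were found.
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical document: 10 fragments identified, 10 true targets, 9 in common.
identified = set(range(1, 11))
actual = set(range(2, 12))
precision, recall = precision_recall(identified, actual)
```

In this made-up case both values are 0.9, matching the situation described above in which roughly one identified fragment in ten is wrong and roughly one true target in ten is missed.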
However, it is desirable to still further enhance the success rate of organizational table identifications.