The following relates to the information processing arts, information retrieval arts, archiving arts, document structuring arts, and related arts.
There is ongoing interest in converting legacy documents to standardized formats to facilitate archiving, document searching and information retrieval, and other document storage and processing applications. A legacy document is a document in a format other than the standardized format that is used for the archiving or other standardized document processing. Legacy documents may have various formats such as word processing formats, spreadsheet formats, rich text formats, plain text or ASCII formats, portable document format (PDF), and so forth. A single standardized format is usually selected, although the choice of standardized format can vary. The extensible markup language (XML) format is commonly used as the standardized format, although other formats such as standard generalized markup language (SGML) or hypertext markup language (HTML) are also suitable.
The standardized format provides a structured paradigm for formatting the documents. XML, for example, defines tags delineated by angle brackets of the form “< . . . >”, possibly with a closing tag of the form “</ . . . >” so that the pair of tags defines a container. The precise meaning of the various tags can vary, and is defined for a particular XML document by a document type definition (DTD) section or other embedded or referenced document formatting definition. Because of the advantages provided by a structured paradigm, conversion of a document in an unstructured format to a structured format such as XML can also be desirable in contexts other than archiving or retrieval of legacy documents.
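By way of illustration only, the container structure described above can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. The element names "chapter", "title", and "para", the attribute "number", and the text content are hypothetical and are chosen solely for this example; in practice the tag vocabulary would be set by the DTD or other formatting definition of the particular document.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: a real document's vocabulary comes from its DTD
# or other embedded or referenced formatting definition.
chapter = ET.Element("chapter", {"number": "1"})

# Each SubElement becomes an opening/closing tag pair nested in <chapter>.
title = ET.SubElement(chapter, "title")
title.text = "Introduction"
para = ET.SubElement(chapter, "para")
para.text = "Legacy text goes here."

# Serialize the container to a string of angle-bracket tags.
xml_text = ET.tostring(chapter, encoding="unicode")
print(xml_text)
```

The opening tag and its matching closing tag delimit a container, and the nesting of the title and paragraph inside the chapter is the kind of structural information that an unstructured legacy document does not carry explicitly.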
Converting an unstructured document to a structured document such as XML entails obtaining meaningful structural information about the unstructured document for use in the structuring. This can be done manually. However, to facilitate automated or semi-automated document conversion, it is advantageous to automatically identify structural features in a document.
Accordingly, substantial work has been directed toward the identification and delineation of structural features in unstructured documents. The following U.S. patent publications include the present inventor as inventor or co-inventor: U.S. 2009/0110268 A1 titled “Table of Contents Extraction Based on Textual Similarity and Formal Aspects”; U.S. 2009/0046918 A1 titled “Systems and Methods for Notes Detection”; U.S. 2008/0114757 A1 titled “Versatile Page Number Detector”; U.S. 2008/0077847 A1 titled “Captions Detector”; U.S. 2008/0065671 titled “Methods and Apparatuses for Detecting and Labeling Organizational Tables in a Document”; U.S. 2007/0196015 A1 titled “Table of Contents Extraction with Improved Robustness”; U.S. 2006/0248070 A1 titled “Structuring Documents Based on Table of Contents”; U.S. 2006/0156226 A1 titled “Method and Apparatus for Detecting Pagination Constructs including a Header and a Footer in Legacy Documents”; and U.S. 2006/0155703 A1 titled “Method and Apparatus for Detecting a Table of Contents and Reference Determination”; all of which are incorporated herein by reference in their entireties. Work in this area goes back still further in time, as evidenced for example by Slocombe, U.S. 2004/0006742 A1, which is incorporated herein by reference in its entirety.
Numbered sequences present a rich and diverse source of structural information. Examples of numbered sequences in documents include: numbered chapters or other numbered document sections; numbered pages; numbered figures; numbered tables; numbered equations; and so forth. Automatic detection of numbered sequences encounters complexities such as nesting (for example, section numbers may be nested within a numbered chapter), interspersing (for example, numbered figure captions may be interspersed amongst numbered equations), and so forth. Another complication is the wide diversity of numbering schemes. To provide just a few examples: “1, 2, 3, . . . ”; “1.1, 1.2, 1.3, . . . ”; “A., B., C., . . . ”; “(i), (ii), (iii), (iv), (v), . . . ”; “I-, II-, III-, IV-, V-, . . . ”; and so forth. Existing techniques for automatic detection of numbered sequences typically make limiting assumptions about the layout of numbered items in the document (for example, assuming nesting) and about the numbering scheme. Some existing techniques also make limiting assumptions about the distance between numbered items in a numbered sequence, for example ignoring a potential “next numbered item” if it is too far away in the document from a “current numbered item”. These assumptions restrict the range of numbered sequences that can be detected by these techniques. For example, placing a distance limit on the detection can prevent detection of numbered chapter headings, which may be spaced apart by large distances in the document. Using an assumed numbering scheme formalism eliminates the ability to detect numbered sequences that employ a numbering scheme that does not comport with the assumed formalism. An assumption of nesting of numbered sequences can result in failure to detect interspersed numbered sequences.
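The effect of two of these limiting assumptions can be illustrated with a deliberately naive sketch in Python. The function detect_sequence, its max_gap parameter, and the sample lines are hypothetical constructs written for this illustration; they do not reproduce any particular existing technique. The sketch assumes that numbered items begin a line with an Arabic numeral, and optionally ignores a candidate item that lies more than max_gap lines after the previous item.

```python
import re

def detect_sequence(lines, max_gap=None):
    """Naive detector: assumes Arabic numerals at line start (hypothetical).

    Returns a list of (line_index, number) pairs for items 1, 2, 3, ...
    If max_gap is set, a candidate farther than max_gap lines from the
    previously accepted item is ignored.
    """
    pattern = re.compile(r"^(\d+)[.)]?\s")  # assumed numbering formalism
    found = []
    expected = 1
    last_index = None
    for i, line in enumerate(lines):
        m = pattern.match(line)
        if not m or int(m.group(1)) != expected:
            continue  # not the next item under the assumed scheme
        if max_gap is not None and last_index is not None and i - last_index > max_gap:
            continue  # distance assumption: candidate is "too far away"
        found.append((i, expected))
        last_index = i
        expected += 1
    return found

# Chapter headings separated by long runs of body text.
body = ["body text"] * 50
doc = ["1 Introduction"] + body + ["2 Methods"] + body + ["3 Results"]

print(detect_sequence(doc))              # all three headings are found
print(detect_sequence(doc, max_gap=10))  # only the first heading is found
print(detect_sequence(["A. First", "B. Second"]))  # scheme not recognized
```

With no distance limit the three widely spaced chapter headings are detected, whereas the distance limit causes the second heading to be ignored and the sequence to stall after the first item. The lettered sequence is missed entirely because it does not fit the assumed numeral-based formalism.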
Accordingly, there remains interest in the development of improved numbered sequence detectors that make fewer limiting assumptions and consequently can detect a wider range of numbered sequences in documents.