The following relates to the information storage and processing arts. It finds application in conjunction with electronic document format conversion and in particular with cataloging of legacy documents in a marked-up format such as extensible markup language (XML), standard generalized markup language (SGML), hypertext markup language (HTML), or the like, and will be described with particular reference thereto. However, it is to be appreciated that the following is amenable to other like applications.
Documents are now more central than ever to many activities. Knowledge is stored in documents, and knowledge is exchanged by circulating those documents. In this context, the recent evolution toward “structured documents” (especially around the W3C XML format), an effort to endow documents with new properties, will continue to ease the automatic processing of documents.
Legacy document conversion relates to converting unstructured documents existing in formats such as Adobe® portable document format (PDF), various text formats, various word processing formats, and the like into structured documents employing a markup language such as XML, SGML, HTML, and the like. In structured documents, content is organized into delineated sections such as document pages with suitable headers/footers and so forth. Alternatively, other kinds of segmentable text blocks can be identified. Such organization typically is implemented using markup tags. In some structured document formats such as XML, a document type definition (DTD) or similar document portion provides overall information about the document, such as an identification of the sections, and facilitates complex document structures such as nested sections.
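By way of a non-limiting illustration, the following minimal sketch shows how markup tags can delineate nested sections in a structured document. The element names (`manual`, `section`, `para`), the sample text, and the helper `outline` are assumptions chosen for illustration only and are not taken from any particular document type definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; element names are illustrative only.
STRUCTURED = """
<manual>
  <section title="Installation">
    <para>Unpack the unit.</para>
    <section title="Power">
      <para>Connect the power cable.</para>
    </section>
  </section>
  <section title="Maintenance">
    <para>Clean the filter monthly.</para>
  </section>
</manual>
"""


def outline(element, depth=0):
    """Return (depth, title) pairs for every nested section, in document order."""
    entries = []
    for child in element.findall("section"):
        entries.append((depth, child.get("title")))
        entries.extend(outline(child, depth + 1))
    return entries


root = ET.fromstring(STRUCTURED)
for depth, title in outline(root):
    # Indent each title by its nesting depth to show the hierarchy.
    print("  " * depth + title)
```

Because the nesting is explicit in the markup, the section hierarchy can be recovered with a simple recursive traversal rather than inferred from the visual presentation.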
There is thus interest in converting unstructured documents to a structured format when such structure can facilitate storage of and access to the document as a “legacy document”. The particular motivations for converting documents are diverse, typically including an intent to reuse or repurpose parts of the documents, a desire for document uniformity across a database or information store, facilitation of document searches, and so forth. Technical manuals, user manuals, and other proprietary reference documents are common candidates for such legacy conversions.
A particularly problematic issue that arises during the conversion process is the rebuilding or preserving of structural information. The output structure can be very different from the input structure, and different structures may be needed depending on what one wants to do with a document. For example, a layout-oriented structure allows a document to be published on different media but does little to help semantic search or automatic summarization. Conversely, page segmentation is often discarded in a logical representation, where the logical units are elements such as chapters and sections (pages are usually considered physical elements and do not appear).
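The difference between a layout-oriented structure and a logical structure can be illustrated with the following minimal sketch. It assumes, purely for illustration, that headings are already marked in the input by a hypothetical `role="heading"` attribute; the element names, the sample text, and the helper `to_logical` are likewise assumptions and do not represent the conversion method described herein.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout-oriented input: content is segmented by page, and
# headings are identified only by an assumed role="heading" attribute.
LAYOUT = """
<document>
  <page number="1">
    <line role="heading">Installation</line>
    <line>Unpack the unit.</line>
  </page>
  <page number="2">
    <line>Connect the power cable.</line>
    <line role="heading">Maintenance</line>
    <line>Clean the filter monthly.</line>
  </page>
</document>
"""


def to_logical(layout_root):
    """Regroup page-segmented lines into sections, discarding page boundaries."""
    logical = ET.Element("document")
    current = None
    for line in layout_root.iter("line"):
        if line.get("role") == "heading":
            # A heading opens a new logical section.
            current = ET.SubElement(logical, "section", title=line.text)
        else:
            if current is None:
                # Text before the first heading goes into an untitled section.
                current = ET.SubElement(logical, "section", title="")
            ET.SubElement(current, "para").text = line.text
    return logical


logical = to_logical(ET.fromstring(LAYOUT))
for section in logical:
    print(section.get("title"), [para.text for para in section])
```

In this sketch the section that begins on the first page continues onto the second page, so the page boundary, a physical element, does not appear in the logical output.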
The “document understanding” or “document analysis” research field aims precisely at analyzing presentation-oriented document representations to build more abstract document structures. It is a very heterogeneous field, since different disciplines, such as image processing (OCR, document page layout analysis) and natural language processing, aim at analyzing documents. Each of these disciplines has its particular viewpoint and vocabulary, and there is not yet anything like a shared understanding of what “presentation-oriented”, “logical” or “content-oriented” structures might be. Nevertheless, there is a shared working hypothesis about their interdependency. (Marco Aiello, Christof Monz, Leon Todoran, and Marcel Worring, Document understanding for a broad class of documents. International Journal of Document Analysis and Recognition, 5:1-16, 2002. Richard Power, Donia Scott, and Nadjet Bouayad-Agha. Document Structure. Computational Linguistics, 29(2):211-260, 2003.)
Thus there is a need for transforming a document (more precisely, documents comprising a homogeneous collection) with a layout-oriented structure into a document with a more abstract generic structure, hereinafter identified as a “logical structure”. The logical structure can then be used as an intermediary step toward a content-oriented structure, more specific to a particular document or document collection. Such a transformation would be particularly advantageous if the “presentation-oriented”, “logical” and “content-oriented” structures could be related, that is, by using both knowledge of the layout and knowledge of the content to reach the desired structured document. Additionally, it would be advantageous if information related to the document could be computed at the collection level.
For purposes of this application, “layout” is intended to include the physical presentation of a document, including segmenting constructs such as chapters, sections, pages, tables and appendices. “Content” is intended to mean the textual material itself within the document. “Collection” is intended to mean a related or homogeneous set of associated documents, e.g., a collection of technical manuals relating to a particular product line.
Copending, commonly assigned applications include Method and Apparatus for Detecting a Table of Contents and Reference Determination, Ser. No. 11/032,814; Method and Apparatus for Detecting Pagination Constructs Including A Header and A Footer In Legacy Documents, Ser. No. 11/032,817; and Systems and Methods for Converting Legacy and Projecting Documents Into Extended Markup Language Format, Ser. No. 10/756,313, filed Jan. 14, 2004, which are herein incorporated by reference.
The following provides improved apparatuses and methods that overcome the above-mentioned disadvantages and others by structuring documents based on their content, layout and collection.