The exemplary embodiment relates to document processing. It finds particular application as an apparatus or method for representing context and underlying document structure in a common output for a user to review. The Extensible Markup Language (XML) is a widely used extensible language which aids information systems in sharing structured data, encoding documents, and serializing data. XML is not only useful in creating web pages, but also makes it possible to define the content of a document separately from its formatting, facilitating the reuse of that content in other applications or environments. XML provides a basic syntax for sharing information between different computers, different applications, and different organizations without needing to pass through many layers of conversion.
One type of XML document of particular interest herein is a paginated XML document. The phrase "paginated XML document" is used because the XML data of this type of document reflects the layout of each page of the document. This XML data structure is common in the document management and processing domain as an initial, intermediary, or final structure. For instance, most, if not all, optical character recognition (OCR) engines offer such an output format. Additionally, many document converters offer XML as either an input or output format, for example the well-known open source pdf2xml converter, which converts information contained in an Adobe® PDF file into an XML document.
Paginated XML documents typically conform to the schema shown in Table 1, as expressed in compact Relax NG. Relax NG (REgular LAnguage for XML Next Generation) is a well-known schema language for XML documents.
TABLE 1
namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"
start = element DOCUMENT {
  element PAGE {
    attribute number { xsd:positiveInteger }
  }+
}
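By way of illustration only, the structure described by Table 1 can be checked with a few lines of Python using the standard xml.etree.ElementTree module. The helper name check_paginated is hypothetical, and the check is deliberately loose: it is not a Relax NG validator, it only approximates xsd:positiveInteger, and it tolerates extra children and attributes (such as a METADATA element emitted by a converter) that the compact schema of Table 1 does not list.

```python
import xml.etree.ElementTree as ET


def check_paginated(xml_text):
    """Return the page numbers of a paginated XML document.

    Hypothetical helper: a loose structural check, not a full
    Relax NG validation of the schema in Table 1.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "DOCUMENT":
        raise ValueError("root element must be DOCUMENT")
    pages = root.findall("PAGE")  # direct PAGE children only
    if not pages:
        raise ValueError("at least one PAGE element is required")
    numbers = []
    for page in pages:
        value = page.get("number", "")
        # Approximation of xsd:positiveInteger: ASCII digits, value >= 1.
        if not (value.isascii() and value.isdigit()) or int(value) < 1:
            raise ValueError("invalid page number: %r" % value)
        numbers.append(int(value))
    return numbers


sample = '<DOCUMENT><PAGE number="1"/><PAGE number="2"/></DOCUMENT>'
print(check_paginated(sample))
```

A document whose PAGE element lacks the number attribute, or carries a value such as "0", is rejected with a ValueError in this sketch.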
It is often desirable to display the content of paginated XML documents in a human-friendly manner for various purposes, such as when designing a conversion chain (e.g., reviewing the input, intermediary and output documents), when performing quality assurance (QA) on a document collection, and the like. An XML excerpt produced by the pdf2xml converter is shown in Table 2 below to illustrate the need for a more human-friendly manner of displaying the content of an XML document. Note that Table 2 shows only a portion of the document; even so, the difficulty that a person would encounter in navigating through the virtual sea of esoteric attributes can be readily appreciated.
TABLE 2
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <METADATA>
    <PDFFILENAME>05_.pdf</PDFFILENAME>
    <PROCESS name="pdftoxml" cmd="-noImage ">
      <VERSION value="1.2">
        <COMMENT/>
      </VERSION>
      <CREATIONDATE>Thu Jun 26 11:41:48 2008</CREATIONDATE>
    </PROCESS>
  </METADATA>
  <PAGE width="595" height="842" number="1" id="p1">
    <TEXT width="125.5" height="13.284" x="235" y="57.208" id="p1_t1">
      <TOKEN sid="p1_s5" id="p1_w1" font-name="TimesNewRomanPSMT" bold="no" italic="no" font-size="12" font-color="#000000" rotation="0" angle="0" x="235" y="57.208" base="67.9" width="37.296" height="13.284">Server</TOKEN>
      <TOKEN sid="p1_s6" id="p1_w2" font-name="timesnewromanpsmt" bold="no" italic="no" font-size="12" font-color="#000000" rotation="0" angle="0" x="275.4" y="57.208" base="67.9" width="57.992" height="13.284">d'Archiving</TOKEN>
      <TOKEN sid="p1_s7" id="p1_w3" font-name="timesnewromanpsmt" bold="no" italic="no" font-size="12" font-color="#000000" rotation="0" angle="0" x="336.5" y="57.208" base="67.9" width="24" height="13.284">2007</TOKEN>
    </TEXT>
    ... ...
With reference to FIG. 1, an alternate view of the XML document of Table 2 is shown as it might be displayed or printed. A first page 10 is shown, with two paragraphs 12, 14 on the page. This alternate view is useful to a user needing to see the final form of the exemplary XML document. However, the underlying details of the XML document, as shown in Table 2, are not visible or apparent in the view shown in the figure. Users needing to view both the final form of the document and the underlying details associated with features of that final form will have difficulty associating the details of Table 2 with the corresponding features of the document as shown in FIG. 1. Further, other users may need to view only selected features or components of the document.
As demonstrated above, the interests of users examining the contents of an XML document vary, and the problems associated with viewing the XML document in its raw native form can be better understood by examining a few of these user needs. For example, one user may need to view elements with a certain attribute, e.g., the @PageNumber attribute of any TOKEN element, while another user may instead need to view any TOKEN element followed by a FIGURE element.
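The two selections just mentioned can be sketched in Python with the standard xml.etree.ElementTree module. The sample document and both function names are hypothetical, and "followed by" is assumed here to mean that the FIGURE element is the immediately following sibling of the TOKEN element. Because ElementTree supports only a limited subset of XPath, without a following-sibling axis, the second selection is written as an explicit loop over each parent's children.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element and attribute names follow the text above.
SAMPLE = """<DOCUMENT>
  <PAGE number="1">
    <TOKEN PageNumber="1">1</TOKEN>
    <TOKEN>Overview</TOKEN>
    <FIGURE/>
    <TOKEN>of</TOKEN>
  </PAGE>
</DOCUMENT>"""


def tokens_with_attribute(root, name):
    """Return TOKEN elements that carry the given attribute."""
    return root.findall(".//TOKEN[@%s]" % name)


def tokens_followed_by_figure(root):
    """Return TOKEN elements whose next sibling element is a FIGURE."""
    hits = []
    for parent in root.iter():
        children = list(parent)
        # Pair each child with the sibling that immediately follows it.
        for current, following in zip(children, children[1:]):
            if current.tag == "TOKEN" and following.tag == "FIGURE":
                hits.append(current)
    return hits


root = ET.fromstring(SAMPLE)
print([t.text for t in tokens_with_attribute(root, "PageNumber")])  # ['1']
print([t.text for t in tokens_followed_by_figure(root)])            # ['Overview']
```

Even for these two simple needs, the user has to write and maintain code, which illustrates why a fixed viewer is hard to adapt to varying interests.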
Further, the XML document structure to be viewed may vary. For instance, one user may wish to adapt the view to a different input XML in which an element, e.g., <PageNumber>, replaces the attribute @PageNumber of the previous example. Another user may need to visualize the XML output of a particular OCR engine.
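As a minimal sketch of this variation, assuming Python's standard xml.etree.ElementTree module, the same selection can be parameterized so that the page number is looked up either as an attribute or as a child element of TOKEN. The function name, the as_element flag, and both sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def tokens_with_page_number(root, as_element=False):
    """Return TOKEN elements that carry a page number.

    as_element=False: page number stored as the PageNumber attribute.
    as_element=True:  page number stored as a PageNumber child element.
    """
    path = ".//TOKEN[PageNumber]" if as_element else ".//TOKEN[@PageNumber]"
    return root.findall(path)


# Hypothetical inputs showing the two encodings of the same information.
attr_doc = ET.fromstring(
    '<DOCUMENT><PAGE number="1">'
    '<TOKEN PageNumber="1">1</TOKEN><TOKEN>text</TOKEN>'
    '</PAGE></DOCUMENT>')
elem_doc = ET.fromstring(
    '<DOCUMENT><PAGE number="1">'
    '<TOKEN><PageNumber>1</PageNumber></TOKEN><TOKEN>text</TOKEN>'
    '</PAGE></DOCUMENT>')

print(len(tokens_with_page_number(attr_doc)))                   # 1
print(len(tokens_with_page_number(elem_doc, as_element=True)))  # 1
```

Only the path expression changes between the two encodings, which suggests that the selection criteria are better supplied as data than fixed in the source code of the viewer.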
Yet another problem arises from a need to navigate through a document, visiting only particular nodes such as chapter headings, subheadings, page nodes, and the like. In fact, users are commonly interested in only certain nodes. Visiting only those specific nodes can be tedious without the appropriate support.
One existing solution consists of modifying the source code of an XML visualizer to meet a particular requirement. However, it can be readily appreciated that this is a cumbersome and time-consuming solution. Further, it is a solution that requires a particular skill set that many users may not have.
The present application provides a new and improved apparatus and method which overcome the above-discussed problems and others.