The present invention relates to techniques that identify and categorize paragraphs, subparagraphs, and structural groupings in electronic documents, and more particularly to techniques that build a structure hierarchy from structural groupings.
An electronic document typically has information content, such as text, graphics, and tables, and formatting information that directs how the content is to be displayed. An electronic document resides on a digital, though not necessarily electronic, computer storage medium. An electronic document is generally provided by an author, distributor, or publisher who desires that the document be viewed with the appearance with which it was created. Electronic documents may be widely distributed and, therefore, can be viewed on a great variety of hardware and software platforms. A hypertext document is an electronic document with links, which are explicit, user-selectable navigation elements.
Generally, electronic and human-perceptible documents include a set of paragraphs. Each instance of a paragraph shares characteristics with other paragraphs. Paragraphs that share visual characteristics can be considered the same structural type. Examples of structural paragraph types are titles, headers, and footnotes.
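As a rough illustration of treating paragraphs that share visual characteristics as one structural type, the following minimal sketch groups paragraphs by an appearance key. The dictionary fields (`font`, `size`, `bold`) and the function name `group_by_appearance` are hypothetical choices made for this example; the sketch is a naive grouping and is not presented as the technique of the invention.

```python
from collections import defaultdict


def group_by_appearance(paragraphs):
    """Group paragraph texts by a (font, size, bold) key.

    Each distinct key stands in for one candidate structural type.
    The choice of these three attributes is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for p in paragraphs:
        key = (p["font"], p["size"], p["bold"])
        groups[key].append(p["text"])
    return dict(groups)


# Hypothetical sample data: two headings and one body paragraph.
sample = [
    {"text": "Introduction", "font": "Helvetica", "size": 18, "bold": True},
    {"text": "Documents have structure.", "font": "Times", "size": 11, "bold": False},
    {"text": "Methods", "font": "Helvetica", "size": 18, "bold": True},
]
groups = group_by_appearance(sample)
```

Here the two headings fall into one group and the body paragraph into another, which mirrors the idea that visually similar paragraphs can be treated as the same structural type.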
In addition, in all documents, paragraphs can have subparagraphs, which are character streams. Each instance of a subparagraph shares similar characteristics with other subparagraphs that are the same structural type. Examples of subparagraph structural types are book titles, quotations, and foreign words and phrases.
A document typically has a logical organization. Within the logical organization are identifiable structural groups. A series of chapters containing paragraphs is an example of a structural group, as is a section that contains a heading, several paragraphs, and a bulleted list.
Organizing components in an electronic document by structural type permits an electronic document development system to perform global operations on all instances of the same type within the electronic document. For example, the FrameMaker® document publishing system, available from Adobe Systems Incorporated of San Jose, Calif., can globally change the justification of all paragraphs tagged as a particular type in the electronic document and can globally change the font size of all characters tagged as a particular type in the electronic document.
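The kind of global operation described above can be sketched as follows. The `Paragraph` class, its `tag` and `justification` fields, and the `set_justification` function are hypothetical names introduced for illustration only; they do not represent the interface of any actual publishing system.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    tag: str                      # structural type, e.g. "heading" or "body"
    justification: str = "left"   # assumed default for this sketch


def set_justification(paragraphs, tag, justification):
    """Apply one justification to every paragraph carrying the given tag.

    Returns the number of paragraphs that were changed.
    """
    changed = 0
    for p in paragraphs:
        if p.tag == tag:
            p.justification = justification
            changed += 1
    return changed


# Hypothetical document with two headings and one body paragraph.
doc = [
    Paragraph("Chapter 1", "heading"),
    Paragraph("It was a dark night.", "body"),
    Paragraph("Chapter 2", "heading"),
]
count = set_justification(doc, "heading", "center")
```

Because every paragraph already carries a structural tag, a single call updates all headings at once and leaves the body paragraph untouched.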
Standard type formats exist for particular uses and for particular systems. For example, the HyperText Markup Language (HTML) uses the embedded tags <P> and </P> to delimit paragraphs, and <B> and </B> to delimit bold text. HTML also specifies many other tags including tags for titles, menus, definitions, quoted blocks, and heading styles. For an electronic document to have the desired visual appearance when viewed with a World Wide Web browser, the electronic document must have the appropriate HTML tags.
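As a small illustration of the paragraph and bold tags mentioned above, the following sketch wraps plain text in those tags. The function name `to_html` and the `(text, bold)` input format are assumptions made for this example; only `html.escape` from the Python standard library is relied upon.

```python
from html import escape


def to_html(paragraphs):
    """Wrap each (text, bold) pair in <P> tags, adding <B> tags when bold.

    Text is escaped so that characters such as "&" and "<" are not
    mistaken for markup.
    """
    lines = []
    for text, bold in paragraphs:
        body = escape(text)
        if bold:
            body = f"<B>{body}</B>"
        lines.append(f"<P>{body}</P>")
    return "\n".join(lines)


# Hypothetical input: one plain paragraph and one bold paragraph.
markup = to_html([("Hello & welcome", False), ("Note", True)])
```

The sketch covers only two of the many tags that HTML specifies, which is enough to show how structural information must be made explicit before a browser can render it as intended.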
When viewed on paper or on a computer display, the different structural paragraph types in a document, such as headings and lists, are readily identifiable. However, enabling a system to perform operations based on structural types, such as modifying, rearranging, displaying, or printing a document, generally requires that someone manually examine and tag all paragraphs and subparagraphs according to their visually recognized structural type. This process is tedious and time-consuming, and is often impracticable for large documents.