It has become common to exchange documents, especially legal documents and particularly contracts, in digital form in the course of commerce, legal counseling, valuation, and the like. Most such documents are long and difficult to visualize and navigate, for skilled and unskilled readers alike. In some cases these documents are available only in plain text; in other cases they are available as web pages or Portable Document Format (PDF) documents. In lengthy and/or highly structured documents (i.e., those having many sections, subsections, etc.), a table of contents is sometimes added at the beginning of the document. This aid, although useful, is not always the best solution when the document is accessed on a digital device, since a table of contents is not necessarily easily accessible to the reader while scrolling through the document, and its entries may or may not be links to the related content.
The ease of navigating legal documents cannot be easily improved by the parties involved, due primarily to the need for the parties to maintain legally valid document formalities. For example, the parties to a contract in principle need to (i) ensure that each section, sentence, and word in the contract has a sufficient level of readability, and (ii) maintain a similar level of readability whether the document is consulted in electronic format or in hard copy.
This situation leaves a need for improved document navigability, particularly in digital form, under less formal circumstances. To implement techniques that improve document navigability, it is useful to identify the structure of documents that have hierarchies of sections and subsections. Several known methods perform structure identification. These known methods, however, suffer from problems that prevent their widespread use. For example, some document analyzers work only for documents with a pre-existing table of contents. Others perform analysis based merely on formatting and style, and therefore work with only a limited number of documents, which has prevented their wide adoption. Yet others are limited to left-to-right languages, to documents with particular formatting, or to alphabetic languages.
Embodiments described herein address these and other limitations of the prior art.