The following relates to the document creation, processing, storage, display, and related arts.
There is interest in converting hardcopy documents, or documents in specialized or proprietary electronic formats such as certain word processing formats, certain spreadsheet formats, certain presentation formats, and so forth, to a selected structured format. The selected structured format is typically hypertext markup language (HTML), extensible markup language (XML), standard generalized markup language (SGML), or another structured format having defined markup formatting syntax or rules. Such conversion ensures that the documents do not become unusable in the event that software capable of reading the specialized or proprietary format becomes unavailable. Conversion also facilitates indexing and creation of knowledge databases or other searchable document repositories.
Document conversion typically begins with obtaining the document in an unstructured or undesirably structured form. For electronic documents such as word processing documents, this entails identifying the document by filename and file path, by URL, or so forth, and in some cases performing some initial format conversion operations. For a hardcopy document, this entails optically scanning the document and performing optical character recognition (OCR) to generate an unstructured or shallowly structured electronic text-based copy. The obtained electronic document is segmented into lines and tokens or other word-size elements, and may be provided with some shallow structuring such as demarcation of paragraphs or pagination. There are commercially available products, such as FineReader (available from ABBYY USA Software House, Fremont, Calif.), that provide scanning and OCR of hardcopy documents and further provide tokenization and conversion of the document into a shallow or largely unstructured XML format. The unstructured or shallowly structured document provides the basis for further analysis and marking up of structural features of interest. For example, markup tags or other structural document formatting can be used to mark features such as chapters, sections, tables, and so forth.
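By way of illustration only, the segmentation into pages, lines, and tokens described above might be sketched as follows. The element names (document, page, line, token) and the function name to_shallow_xml are hypothetical and chosen for this example. They are not the output format of FineReader or of any other product, and whitespace splitting stands in for whatever tokenization a real converter performs.

```python
import xml.etree.ElementTree as ET


def to_shallow_xml(pages):
    """Build a shallow XML tree from a list of page strings.

    Each page becomes a <page> element, each non-empty text line a <line>,
    and each whitespace-delimited word a <token>. All element names are
    hypothetical and used only for illustration.
    """
    doc = ET.Element("document")
    for page_no, page_text in enumerate(pages, start=1):
        page_el = ET.SubElement(doc, "page", number=str(page_no))
        for line_text in page_text.splitlines():
            if not line_text.strip():
                continue  # skip blank lines
            line_el = ET.SubElement(page_el, "line")
            for word in line_text.split():
                ET.SubElement(line_el, "token").text = word
    return doc


# Two short pages standing in for OCR output.
pages = ["Chapter 1\nIt was a dark night.", "The end."]
doc = to_shallow_xml(pages)
print(ET.tostring(doc, encoding="unicode"))
```

The resulting tree records only pagination, line breaks, and tokens. Deeper structure such as chapters, sections, or tables would be added by later analysis.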
One feature of interest is reference notes, such as footnotes, endnotes, table notes, and so forth. Usually, a note includes two parts: (i) a reference mark, such as a raised superscript number or symbol, in the body of the text, table, or other structure that draws the reader's attention to the note; and (ii) a replication of the reference mark at the bottom of the page (for footnotes), or at the end of a section or document (for endnotes), or after a table (for table notes), followed by the note text. Identification of notes is useful both to enable the document to be marked up to indicate the note, and to ensure that the note is not misinterpreted during document analysis. If a note is not recognized as such, it is possible for the reference mark to be misinterpreted as part of a word, or for the note text to be misinterpreted as a section heading, paragraph, list item, or other structure. Such misinterpretation can in turn lead to misspelled words, improper text flow, or other incongruities in the marked-up document.
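The two-part structure of a note can be pictured with a small data model. The following is a minimal sketch under simplifying assumptions: reference marks are assumed to appear in square brackets, the body lines and note lines are assumed to be already separated, and the names Note and pair_notes are hypothetical. It illustrates only the pairing of a reference mark with its replicated mark and note text, and is not a note detection technique.

```python
import re
from dataclasses import dataclass


@dataclass
class Note:
    mark: str       # reference mark, e.g. "1" or "*"
    body_line: int  # index of the body line carrying the mark
    text: str       # note text that follows the replicated mark


# Assumed convention for this sketch: marks are bracketed, e.g. "[1]" or "[*]".
MARK_IN_BODY = re.compile(r"\[(\w+|\*)\]")
NOTE_LINE = re.compile(r"^\[(\w+|\*)\]\s+(.*)$")


def pair_notes(body_lines, note_lines):
    """Pair each reference mark in the body with its note text."""
    # Part (ii): replicated mark followed by the note text.
    note_text = {}
    for line in note_lines:
        match = NOTE_LINE.match(line)
        if match:
            note_text[match.group(1)] = match.group(2)

    # Part (i): reference marks found in the body.
    notes = []
    for index, line in enumerate(body_lines):
        for match in MARK_IN_BODY.finditer(line):
            mark = match.group(1)
            if mark in note_text:
                notes.append(Note(mark, index, note_text[mark]))
    return notes


body = ["The method was first proposed in 1998.[1]",
        "Later work extended it.[2]"]
footer = ["[1] See the original report.",
          "[2] Extended in a follow-up study."]
notes = pair_notes(body, footer)
for note in notes:
    print(note)
```

In real documents the marks are frequently superscripts rather than bracketed strings, and the note region is not known in advance, which is where the misinterpretations described above arise.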
An existing technique for identifying footnotes employs recognition of a bottom-of-page separating horizontal line that is sometimes used to separate the footnotes from the body of text on the page. This approach is robust if the document uses such a separating horizontal line and the line is retained during optical scanning and OCR, but is inoperative otherwise. Moreover, this approach does not work for endnotes. Other techniques utilize layout information such as font size to identify reference marks of the note. These techniques can be overinclusive if the document uses such layout features to denote other document elements, and can be underinclusive if the document uses a mechanism such as brackets to set off the reference marks. Moreover, these techniques are not useful if the OCR or other processing fails to retain the layout features relied upon to identify the reference marks.
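The over- and under-inclusiveness of a layout-based approach can be seen in a toy example. The sketch below is a hypothetical font-size heuristic written for illustration, not the implementation of any particular existing technique: the function name small_font_candidates, the 0.8 ratio, and the sample tokens with their font sizes are all assumptions.

```python
from statistics import median


def small_font_candidates(tokens, ratio=0.8):
    """Return the text of tokens whose font size is well below the body size.

    tokens is a list of (text, font_size) pairs. The body size is taken to be
    the median font size, and the ratio threshold is an arbitrary assumption.
    """
    body_size = median(size for _, size in tokens)
    return [text for text, size in tokens if size < ratio * body_size]


# Assumed sample: "1" is a superscript reference mark, "[2]" is a bracketed
# reference mark set at body size, and "Fig." belongs to a small-font caption.
tokens = [("The", 10.0), ("result", 10.0), ("1", 6.0), ("holds", 10.0),
          ("[2]", 10.0), ("generally", 10.0), ("Fig.", 7.0)]
candidates = small_font_candidates(tokens)
print(candidates)
```

In this sample the heuristic flags the superscript mark but also the caption token, and it misses the bracketed mark entirely. If the font sizes were lost during OCR, every token would share one size and nothing would be flagged.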