The following relates to the information processing arts. It particularly relates to correction of errors introduced into text-based documents created by conversion from a non-text format, and is described with particular reference thereto. However, the following relates more generally to correction of errors in text-based documents generated directly as text or generated through the use of conversion processing.
Document conversion to text or text-based formats is useful to promote document re-use, enable content searching, facilitate document structuring, and so forth. For example, converting documents to structured extensible mark-up language (XML), hypertext mark-up language (HTML), standard generalized markup language (SGML), or another structured format including mark-up tags facilitates an integrated document database environment employing a common document structure.
Converting a document originally formatted as a portable document format (PDF) file or other non-text format to a text format such as an ASCII file, a rich text format (RTF), an HTML document, an XML document, an SGML document, or so forth, can introduce errors. The most common errors in converting PDF to text include introducing extraneous spaces (thus “breaking up” what should be a single word), improperly removing spaces (and thus “running words together”), and inserting or retaining extraneous hyphens. Such errors can occur, for example, due to the PDF file having multiple font sizes, font styles, and/or font types, due to hyphenation of words at the end of lines of text in a page layout format, and so forth. Errors due to font size, style, type, or special font effects may occur more frequently in converted section headings, titles, and other “non-standard” text that tend to use enlarged fonts, boldface, underscores, and so forth. Errors in section headings or other document structure annotations can degrade performance of automated table-of-contents extractors or other automated document structuring operations that may be applied after the conversion to text.
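The three error classes named above (extraneous spaces, removed spaces, and extraneous hyphens) can be made concrete with a small sketch. The following Python fragment is illustrative only and is not the correction technique of this disclosure: it assumes a toy vocabulary `VOCAB` and a hypothetical helper `diagnose`, both introduced here for the example, and simply labels tokens that match one of the three patterns.

```python
# Illustrative sketch only. VOCAB and diagnose() are hypothetical names
# introduced for this example; a practical system would need a far larger
# lexicon and a way to resolve ambiguous cases.
VOCAB = {"the", "document", "table", "of", "contents", "error", "correction"}


def diagnose(text, vocab=VOCAB):
    """Return (error_kind, original, suggestion) tuples for suspicious tokens."""
    tokens = text.lower().split()
    findings = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in vocab:
            i += 1
            continue
        # Extraneous space: two adjacent fragments form a known word.
        if nxt is not None and tok + nxt in vocab:
            findings.append(("extraneous space", f"{tok} {nxt}", tok + nxt))
            i += 2
            continue
        # Extraneous hyphen: removing the hyphen yields a known word.
        if "-" in tok and tok.replace("-", "") in vocab:
            findings.append(("extraneous hyphen", tok, tok.replace("-", "")))
            i += 1
            continue
        # Removed space: the token splits into two known words.
        split = next(
            ((tok[:k], tok[k:]) for k in range(1, len(tok))
             if tok[:k] in vocab and tok[k:] in vocab),
            None,
        )
        if split is not None:
            findings.append(("removed space", tok, " ".join(split)))
        i += 1
    return findings


sample = "the docu ment tableof contents error correc-tion"
for finding in diagnose(sample):
    print(finding)
```

On the sample string, the sketch flags "docu ment" as an extraneous space, "tableof" as a removed space, and "correc-tion" as an extraneous hyphen. A lookup this simple cannot distinguish legitimate hyphenated compounds or words absent from the vocabulary, which is part of why such errors are difficult to correct automatically.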
When converting PDF or other formats to a structured format such as XML, another type of error that can occur is improper text flow or improper text blocking. For example, in XML, mark-up tag pairs such as <PARAGRAPH></PARAGRAPH>, <TEXT></TEXT>, or so forth are typically used to delineate paragraphs or other logical blocks of text. On the other hand, PDF and some other page layout-based formats delineate text into physical lines on a page. When converting from PDF or another page layout-oriented format to XML, each physical line of text may be delineated by a suitable XML mark-up tag pair such as <TEXT></TEXT>, even though the physical lines on the page do not correspond to logical groupings or blocks of text.
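The mismatch between physical lines and logical blocks can be seen in a short sketch. The Python fragment below uses the standard `xml.etree.ElementTree` module; the `DOCUMENT` root element, the function names, and the sample lines are assumptions made for illustration and do not come from any particular converter.

```python
import xml.etree.ElementTree as ET


def lines_to_xml(physical_lines):
    """Line-oriented conversion: one <TEXT> element per physical line."""
    root = ET.Element("DOCUMENT")  # hypothetical root element
    for line in physical_lines:
        ET.SubElement(root, "TEXT").text = line
    return root


def paragraph_to_xml(physical_lines):
    """Logical grouping: the same lines as a single <PARAGRAPH> block."""
    root = ET.Element("DOCUMENT")  # hypothetical root element
    paragraph = ET.SubElement(root, "PARAGRAPH")
    paragraph.text = " ".join(line.strip() for line in physical_lines)
    return root


# Three physical lines that together form one sentence of one paragraph.
lines = [
    "Document conversion to text is useful",
    "to promote document re-use and",
    "enable content searching.",
]

print(ET.tostring(lines_to_xml(lines), encoding="unicode"))
print(ET.tostring(paragraph_to_xml(lines), encoding="unicode"))
```

The first call produces three sibling TEXT elements, one per physical line, so the sentence is split across unrelated blocks. The second produces the single PARAGRAPH element that reflects the logical structure of the text.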