Documents formatted in a portable document format, such as PDF, are commonly used to simplify the display and printing of structured documents. Such documents permit incorporation of a mix of text and graphics to provide a visually pleasing and easy-to-read document across heterogeneous computing environments. It is estimated that there are currently about 2.5 trillion files on the World Wide Web encoded as PDF documents.
It is often necessary to extract text from a document encoded in a portable document format to, for example, (1) read a document out loud, (2) reflow a document for viewing on the small screen of a mobile device, (3) make the reading of PDFs more accessible for visually impaired and motion-impaired users, (4) copy text for pasting into another document, or (5) analyze document text, for example to search for phrases, summarize the text, or export it to another format. Current tools can identify contiguous portions of text but unfortunately do not accurately identify discontinuous portions of text, for example, text that is arranged in multiple columns and that may be interspersed around images or other visual elements.
Generating documents with tags to indicate portions of text can help, but many existing documents are not tagged, and tagging tools cannot always correctly tag existing documents. Segmenting a document and labeling the resulting segments with labels such as “title” and “body” has been proposed as a solution. Use of spatial information within a document to determine document structure has also been proposed, as has topological sorting to determine reading order. Unfortunately, none of these approaches is sufficiently flexible to handle the large number of existing documents encoded in a portable document format.