Aspects of the present invention relate to preparing a display document for analysis, and, more specifically, to an apparatus and method for extracting and manipulating text order to prepare a display document for analysis.
A display document, such as a Portable Document Format (PDF) document, is a document that is primarily intended to convey content visually to a user. In many cases, the display document may be graphical in nature or may combine graphical and textual structures. Text associated with a display document is extracted from the document before text analysis is performed, and it is desirable for the extracted text to be in logical (i.e., reading) order at that point. Some document formats, particularly those intended for display purposes (e.g., PDF), display text in left-to-right (LtR) order regardless of whether the language of the text has an LtR logical order (e.g., English) or a right-to-left (RtL) logical order (e.g., Arabic). If text has an RtL logical order and the associated document displays it in LtR order, current text extraction tools extract the text in the displayed order (i.e., LtR order). The extracted text is therefore not suitable for text analysis, because its characters appear in the reverse of their logical order.
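The problem described above can be illustrated with a minimal Python sketch. It is an illustration only, not the extraction method of any particular tool: the helper names `is_rtl_char` and `visual_ltr_order` are hypothetical, and the sketch assumes a run of purely RtL text, which it models by simple reversal. Text that mixes directions would require the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def is_rtl_char(ch: str) -> bool:
    """Return True if the character has a strong right-to-left bidi class."""
    # "R" covers Hebrew-style RtL letters, "AL" covers Arabic letters.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def visual_ltr_order(logical: str) -> str:
    """Hypothetical model of a display document laying out a pure-RtL run.

    The glyph shown leftmost on the page is the last character in logical
    order, so reading the page left to right yields the reversed string.
    """
    return logical[::-1]


# Two Arabic words in logical (reading) order, written with escapes so the
# source code itself is not affected by bidirectional rendering.
logical = "\u0633\u0644\u0627\u0645 \u0639\u0644\u064a\u0643\u0645"

# What an extractor that reads glyphs in displayed LtR order would return.
extracted = visual_ltr_order(logical)

# The characters and the word sequence no longer match the logical text,
# so word-level analysis of the extracted text would fail.
print(extracted == logical)                  # False
print(extracted.split() == logical.split())  # False

# For this simplified pure-RtL case, reversing restores logical order.
print(extracted[::-1] == logical)            # True
```

In this simplified case, a single reversal recovers the logical order. The sketch is meant only to show why text extracted in displayed order cannot be passed directly to text analysis.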