Unstructured documents (such as PDFs) are expressed as a series of stateful graphic drawing operations. These drawing operations dictate where particular characters and graphics are placed in the output, as well as metadata regarding those characters and graphics. For example, a sequence of drawing operations may move the cursor to a particular position (e.g., 100, 200), set the font, font size, and font color, and print a particular character (e.g., “W”) at that location. The next drawing operations might then move the cursor to another position (e.g., 100, 210) and print another character (e.g., “a”) at that location.
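The stateful nature of these operations can be illustrated with a minimal sketch. The operation names used here (`move_to`, `set_font`, `set_color`, `show`) and the `Glyph` record are hypothetical stand-ins chosen for readability; they are not the operators of any actual PDF content stream or the API of any library. The sketch only shows how state set by earlier operations carries over to later characters.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A placed character plus the metadata in effect when it was drawn."""
    char: str
    x: float
    y: float
    font: str
    size: float
    color: str


def run_ops(ops):
    """Interpret hypothetical (name, *args) drawing operations in order."""
    # Graphics state persists across operations until it is changed.
    state = {"x": 0.0, "y": 0.0, "font": None, "size": None, "color": None}
    glyphs = []
    for name, *args in ops:
        if name == "move_to":
            state["x"], state["y"] = args
        elif name == "set_font":
            state["font"], state["size"] = args
        elif name == "set_color":
            (state["color"],) = args
        elif name == "show":
            (char,) = args
            glyphs.append(Glyph(char, state["x"], state["y"],
                                state["font"], state["size"], state["color"]))
        else:
            raise ValueError(f"unknown operation: {name}")
    return glyphs


# The example from the text: print "W" at (100, 200), then "a" at (100, 210).
ops = [
    ("move_to", 100, 200),
    ("set_font", "Helvetica", 12),  # font name and size are illustrative
    ("set_color", "black"),
    ("show", "W"),
    ("move_to", 100, 210),
    ("show", "a"),
]
glyphs = run_ops(ops)
```

Note that the second character inherits the font, size, and color set before the first one, because only the cursor position changed in between.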
The order in which these drawing operations occur dictates the order in which the characters are received as input when the text is programmatically extracted from the PDF document. However, the order in which the characters appear in the PDF document can differ from the order in which a reader reads the outputted document. Often, the order in which the characters are found in the PDF corresponds to the order in which the PDF was written and might have little relevance to the order in which a human reader will actually read the document. For example, in a PDF document that includes a title that spans the entire top of the page and an article body that appears in three columns, the first characters output may be found in the middle column, followed by characters found in the first column, followed by characters found in the third column, and finally followed by the characters that form the title across the top of the page. This divergence between the order in which characters appear in the PDF document and the order in which the outputted document is consumed by a reader causes many challenges for computer operations that consume unstructured documents.
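The three-column example can be made concrete with a small sketch. The fragment texts and coordinates below are invented for illustration, and the coordinate system (origin at the top-left, y increasing downward) is an assumption of this sketch. The positional sort is used only to stand in for the human reading order of this toy layout, where each column is a single fragment; it is not offered as a general method for recovering reading order.

```python
# Each tuple is (text, x, y), listed in the order it appears in the
# hypothetical drawing-operation stream.
stream = [
    ("middle column", 250, 100),
    ("first column", 50, 100),
    ("third column", 450, 100),
    ("Title", 50, 20),
]

# Order seen by naive programmatic extraction: the stream order.
stream_order = [text for text, _, _ in stream]

# Order a human would read this toy page: title first, then the columns
# left to right. Sorting by (y, x) reproduces that only for this layout.
reading_order = [text for text, _, _ in
                 sorted(stream, key=lambda frag: (frag[2], frag[1]))]

print("stream: ", stream_order)
print("reading:", reading_order)
```

Here the extracted text begins with the middle column and ends with the title, while a reader starts with the title, which is the divergence described above.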