Optical character recognition (OCR) technology may be used to extract text from an image document, such as a Portable Document Format (PDF) file. OCR may detect lines of text, the words in each line, and a bounding box and text for each word. Complex layout elements and properties of the image document, such as paragraphs, tables, columns, and footnotes, may also be recognized, and the image document may be serialized to a flow document, such as an Open XML Format Document (DOCX) file or a HyperText Markup Language (HTML) file.
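For illustration only, the OCR output described above (lines, the words in each line, and a bounding box and text for each word) might be modeled as in the following minimal Python sketch. The names `Word`, `Line`, and `BoundingBox`, and the (left, top, right, bottom) coordinate convention, are assumptions made for this example and do not correspond to any particular OCR engine.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (left, top, right, bottom) in page units.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class Word:
    """One recognized word: its text and where it sits on the page."""
    text: str
    box: BoundingBox


@dataclass
class Line:
    """One recognized line of text, made up of words in reading order."""
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Join the word texts with single spaces.
        return " ".join(w.text for w in self.words)

    @property
    def box(self) -> BoundingBox:
        # Smallest box that encloses every word box.
        # Assumes the line contains at least one word.
        lefts, tops, rights, bottoms = zip(*(w.box for w in self.words))
        return (min(lefts), min(tops), max(rights), max(bottoms))


line = Line([
    Word("Optical", (10, 20, 60, 32)),
    Word("character", (64, 20, 130, 34)),
])
print(line.text)
print(line.box)
```

In this sketch the line's bounding box is derived from its words rather than stored, which keeps the two consistent if a word is later corrected or removed.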
If the image document includes several images to be converted into the flow document, additional analysis may be needed to detect sections, where a section is based on a set of major document properties. Current technology for converting image documents to flow documents may focus on matching visual fidelity rather than on preserving flow. As a result, the flow document may include a section break at the end of each page and unrecognizable section properties.
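As a rough sketch of what flow-preserving section detection could look like, the following Python example groups consecutive pages that share the same properties into a single section, so that a section break is emitted only where the properties change rather than at the end of every page. The choice of properties (page width, page height, and column count) and the names `PageProperties` and `detect_sections` are assumptions made for this example; the text above does not specify which properties define a section.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class PageProperties:
    """Hypothetical set of major properties used to compare pages."""
    width: float
    height: float
    columns: int


def detect_sections(pages: List[PageProperties]) -> List[Tuple[int, int]]:
    """Return (first_page, last_page) index pairs, one per section.

    Consecutive pages with equal properties fall into the same section.
    """
    sections = []
    start = 0
    for _, run in groupby(pages):
        length = len(list(run))
        sections.append((start, start + length - 1))
        start += length
    return sections


pages = [
    PageProperties(612, 792, 1),  # portrait, one column
    PageProperties(612, 792, 1),  # same properties: same section
    PageProperties(792, 612, 1),  # landscape: new section
    PageProperties(612, 792, 2),  # two columns: new section
]
sections = detect_sections(pages)
print(sections)  # [(0, 1), (2, 2), (3, 3)]
```

Here four pages yield three sections instead of four, because the first two pages are merged into one section.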