Often electronic content data do not consistently adhere to one standard on format, organization, and use in consistent software. For example, each individual content data creator may choose to save electronic content data in various formats. This heterogeneous nature of the electronic content data can pose challenges when various content need to be extracted, edited, re-purposed, re-styled, searched, combined, transformed, rendered or otherwise processed. Content may be encoded at an inconsistent and/or inappropriate semantic level. In some cases, a PDF (Portable Document Format) document is generated from a virtual printer driver and includes geometrical properties of content elements, e.g., a vector graphic, bitmap, or other description of such content elements, but does not include higher-level semantic structure. For example in a document containing text, text flow of lines in the same horizontal position of two separate columns can be incorrectly flowed together as a single line. This causes extraction of a single column, e.g., to “copy” and “paste” to another document a paragraph in a particular column, to be difficult. In some cases when converting the format of the content, many standard tools for format conversion operate in a manner that can potentially cause semantic information needed to perform desired processing, for example, to be lost. Therefore, there exists a need for a better way to reconstruct semantics of content.