Semantic page segmentation is the process of identifying individual regions in an electronic document and determining the role of each region (e.g., table, caption, or figure). Software applications exist for performing page segmentation operations on structured documents, such as word processing documents. Structural information included within such document files typically identifies the positions and types of the various objects of the document.
Vector graphics documents, such as Adobe® PDF documents, are widely used due to their compatibility with many different software applications, operating systems, and computing devices. But these vector graphics documents typically only include information about the paths, text, fonts and other encodings to be rendered on the page, while lacking structural information used by page segmentation algorithms to identify the different page objects within a page of the document. In one example, certain PDF files of scanned documents, older versions of PDF files generated from text documents, or pictures of documents converted to PDF files fail to include any structural information.
Existing solutions for performing page segmentation on electronic documents, such as unstructured vector graphics documents, typically use complex, heuristic rules to automatically identify and tag various structural objects within the document. But these existing solutions present disadvantages. For instance, some existing solutions use region-based classifications involving heuristic algorithms. Heuristic algorithms are not self-correcting or self-adjusting, because heuristics cannot learn. Therefore, heuristics require manual correction, or manual addition of corner cases, wherever the algorithm does not function properly. For example, a heuristic might identify a table simply because the text contains the word “table” and a number. But this rule requires manual correction to handle other text that contains the word “table” and a number yet is not part of a table. Additionally, manual intervention to add or correct a special case in a heuristic might not be possible once a software solution has been deployed to the end user. Further, existing solutions might not take advantage of both the text and the visual appearance of the layout, the latter being derived from the rendered page image.
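The table heuristic described above can be illustrated with a minimal sketch. The rule, the function name `looks_like_table`, and the sample strings are hypothetical and are not taken from any particular existing solution; the sketch assumes one possible reading of the rule, in which the word “table” is immediately followed by a number.

```python
import re

# Hypothetical heuristic: treat a text block as a table when it contains
# the word "table" followed by a number (one possible reading of the rule).
TABLE_RULE = re.compile(r"\btable\s+\d+", re.IGNORECASE)


def looks_like_table(text: str) -> bool:
    """Return True when the hypothetical rule matches the text block."""
    return TABLE_RULE.search(text) is not None


# Illustrative text blocks (made up for this sketch).
caption = "Table 2: Quarterly revenue by region"
body_sentence = "As shown in Table 2, revenue grew in every region."
plain_sentence = "Revenue grew in every region."

for block in (caption, body_sentence, plain_sentence):
    print(looks_like_table(block), "|", block)
```

The rule matches the body sentence even though that sentence merely refers to a table. Excluding such text would require a hand-written special case, which is the kind of manual correction described above.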
Furthermore, certain existing solutions cannot distinguish between objects in unstructured documents with complex layouts. For example, these solutions use region-based classification algorithms that are only able to distinguish between high-level objects, such as a figure and a text block. These solutions are unsuitable for identifying low-level features such as section headers, figures, paragraphs, captions, and the like.
Accordingly, existing solutions fail to efficiently and effectively segment unstructured vector graphics documents or other electronic documents for reasons such as (but not limited to) those described above.