Page segmentation is the process of identifying each individual element (e.g., a text block, table, or figure) appearing on each page of an electronic document. Software applications exist for performing page segmentation operations on “structured” documents, such as Microsoft® Word and other word processing documents. Structural information included within such document files typically identifies the positions and types of the various elements of the document.
Electronic documents such as Adobe® PDF documents, PostScript documents or other documents created using page description languages use a vector graphics model to define how page images are to be rendered on a display. These types of documents, which are referred to herein as “vector graphics documents,” typically contain drawing commands that are interpreted by a compatible rendering application to render the page image(s) of the document. For example, drawing commands may incorporate or reference other information stored in the document file that specifies the paths (i.e., lines, curves) for drawing text and other graphics, as well as visual properties like text size, fonts, typeface, and other encodings to be rendered on the page. Some vector graphics documents may not contain structural information or other higher-level information identifying the different page elements within each page of the document. For example, documents created using older versions of PDF, PDF files of scanned documents, and images (e.g., JPEG images or TIFF images) of documents converted to PDF files may not include any structural information.
Vector graphics documents are widely used due to their compatibility with many different software applications, operating systems, and computing devices. The ability to determine or recover structural elements in an unstructured vector graphics document is crucial to intelligently reflowing document pages on different types and sizes of display screens, making documents accessible to the visually impaired, and enabling higher-level understanding of documents. For example, using structural information, each paragraph on a page can be identified and one paragraph at a time can be displayed on a small mobile phone screen. If the positions of tables and figures are known, their contents can be analyzed for further structural information. Such further structural information can potentially allow users to sort tables based on different columns or to query the data contained in the tables or figures.
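By way of illustration only, the paragraph-at-a-time display described above can be sketched as follows. The sketch assumes that page segmentation has already produced a list of typed elements with positions, and that the page has a single column read from top to bottom. The `Element` type, its fields, and the function name are hypothetical and are not taken from any particular application.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Element:
    """Hypothetical segmented page element (names are illustrative only)."""
    kind: str    # e.g., "paragraph", "table", or "figure"
    top: float   # distance from the top of the page
    left: float  # distance from the left edge of the page
    text: str


def paragraphs_in_reading_order(elements: List[Element]) -> Iterator[str]:
    """Yield paragraph text one paragraph at a time, top to bottom.

    Assumes a single-column layout; multi-column pages would need a more
    careful ordering than a simple sort on position.
    """
    paragraphs = [e for e in elements if e.kind == "paragraph"]
    for element in sorted(paragraphs, key=lambda e: (e.top, e.left)):
        yield element.text


page = [
    Element("paragraph", top=300.0, left=72.0, text="Second paragraph."),
    Element("figure", top=150.0, left=72.0, text=""),
    Element("paragraph", top=80.0, left=72.0, text="First paragraph."),
]

# Each entry would be shown on its own screen of a small display.
screens = list(paragraphs_in_reading_order(page))
```

Without the element types and positions, a viewer has no basis for choosing which portion of the page to show on each screen, which is why recovering structure matters for reflow.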
Existing solutions for performing page segmentation on unstructured vector graphics documents typically use a set of complex, heuristic rules to automatically identify and tag various structural elements within the document. Heuristic algorithms are not self-correcting or self-adjusting, and thus require manual correction or the addition of corner cases for which the algorithm does not function properly. For example, a heuristic algorithm may identify a region of a page as a table simply because its text contains the word “table” and a number. However, such a rule may not work for every variation. Manual intervention to add to or correct a heuristic algorithm to account for a special case may not be possible when a software solution has been deployed to the end user. Heuristics typically cannot consider information beyond the document itself. Existing solutions may not analyze embedded images within a document. Additionally, these solutions operate on only the vector graphics document, and do not perform any analysis on rendered page images.
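By way of illustration only, a rule of the kind described above might be sketched as follows. The regular expression, the function name, and the sample text are hypothetical and do not reproduce any particular existing solution; they merely show how such a rule can both over-match and under-match.

```python
import re

# Hypothetical rule: treat a text block as a table if it contains the word
# "table" followed by a number.
TABLE_RULE = re.compile(r"\btable\s+\d+", re.IGNORECASE)


def looks_like_table(block_text: str) -> bool:
    """Return True if the illustrative heuristic fires on this text block."""
    return TABLE_RULE.search(block_text) is not None


samples = {
    # A genuine table caption: the rule fires as intended.
    "caption": "Table 3. Quarterly revenue by region",
    # Body text that merely refers to a table: the rule fires incorrectly.
    "cross_reference": "As shown in Table 3, revenue grew in every region.",
    # An actual table with no caption: the rule misses it.
    "unlabeled_table": "Region  Q1  Q2\nNorth   10  12\nSouth    8   9",
}

results = {name: looks_like_table(text) for name, text in samples.items()}
```

In this sketch the rule fires on the caption and on the cross-reference but not on the unlabeled table, so handling each such variation would require a further hand-written rule.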
Other existing object recognition methods are used to identify objects appearing in images of naturally occurring objects, e.g., photographs, drawings or other images of people, animals, plants, etc. While very effective on images of naturally occurring objects, these advanced image processing techniques for object recognition are not easily applied to the recognition of human-created objects or constructs (e.g., tables, charts, and paragraphs) that may be included in unstructured documents. For example, given a partial image of a person's body, it is a relatively straightforward exercise to extrapolate and predict the overall dimensions and shape of the person's entire body. However, if half a table is shown on one page of an unstructured document, prediction of the total number of rows of the table could be nearly impossible.
Accordingly, solutions are needed to more efficiently and properly analyze existing unstructured vector graphics documents to perform page segmentation on such documents.