The exemplary embodiment relates to document processing. It finds particular application in extraction of elements which together constitute an image from a PDF document.
Page description languages, such as the Portable Document Format (PDF) standard, define a set of elements which can be used individually or in combination to compose the pages of a document. These include text elements, raster graphics, and vector graphics, among others. A raster graphic, called an Image XObject in PDF terminology, is represented by a dictionary describing the properties of an image, together with an associated data stream which contains the image data. Vector graphics, sometimes referred to as vectorial instructions, are based on mathematical equations, and include points, lines, curves, and regular shapes.
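By way of illustration only, the dictionary-plus-stream structure of a raster graphic can be sketched as follows. The keys shown (/Type, /Subtype, /Width, /Height, /ColorSpace, /BitsPerComponent) are entries defined by the PDF standard for image dictionaries; the class and function names are hypothetical and are introduced here only for the example.

```python
from dataclasses import dataclass


@dataclass
class ImageXObject:
    """Hypothetical container for a raster graphic: a dictionary plus a stream."""
    properties: dict  # the image dictionary (names kept in PDF notation)
    stream: bytes     # the associated image data


def is_image_xobject(properties: dict) -> bool:
    """Return True if a dictionary describes an Image XObject.

    /Type is optional for XObjects, so only a conflicting value rules one out;
    /Subtype must be /Image.
    """
    return (properties.get("/Type", "/XObject") == "/XObject"
            and properties.get("/Subtype") == "/Image")


# A 2 x 2 grayscale image, one byte per pixel (illustrative values only).
example = ImageXObject(
    properties={
        "/Type": "/XObject",
        "/Subtype": "/Image",
        "/Width": 2,
        "/Height": 2,
        "/ColorSpace": "/DeviceGray",
        "/BitsPerComponent": 8,
    },
    stream=bytes([0, 255, 255, 0]),
)
```

A vector graphic, by contrast, has no such dictionary and is expressed as a sequence of path construction and painting instructions in the page content.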
An image, or rather, what a human reader considers as one image, can be composed of a combination of these elements. The simplest case is when one image is composed of a single raster element in the PDF. In other cases, several raster elements are used to build “one” image. Vector graphics may also be used, either alone, together with text elements, or in combination with raster graphics.
One problem which arises is that the PDF standard does not define an image structure. This means that elements composing one image are rendered independently. The detection of the “final” image is thus done by the human reader. Hence automatic recognition of images, and the elements which compose them, is difficult.
It would be advantageous to have a document analysis system which could process such files and regroup the different elements corresponding to one image for presentation to a user, separately from the entire document, for example.
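To make the regrouping problem concrete, one naive approach is sketched below. It is not the exemplary method. It simply assumes that each element has a known bounding box and merges elements whose boxes overlap or lie within a chosen gap of one another. The names Box, boxes_touch, and group_elements, and the gap parameter, are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed page units


def boxes_touch(a: Box, b: Box, gap: float = 0.0) -> bool:
    """True if two boxes overlap or are separated by at most `gap` on both axes."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def group_elements(boxes: List[Box], gap: float = 0.0) -> List[List[int]]:
    """Group element indices into connected sets of touching boxes (union-find)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_touch(boxes[i], boxes[j], gap):
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two overlapping elements and one isolated element.
boxes = [(0, 0, 10, 10), (8, 8, 20, 20), (50, 50, 60, 60)]
print(group_elements(boxes))  # [[0, 1], [2]]
```

A purely geometric grouping of this kind can merge unrelated neighboring elements or split a sparse diagram, which is one reason the problem is difficult.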
Methods for processing graphical elements in documents are disclosed, for example, in Mingyan Shao and Robert P. Futrelle, Graphics Recognition in PDF documents, in Sixth Intern'l Assoc. for Pattern Recognition (IAPR) International Workshop on Graphics Recognition (GREC 2005), Hong Kong, 2005; and Claudie Faure and Nicole Vincent, Detection of figure and caption pairs based on disorder measurements, in Proc. Intern'l Soc. for Optics and Photonics (SPIE) 7534, 75340S, pp. 1-10, 2010. In the first reference, the authors aim to extract sub-diagrams using horizontal and vertical separating white spaces, but do not treat the sub-diagrams together as a whole diagram. The second reference describes a method for extracting figures and associated captions from scanned documents from the 19th century using the geometrical relation between a figure and its caption. However, the method is unable to detect figure-caption pairs in contemporary scientific documents when a figure is a mixture of small geometrical objects, graphic lines, and text lines, as is often the case.
OCR engines also offer a partial solution to this problem. They rely on a zoning step. Zoning in OCR is the process of creating zones that correspond to specific attributes of a page element. A zone can be identified as a non-text graphic, alphanumeric, or numeric. While zoning is effective for stand-alone photographs, diagrams remain challenging for OCR processing.
Some tools, such as pdf2svg (available on the website pdftron.com), convert a PDF file into the SVG (Scalable Vector Graphics) format. However, this process simply rewrites the PDF instructions as SVG instructions, thereby generating an “image” of the entire page without any sub-structure.
The exemplary system, method, and computer program product address the problem of identifying images in PDF documents, allowing the images to be extracted or otherwise distinguished from other content of a page.