The invention is generally related to electronic data files. More particularly, the invention is related to extraction of a section of a portable document format document.
Electronic files may be created using a variety of techniques. Thus, it may be desirable to store data from an electronic file in a format that is independent of the process used to create it so that it may be accessible to a range of users. One format that allows such access is the portable document format. The portable document format (“pdf”) is a file format for representing documents in a manner independent of the application software, hardware, and operating system used to create the documents and independent of the output device on which they are displayed or printed.
A PDF workflow assumes a one-way production process where the PDF file contains a rendition that is laid out for final presentation, i.e., no logical structural information is preserved. Consequently, one problem with storing documents in a pdf format is that it is difficult to reuse parts of documents because elements with semantic affinity are not stored as one logical group of elements. Although it is possible to store the original editable document as an attribute in the PDF file, this is not generally done, either because the original program for creating the pdf document is unavailable anyway or because doing so introduces a vulnerability to computer viruses. Without the original editable document, removing a portion of the pdf document for use in another document or file is not easily accomplished. For example, it may be desirable for a user to insert a graph or chart from a pdf document into a document of the user's own creation or to make a slide presentation with the graph or chart. The PDF specification makes an allowance for including structural information; however, very few pdf documents are created with such structural information due to size constraints and/or creation processes. Thus, most pdf documents do not generally support sharing or repurposing the content of the document, and it is generally not possible to extract a figure, an illustration or a paragraph from a chapter as an integrated object from a PDF document.
There are a few techniques available for reusing pdf document content. However, some of these processes are complicated and require extensive user interaction, while others extract a raster rendition of the selected document portion from the display bitmap, thereby losing all original document structure and attribute information, as well as resolution, which is usually limited to the 72 dpi screen resolution.
An aspect of an embodiment of the invention is to provide a method for extracting a section of a portable document format (“pdf”) document.
In one embodiment, the method may include receiving an indication of a user-defined region on a pdf file page, determining whether each element on the pdf page is within the user-defined region, designating an extraction region including all elements determined to be within the user-defined region, and placing the extraction region into a new pdf file.
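The region test and the designation of the extraction region may be illustrated with a minimal sketch. It is offered only as an illustration under stated assumptions and not as the claimed implementation: each page element is assumed to carry an axis-aligned bounding box, "within" is assumed to mean fully contained, and the names Rect, Element, select_elements and extraction_region are hypothetical. Parsing the page and writing the new pdf file are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        # Assumption: an element is "within" the region only if fully enclosed.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)

    def union(self, other: "Rect") -> "Rect":
        # Smallest rectangle covering both rectangles.
        return Rect(min(self.x0, other.x0), min(self.y0, other.y0),
                    max(self.x1, other.x1), max(self.y1, other.y1))


@dataclass(frozen=True)
class Element:
    """A page element (text run, image, path) with its bounding box."""
    name: str
    bbox: Rect


def select_elements(elements: List[Element], region: Rect) -> List[Element]:
    # Test every element on the page against the user-defined region.
    return [e for e in elements if region.contains(e.bbox)]


def extraction_region(selected: List[Element]) -> Optional[Rect]:
    # The extraction region is taken here as the box enclosing all selected elements.
    if not selected:
        return None
    box = selected[0].bbox
    for e in selected[1:]:
        box = box.union(e.bbox)
    return box


# Illustrative page with made-up coordinates.
page = [
    Element("chart", Rect(100, 400, 300, 550)),
    Element("caption", Rect(100, 380, 300, 395)),
    Element("body_text", Rect(50, 100, 550, 350)),
]
user_region = Rect(90, 370, 310, 560)

selected = select_elements(page, user_region)
region = extraction_region(selected)
```

In this sketch the chart and its caption fall inside the user-defined region while the body text does not, so the extraction region tightens to the box enclosing the chart and caption. Other containment rules, such as partial overlap, would be equally plausible readings of "within".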
Those skilled in the art will appreciate these and other advantages and benefits of various embodiments of the invention upon reading the following detailed description of preferred embodiments with reference to the below-listed drawings.
Another aspect of the invention includes checking the extracted region for accuracy. In one embodiment, both the extracted region and the region in the original document may be converted to bitmap images and compared bit by bit.
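A bit-by-bit comparison of two bitmaps may be sketched as follows. This is an illustration only: it assumes both regions have already been rendered to raw bitmap buffers of identical dimensions, and the function name count_differing_bits and the sample buffers are hypothetical. The rendering step itself is outside the sketch.

```python
def count_differing_bits(original: bytes, extracted: bytes) -> int:
    """Return the number of bit positions at which the two bitmaps differ."""
    if len(original) != len(extracted):
        # Assumption: both bitmaps were rendered at the same size and depth.
        raise ValueError("bitmaps must have the same length")
    # XOR leaves a 1 in every bit position where the two bytes disagree.
    return sum(bin(a ^ b).count("1") for a, b in zip(original, extracted))


# Made-up two-byte bitmaps for illustration.
original_bitmap = bytes([0b11110000, 0b00001111])
matching_bitmap = bytes([0b11110000, 0b00001111])
altered_bitmap = bytes([0b11110001, 0b00001111])

same = count_differing_bits(original_bitmap, matching_bitmap)
different = count_differing_bits(original_bitmap, altered_bitmap)
```

A count of zero indicates that the extracted region renders identically to the corresponding region of the original document. Any nonzero count flags a discrepancy for further inspection.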