1. Field of the Invention
The present invention generally relates to a method and an apparatus for extracting a raster image from a portable electronic document, and more specifically to a method and an apparatus for extracting a raster image from a portable electronic document by analyzing a format of the portable electronic document.
2. Description of the Related Art
Portable electronic documents, such as Portable Document Format (PDF) documents or PostScript (PS) documents, are widely used in daily clerical work. Such documents have an electronic document format for displaying documents, and are generated and output in a manner independent of application software, hardware, and operating system.
The portable electronic documents define two ways of recording raster images, namely Inline-images and Image XObjects. For an Inline-image, the PDF commands and the image data are all stored in the contents stream section of a page. For an Image XObject, the PDF commands are stored in the contents stream section of the page, whereas the image data are stored in the resources section of the page.
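The difference between the two recording systems can be illustrated with a minimal Python sketch. It assumes an already decoded (uncompressed) contents stream and relies on the PDF operators `BI`, `ID`, and `EI` that delimit an Inline-image and on the `Do` operator that invokes a named XObject. The function name `classify_images` and the sample stream are hypothetical, and the pattern matching is deliberately simplistic: binary image data that happens to contain the bytes of `EI`, or a `Do` that invokes a form XObject rather than an Image XObject, would require a real PDF parser and a lookup in the resources section.

```python
import re

# Inline-image: "BI <dictionary entries> ID <image data> EI", all in the contents stream.
INLINE_RE = re.compile(rb"(?<!\S)BI\s(.*?)\sID\s(.*?)\sEI(?!\S)", re.DOTALL)
# XObject invocation: "/Name Do"; the image data live outside the contents stream.
XOBJECT_RE = re.compile(rb"/([^\s/\[\]<>()]+)\s+Do(?!\S)")


def classify_images(content: bytes):
    """Return (inline image dictionaries, names of invoked XObjects)."""
    inline = [m.group(1).strip() for m in INLINE_RE.finditer(content)]
    xobjects = [m.group(1).decode("ascii") for m in XOBJECT_RE.finditer(content)]
    return inline, xobjects


# Hypothetical contents stream: one XObject invocation and one 2x2 grayscale Inline-image.
sample = (
    b"q 10 0 0 10 0 0 cm /Im1 Do Q\n"
    b"q BI /W 2 /H 2 /BPC 8 /CS /G ID \x00\xff\xff\x00 EI Q\n"
)
inline_dicts, xobject_names = classify_images(sample)
```

In this sketch the Inline-image is fully recoverable from the contents stream alone, while the name `Im1` only identifies an entry that must still be resolved through the resources section of the page.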
A raster image, also called a bitmap image, is displayed based on the pixels in the image, and is distinguished from a vector image, which is obtained by plotting a sequence of control points and connecting paths between the plotted control points. It is generally known that extracting components such as paragraphs and tables from the portable electronic documents is a difficult task. For example, when a raster image is extracted from a PDF file using Adobe Acrobat (Trademark) software, the extraction often yields undesired images. With Adobe Acrobat (Trademark) software, an Inline raster image embedded in the PDF document is difficult to extract. For example, Adobe Acrobat Reader (Trademark) can extract only an Image XObject raster image from the PDF file.
Generally, a raster image that appears visually intact in the PDF file is not stored as a single intact image but as plural image segments that are linked together so as to be rendered as an intact raster image. As a result, Adobe Acrobat (Trademark) software extracts the plural linked image segments rather than the intact raster image.
Further, borders in a table are represented by plural long and thin raster images in the PDF file, which can be extracted by Adobe Acrobat (Trademark) software; however, such long and thin raster images are generally not perceptually significant targets of detection or search. Since such long and thin raster images contain few significant characteristics for detection or search, users generally make no attempt to detect or search for them in the PDF file.
U.S. Pat. No. 5,832,530 A discloses a technology for extracting a word in a PDF file. This technology involves identifying a word composed of characters in text segments in the PDF file by detecting a word break (space) between words, or by detecting a space between adjacent characters in text segments. If the space between adjacent characters in text segments exceeds a predetermined threshold value, the adjacent characters are identified as belonging to two different words. In the technology disclosed in U.S. Pat. No. 5,832,530 A, the input is a PDF file and the output is a collection of words.
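The threshold rule described above can be sketched in Python as follows. This is only an illustration of the general idea, not the algorithm of the cited patent: the glyph representation (a character with its left and right horizontal positions), the function name `group_words`, and the threshold value are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, left x, right x) on a single text line.
Glyph = Tuple[str, float, float]


def group_words(glyphs: List[Glyph], gap_threshold: float) -> List[str]:
    """Split glyphs into words at explicit spaces or at gaps wider than the threshold."""
    words: List[str] = []
    current: List[str] = []
    prev_right = None
    for ch, left, right in glyphs:
        is_space = ch.isspace()
        wide_gap = prev_right is not None and (left - prev_right) > gap_threshold
        if (is_space or wide_gap) and current:
            words.append("".join(current))  # close the word collected so far
            current = []
        if not is_space:
            current.append(ch)
        prev_right = right
    if current:
        words.append("".join(current))
    return words


glyphs = [
    ("P", 0, 5), ("D", 5.2, 10), ("F", 10.1, 15),  # tightly spaced characters
    ("f", 19, 22), ("i", 22.1, 23),                # gap of 4 units before "f"
    (" ", 23, 26),                                 # explicit space character
    ("l", 26, 27), ("e", 27.1, 30),
]
words = group_words(glyphs, gap_threshold=2.0)
```

With these sample values, the wide gap before "f" and the explicit space each end a word, so the input is a sequence of positioned characters and the output is a collection of words.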
U.S. Pat. No. 6,801,673 B2 discloses a technology involving a tool for extracting content segments from a PDF file. In this technology, a user specifies an intended extraction region with a rectangular box in a PDF browser interface, extracts the specified rectangular extraction region, and stores the extracted content segment (i.e., the rectangular extraction region) as a new PDF file. In this technology, although PDF commands in the PDF file are extracted and pasted, document content having perceptually insignificant information on an image or a table is not extracted.
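Region-based extraction of this kind can be pictured with a small Python sketch that selects the page objects whose bounding boxes overlap a user-specified rectangle. The `Rect` type, the function names, the sample objects, and the choice of an overlap test (rather than full containment) are assumptions for illustration only; the cited patent is not stated here to work this way.

```python
from typing import Dict, List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle with (x0, y0) lower-left and (x1, y1) upper-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share a region of non-zero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def select_in_region(objects: Dict[str, Rect], region: Rect) -> List[str]:
    """Return the names of objects whose bounding boxes overlap the region."""
    return [name for name, box in objects.items() if intersects(box, region)]


# Hypothetical page objects and a user-drawn extraction box.
page_objects = {
    "text": Rect(10, 10, 100, 30),
    "image": Rect(50, 50, 200, 150),
    "rule": Rect(0, 300, 500, 301),  # long, thin table border
}
selected = select_in_region(page_objects, Rect(40, 20, 120, 100))
```

A selection of this kind operates purely on geometry, so it copies whatever commands fall inside the box without judging whether the selected content is perceptually significant.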