1. Field of the Invention
The present invention relates to a technology of converting raw images to vector data.
2. Description of the Related Art
In recent years, the digitization of information has led to the spread of systems that store or transmit electronic documents, generated by digitizing paper documents, instead of keeping the documents on paper. The documents subject to such digitization are expanding from black-and-white binary documents to full-color (multivalued) documents.
The term “electronic documents” here not only refers to image digitization of paper-based documents using an image reading apparatus such as a scanner, but also includes image data resulting from region segmentation performed on obtained document images on a per-attribute basis and post-processing performed according to each region. Examples of such post-processing include, for text regions, processing involving character recognition for conversion into character code strings. In addition, examples for line art regions include processing involving conversion into outline vector data.
Conventionally, many attempts have been made to create such electronic documents. A conventional example of region segmentation of document images is described in Japanese Patent Laid-Open No. 2002-314806.
This literature discloses a configuration in which a binarized image of an inputted color image is generated, and the generated binarized image is segmented into regions including, for instance, a text region, a line art region, a photographic region (picture region) and the like. The region segmentation method used therein involves calculating connectedness of a binary image to determine the sizes of clusters of black pixels, and segmenting the binary image into a text region, a line art region, a picture region and the like while collating the characteristics of each region.
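The cluster-size calculation mentioned above can be illustrated with a minimal sketch. It is not the method of the cited literature; it merely assumes that the binary image is given as a list of rows in which 1 denotes a black pixel, and that clusters are formed by 8-connectivity. The function name black_cluster_sizes is hypothetical.

```python
from collections import deque


def black_cluster_sizes(binary):
    """Return the pixel count of each 8-connected cluster of black (1) pixels.

    Assumption: `binary` is a non-empty list of equal-length rows of 0/1 values.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search over one cluster of black pixels.
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            sizes.append(size)
    return sizes


sample = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
sizes = black_cluster_sizes(sample)  # [2, 2, 1]
```

In such a scheme, small clusters of similar size would be candidates for text, whereas large clusters would suggest line art or picture regions; the actual classification criteria are those of the cited literature and are not reproduced here.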
In addition, a conventional example of outline vectorization for converting into outline vector data is described in Japanese Patent No. 02885999. In this literature, contour lines are vectorized by performing contour line tracing on a binary image and selecting obtained coordinate vectors. Furthermore, obtained vector data may even be used in a CAD system by substituting the vector data with a graphic instruction for drawing a polygon or the like.
A sample of a document image will now be described using FIG. 13.
This document image is printed on recording paper by an output apparatus such as a printer. The characters arranged in the document image include large characters, such as a title, and relatively small characters, such as descriptive text. The images include a photographic image and an image (such as an illustration image) which contains relatively fewer output colors than photographic images (natural images). Herein, images with a relatively small number of output colors will be referred to as clipart images.
By reading the printed material on which the document image is printed with an image reading apparatus such as an image scanner, and performing region segmentation processing on the read image, a text region 23, a photographic region 21 and a clipart region 22 are obtained, as shown in the drawing.
In addition, with regard to the clipart region 22, a separate “region segmentation processing” is performed on the image comprising the clipart region 22 to collect same-colored portions and fuse such portions into one region. Next, vectorization processing is performed on each obtained same-colored region. Through this vectorization processing, vectorization of each region obtained by segmenting the clipart region according to color may conceivably be realized by representing each obtained same-colored region by its contour line and internal color information.
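The idea of representing each same-colored region by its contour and internal color can be sketched as follows. This is a simplified illustration under stated assumptions, not the processing of any cited literature: colors are compared for exact equality (no fusing of similar colors), regions are formed by 4-connectivity, and the "contour" is merely the set of region pixels touching another region or the image border, without tracing or vector approximation. The function name same_color_regions is hypothetical.

```python
from collections import deque


def _neighbors(y, x):
    """4-connected neighbor coordinates of pixel (y, x)."""
    return ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))


def same_color_regions(image):
    """Split `image` (rows of hashable color values) into same-colored regions.

    Each region is a dict holding its color, its pixels, and its contour pixels.
    """
    height, width = len(image), len(image[0])
    label = [[-1] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if label[y][x] != -1:
                continue
            color, index = image[y][x], len(regions)
            label[y][x] = index
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in _neighbors(cy, cx):
                    if (0 <= ny < height and 0 <= nx < width
                            and label[ny][nx] == -1 and image[ny][nx] == color):
                        label[ny][nx] = index
                        queue.append((ny, nx))
            regions.append({"color": color, "pixels": pixels})
    for index, region in enumerate(regions):
        # A contour pixel has at least one neighbor outside its own region.
        region["contour"] = [
            (cy, cx) for cy, cx in region["pixels"]
            if any(not (0 <= ny < height and 0 <= nx < width)
                   or label[ny][nx] != index
                   for ny, nx in _neighbors(cy, cx))
        ]
    return regions


# A 5x5 white image with a 3x3 green square in the middle.
clipart = [["white"] * 5 for _ in range(5)]
for row in range(1, 4):
    for col in range(1, 4):
        clipart[row][col] = "green"
regions = same_color_regions(clipart)
```

For this sample, the green region consists of nine pixels, of which the eight outer pixels form its contour; the pair of contour and color is the information that a vectorization step would then convert into outline vector data.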
However, with the above-described region segmentation processing within a clipart region, the following problems occur.
These problems will now be described using FIG. 14.
FIG. 14 is a diagram for explaining an example of region-segmenting a clipart region into same-colored regions.
Reference numeral 30 denotes an example of a processing object raw image. Reference numeral 31 denotes an example of a region image (contour line image) segmented from the read raw image. Reference numeral 32 denotes an example of an edge image obtained by performing edge extraction processing on the read image.
As depicted, the contours differ from image to image. These differences may be attributed to misalignment of the extracted contours. Such misalignment is caused by variations in density around the edges, which result in thinning of colors or the occurrence of false colors, and which are in turn due to variations in level upon reading (blurring, read resolution and the like) or to deterioration in image quality caused by image compression.
In FIG. 14, the shapes of the contour lines of the raw image 30 and the edge image 32 are relatively similar, while the shapes of the contour lines of the region image 31 and the edge image 32 (or the raw image 30) are considerably different. Therefore, vectorization processing performed on a clipart region obtained in this manner will not generate vector data capable of faithfully representing the configuration of the raw image.
Additionally, for instance, when vectorization processing based on the above-described contour extraction is performed on a clipart image in which the color green was originally clear, the green may disappear during binarization due to noise. Furthermore, even with vectorization processing based on the above-described region segmentation, there is a problem in which the green portion is segmented into a large number of clusters, or a problem in which non-green portions and green portions are erroneously placed in the same cluster. Such problems may result in an increase in vector data volume, or in segmentation into inefficient shapes when the data is componentized.
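How a color can vanish during binarization may be illustrated with a minimal sketch. The sketch assumes a fixed luminance threshold of 128 and the commonly used luminance weights 0.299, 0.587 and 0.114 for the red, green and blue components; neither assumption is taken from the cited literature, and the influence of noise is not modeled. The function name binarize is hypothetical.

```python
def binarize(image, threshold=128):
    """Binarize rows of (r, g, b) pixels: 1 = black, 0 = white.

    Assumption: a pixel is black when its luminance is below `threshold`.
    """
    result = []
    for row in image:
        out_row = []
        for r, g, b in row:
            luminance = 0.299 * r + 0.587 * g + 0.114 * b
            out_row.append(1 if luminance < threshold else 0)
        result.append(out_row)
    return result


WHITE, GREEN, BLACK = (255, 255, 255), (0, 255, 0), (0, 0, 0)
sample_image = [
    [WHITE, GREEN],
    [BLACK, WHITE],
]
binary_image = binarize(sample_image)  # [[0, 0], [1, 0]]
```

Under these assumptions, the pure green pixel has a luminance of approximately 150 and is therefore mapped to white together with the background, so contour tracing on the binary image would find no trace of the green portion.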