1) Field of the Invention
The present invention relates to a technology for extracting an area that includes a character from image data.
2) Description of the Related Art
In general, the process of sorting a document image, input into a computer through an image input device such as a scanner or a digital camera, into document constituent elements, namely character blocks, photographs/pictures/illustrations, tables, and rulings, is called “geometric layout analysis” or “page segmentation”. The geometric layout analysis is often carried out on a binary document image. It is also associated with “skew correction”, a preprocessing step in which a skew caused at the time of inputting is corrected. The geometric layout analysis of a binary document image that has been subjected to the skew correction is divided into two broad approaches: top-down analysis and bottom-up analysis.
The top-down analysis will be explained first. The top-down analysis breaks larger constituent elements into smaller ones; for example, a page is broken into columns, each column is broken into paragraphs, and each paragraph is broken into character lines. The top-down analysis is advantageous in that calculation is simplified by using a model based on an assumption about the page layout structure (for example, that character lines in a Manhattan layout are upright rectangles). If the assumption does not hold for the data, however, there is a drawback that a fatal error may result. For a complicated layout, the model also becomes complicated in most cases, and thus it is not easy to deal with such a layout.
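The top-down idea can be illustrated with a minimal sketch, given here only as an assumption-laden example and not as a method taken from the cited art. It assumes a Manhattan layout and a binary image held as a list of rows of 0/1 values (1 = black); the function names and the gap thresholds `column_gap` and `line_gap` are hypothetical. The page is first cut into columns at wide vertical white gaps, and each column is then cut into character lines at horizontal white gaps.

```python
def split_on_gaps(profile, min_gap):
    """Return (start, end) ranges of inked runs separated by >= min_gap empty entries."""
    segments = []
    start = None   # index where the current inked run began
    end = 0        # one past the last inked index seen in the current run
    gap = 0        # length of the current run of empty entries
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            end = i + 1
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start, end))
                start = None
    if start is not None:
        segments.append((start, end))
    return segments


def top_down(image, column_gap=2, line_gap=1):
    """Cut a page into columns, then each column into lines (x0, y0, x1, y1)."""
    height, width = len(image), len(image[0])
    # Vertical projection profile: number of black pixels in each pixel column.
    col_profile = [sum(image[y][x] for y in range(height)) for x in range(width)]
    columns = []
    for x0, x1 in split_on_gaps(col_profile, column_gap):
        # Horizontal projection profile restricted to this column.
        row_profile = [sum(image[y][x0:x1]) for y in range(height)]
        lines = [(x0, y0, x1, y1) for y0, y1 in split_on_gaps(row_profile, line_gap)]
        columns.append(lines)
    return columns


page = [
    [1, 1, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1, 0],
]
layout = top_down(page)
```

If the layout assumption fails, for example when a headline spans both columns, the vertical profile has no white gap and the two columns are returned as one, which is the kind of fatal error described above.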
Next, the bottom-up analysis will be explained. In the bottom-up analysis, constituent elements are integrated by referring to their positional relationship with adjacent elements, as described in Japanese Patent Application Laid-open No. 2000-067158 and Japanese Patent No. 3187895. This is an approach where smaller constituent elements are grouped into larger elements; for instance, connected components are put together into a line, and lines are put together into a column. Japanese Patent Application Laid-open No. 2000-067158 discloses a bottom-up analysis method that is based on local information. Although this method can cope with various layouts without depending much on assumptions regarding the layout of the entire document image data, there is a drawback that locally made judgment errors may accumulate. For example, if two words in two different columns are mistakenly integrated into one character line, the two columns are mistakenly extracted as one column. Furthermore, the method of integrating constituent elements disclosed in Japanese Patent No. 3187895 requires knowledge of the features of character sequences and of the writing orientation (vertical writing or horizontal writing) for each language.
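The accumulation of local errors can be seen in a small sketch. It is not the method of either cited publication; it is a simple greedy merge written for illustration, assuming horizontal writing and bounding boxes given as (x0, y0, x1, y1). The function name and the threshold `max_gap` are hypothetical.

```python
def group_into_lines(boxes, max_gap):
    """Greedily merge component boxes into lines, scanning from left to right.

    A box joins an existing line when the two overlap vertically and the
    horizontal gap between them is at most max_gap.
    """
    lines = []
    for box in sorted(boxes):  # sorted by x0, so boxes are visited left to right
        for i, line in enumerate(lines):
            v_overlap = min(line[3], box[3]) - max(line[1], box[1])
            h_gap = box[0] - line[2]
            if v_overlap > 0 and h_gap <= max_gap:
                # Replace the line by the bounding box of the line and the new box.
                lines[i] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(box)
    return lines


# Two words in the first line of column 1, one word in column 2,
# and one word in the second line of column 1.
words = [(0, 0, 3, 2), (4, 0, 7, 2), (12, 0, 15, 2), (0, 4, 3, 6)]
careful = group_into_lines(words, max_gap=2)
too_loose = group_into_lines(words, max_gap=5)
```

With `max_gap=2` the word in the second column stays separate. With `max_gap=5` the single local decision to bridge the column gutter produces one line spanning both columns, so any later step that groups lines into columns inherits the mistake.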
As explained above, the two approaches are complementary to each other, and several methods have been proposed in an effort to fill the gap between them. Among these, there are methods that do not depend on differences between languages. One such method uses the portions other than characters, i.e. the “background”, or so-called “white background” in the case of binary document images. The advantages of using the background or white background are:
(1) Because it does not matter which language is being processed (white background is used as a separator in most languages), knowledge of the writing orientation (vertical writing or horizontal writing) is not required.
(2) The processing is global, so local judgment errors are less likely to accumulate.
(3) It can flexibly cope with complicated layouts.
Among such background analysis methods, the “maximum white-block group page segmentation” is a typical method.
The “maximum white-block group page segmentation” will be briefly explained here. Preparatory to this, the “maximum white-block problem” will be defined. First, rb denotes the block area corresponding to the entire document image data, and C=[r0, r1, . . . , rn] (ri⊂rb; i=0, 1, . . . , n) denotes the set of block areas each enclosing a connected black component of the binary document image. An exemplary set of block areas is shown in FIG. 6. Further, an evaluation function Q that satisfies the property described below is introduced for blocks. Regarding two blocks r and r′, the evaluation function satisfies:
if r⊂r′ then Q(r)≦Q(r′)
For instance, the above property is satisfied when the function Q(r) is the area of the block r. The “maximum white-block problem” is the problem of finding, from among the blocks contained in rb that do not overlap with any of the elements r0, r1, . . . , rn of C, the block that gives the maximum value of Q. As an extension of this problem, H. S. Baird, “Background structure in document images”, in Document Image Analysis (H. Bunke, P. S. P. Wang, and H. S. Baird, Eds.), Singapore: World Scientific, 1994, pp. 17-34, and T. M. Breuel, “Two algorithms for geometric layout analysis”, in Proceedings of IAPR Workshop on Document Analysis Systems (Princeton, N.J., USA), 2002, propose algorithms that enumerate the “maximum white-blocks”, i.e. white-blocks that would overlap with an element of C if they were expanded any further, in descending order of Q.
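A minimal sketch of how the single maximum white-block can be found is given below. It follows the general branch-and-bound idea commonly associated with the Breuel reference (split a candidate block around one obstacle and keep the most promising sub-block first), but it is an illustrative reconstruction, not the published algorithm verbatim. Blocks are assumed to be half-open rectangles (x0, y0, x1, y1), Q defaults to the area, and all function names are hypothetical. The search is correct only because Q has the monotonic property stated above: Q of a candidate block is an upper bound on Q of every white-block inside it.

```python
import heapq
from itertools import count


def area(r):
    """Area of a rectangle (x0, y0, x1, y1); degenerate rectangles have area 0."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def overlaps(a, b):
    """True if half-open rectangles a and b share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def max_white_block(page, obstacles, quality=area):
    """Return a block inside `page` that overlaps no obstacle and maximizes `quality`."""
    tie = count()  # tie-breaker so the heap never has to compare rectangles
    start_obs = [o for o in obstacles if overlaps(o, page)]
    heap = [(-quality(page), next(tie), page, start_obs)]
    while heap:
        _, _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # best upper bound and already white, so it is optimal
        # Pivot: the obstacle whose center is closest to the center of r.
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        pivot = min(obs, key=lambda o: ((o[0] + o[2]) / 2 - cx) ** 2
                                       + ((o[1] + o[3]) / 2 - cy) ** 2)
        x0, y0, x1, y1 = r
        # Every white-block inside r lies entirely on one side of the pivot.
        candidates = [(x0, y0, pivot[0], y1), (pivot[2], y0, x1, y1),
                      (x0, y0, x1, pivot[1]), (x0, pivot[3], x1, y1)]
        for sub in candidates:
            if area(sub) > 0:
                sub_obs = [o for o in obs if overlaps(o, sub)]
                heapq.heappush(heap, (-quality(sub), next(tie), sub, sub_obs))
    return None  # the page is completely covered by obstacles


page = (0, 0, 10, 10)
black = [(0, 0, 10, 2), (3, 2, 5, 10)]
best = max_white_block(page, black)
```

In this example the white area to the left of the second obstacle measures 3 by 8 and the area to its right measures 5 by 8, so the right-hand block is returned.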
By covering the background area (blank area of the binary document image) with a group of maximum white-blocks in a manner as described above, it is expected that document constituent elements such as columns and text lines can be extracted as “portions uncovered by any of the white-blocks”.
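The extraction of “portions uncovered by any of the white-blocks” can be sketched as follows. This is an illustrative example under simple assumptions, not a procedure stated in the cited references: the page is rasterized on an integer grid, white-blocks are half-open rectangles (x0, y0, x1, y1), and each 4-connected uncovered region is reported by its bounding box. The function name is hypothetical.

```python
from collections import deque


def uncovered_regions(width, height, white_blocks):
    """Return bounding boxes of 4-connected regions not covered by any white-block."""
    covered = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in white_blocks:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                covered[y][x] = True

    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if covered[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over the uncovered cells of one region.
            queue = deque([(x, y)])
            seen[y][x] = True
            bx0, by0, bx1, by1 = x, y, x + 1, y + 1
            while queue:
                cx, cy = queue.popleft()
                bx0, by0 = min(bx0, cx), min(by0, cy)
                bx1, by1 = max(bx1, cx + 1), max(by1, cy + 1)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not covered[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((bx0, by0, bx1, by1))
    return boxes


# Four page margins and one gutter between two columns, all given as white-blocks.
whites = [(0, 0, 10, 1), (0, 5, 10, 6), (0, 0, 1, 6), (9, 0, 10, 6), (4, 0, 6, 6)]
columns = uncovered_regions(10, 6, whites)
```

Here the margins and the gutter cover everything except two rectangular regions, which correspond to the two columns of the page.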
However, the methods that belong to the background analysis such as the “maximum white-block group page segmentation” have a drawback that it is difficult to deal with complicated layouts specific to a language.