1. Field of the Invention
The present invention relates to an apparatus, method, and computer program for analyzing layout of a document to extract blocks of document text. More particularly, the present invention relates to a document layout analyzing apparatus, method, and computer program that extract text blocks from a given document image on the basis of accuracy of text in each block.
2. Description of the Related Art
Optical character readers (OCR) are widely used today to identify characters on a document through the use of optical sensing devices such as image scanners. Their output, or recognized text data, is provided in the form of character codes. The functions of OCR can be implemented as computer software programs.
A text recognition process using an OCR device begins with capturing an optical image of a given document containing printed characters, handwritten characters, and other objects. The OCR device locates each block of text in the scanned document image, extracts character components from the located text blocks, and recognizes those characters by using pattern matching or other algorithms. The text block extraction process involves analyzing the physical layout of the various objects constituting a document, which include, for example, discrete characters, lines (rows of characters), text blocks, figures, tables, and cells.
Several methods have been proposed to implement the function of extracting text blocks from a given document image. For example, Japanese Patent Application Publication No. 11-219407 (1999) discloses a technique based on proximity and homogeneity of objects. Specifically, when a set of primitive elements is given, the method first identifies lines by combining such elements that are located in relatively close proximity and have similar sizes. The method then combines the lines in the same way (i.e., based on the proximity and physical homogeneity of lines), thereby identifying paragraphs, or text blocks.
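By way of illustration only, the proximity-and-homogeneity idea described above may be sketched as follows. This is a minimal sketch and not the implementation of the cited publication: the `Box` type, the function names, and the thresholds `max_gap` and `ratio` are all assumptions introduced here. Character bounding boxes are merged into the same line when they have similar heights, overlap vertically, and are separated by a small horizontal gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a primitive element (pixel coordinates)."""
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def height(self) -> int:
        return self.y1 - self.y0


def similar_size(a: Box, b: Box, ratio: float) -> bool:
    """Homogeneity test: heights differ by at most the given ratio."""
    lo, hi = sorted((a.height, b.height))
    return lo > 0 and hi / lo <= ratio


def horizontally_close(a: Box, b: Box, max_gap: int) -> bool:
    """Proximity test: small horizontal gap and some vertical overlap."""
    gap = max(a.x0, b.x0) - min(a.x1, b.x1)
    v_overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    return gap <= max_gap and v_overlap > 0


def group_into_lines(boxes, max_gap=10, ratio=1.5):
    """Merge boxes into lines with a simple union-find over all pairs."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if (similar_size(boxes[i], boxes[j], ratio)
                    and horizontally_close(boxes[i], boxes[j], max_gap)):
                parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    # Each line is returned with its boxes ordered left to right.
    return [sorted(g, key=lambda b: b.x0) for g in groups.values()]
```

The same pairwise test could, under the same assumptions, be applied again to the bounding boxes of the resulting lines in order to combine lines into text blocks. The sketch also suggests why such a method may over-combine: any pair of elements that passes the two tests is merged, regardless of the larger layout.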
Another example is Japanese Patent Application Publication No. 2-263272 (1990). According to this publication, the proposed method searches a document image to find blank areas satisfying a predetermined condition about their sizes. Text blocks can then be identified by extracting image areas other than the areas covered by those blank areas.
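The blank-area approach may likewise be illustrated with a simplified, one-dimensional sketch. It is not the method of the cited publication: it assumes a binarized image given as a list of rows (1 for ink, 0 for background), considers only horizontal blank bands spanning the full image width, and uses a hypothetical size condition `min_height`. The function names are introduced here for illustration.

```python
def blank_row_runs(image, min_height):
    """Return (start, end) row ranges that are entirely blank and at
    least min_height rows tall. The end index is exclusive."""
    runs = []
    start = None
    for y, row in enumerate(image):
        if not any(row):               # row contains no ink
            if start is None:
                start = y
        else:
            if start is not None and y - start >= min_height:
                runs.append((start, y))
            start = None
    if start is not None and len(image) - start >= min_height:
        runs.append((start, len(image)))
    return runs


def text_bands(image, min_height):
    """Return the row ranges NOT covered by qualifying blank runs.
    Blank runs shorter than min_height stay inside a band."""
    bands = []
    top = 0
    for start, end in blank_row_runs(image, min_height):
        if start > top:
            bands.append((top, start))
        top = end
    if top < len(image):
        bands.append((top, len(image)))
    return bands
```

For example, with `min_height=2`, a single blank row between two text rows does not split them, whereas three consecutive blank rows do. A full implementation would need to search for two-dimensional blank areas, which this sketch does not attempt.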
Many real-world documents, however, have their own unique object layouts, which are often complicated as well, and the existing methods described above sometimes fail to extract correct text blocks from them. For example, the first-mentioned method (No. 11-219407) may overly combine character components found in a document when its text blocks are laid out in a convoluted arrangement, or when text blocks and figures are mixed in a complicated way. In such cases, two or more text lines could be mistakenly recognized as a single line. As another example, the second-mentioned method (No. 2-263272) may encounter considerable difficulty in extracting text blocks when the areas that separate them from other objects in a document are not simple rectangles.
To solve the above problems, we, the applicants, have proposed a new document layout analysis program that can extract text blocks from a document having a complicated layout, which has been filed as Japanese Patent Application No. 2004-059954. The proposed program treats blank areas in a document image as virtual separators dividing text blocks, with the size of the blank areas specified as a process parameter. Each resulting text block is subjected to a validity test, and text block extraction is executed recursively, with the parameter value modified each time, until a collection of text blocks satisfying predetermined validity requirements is obtained. This approach enables analysis of a complex document layout to extract correct text blocks.
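The recursive control flow described above may be sketched as follows. This is an illustrative sketch under stated assumptions and not the program of the earlier application: it works on a one-dimensional row profile (`rows[y]` is true when row `y` contains ink), it assumes that the parameter `min_gap` is modified by decreasing it by one per recursion, and the validity test `is_valid` is supplied by the caller as a hypothetical stand-in for the actual validity requirements.

```python
def split_rows(rows, top, bottom, min_gap):
    """Split the range [top, bottom) at runs of at least min_gap blank
    rows and return the (top, bottom) ranges that contain ink."""
    bands = []
    start = None   # first inked row of the band being built
    blank = 0      # length of the current run of blank rows
    for y in range(top, bottom):
        if rows[y]:
            if start is None:
                start = y
            blank = 0
        else:
            blank += 1
            if start is not None and blank >= min_gap:
                bands.append((start, y - blank + 1))
                start = None
    if start is not None:
        bands.append((start, bottom - blank))  # trim trailing blanks
    return bands


def extract_blocks(rows, top, bottom, min_gap, is_valid):
    """Recursively re-split any band that fails the validity test,
    using a smaller separator size each time (assumed policy)."""
    blocks = []
    for band in split_rows(rows, top, bottom, min_gap):
        if is_valid(band) or min_gap <= 1:
            blocks.append(band)        # accept, or give up refining
        else:
            blocks.extend(
                extract_blocks(rows, band[0], band[1], min_gap - 1, is_valid)
            )
    return blocks
```

Note that in this sketch the outcome depends on the value of `min_gap` passed to the outermost call, since that value determines which separators are found first and therefore which bands are ever tested or refined.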
There are, however, some documents from which the above-described analysis program (No. 2004-059954) is unable to extract appropriate text blocks. We suspect that the performance limitation of this program comes from the fact that the initial value of the parameter used to find blank separators is fixed. Although the parameter changes in the course of analysis, the final result of extraction still depends on the fixed initial value of that parameter, and it is unlikely that a single fixed value would fit every given document. This is why the proposed analysis program sometimes produces incorrect text blocks.
Let us discuss the issue in greater depth. The proposed analysis program (No. 2004-059954) may miss a blank separator in the first cycle of its separator identification process because of an inclination of the scanned document image or noise present in that image. Missing a separator could result in an overly consolidated text block. While the program may find a separator there in the second or a subsequent cycle, the separator identified in such situations would not always be appropriate, so the result may still be an overly consolidated text block.
When the document includes some large characters, as in a subject line, the analysis program (No. 2004-059954) could, in the first cycle of its separator extraction process, misinterpret a blank space within a large character image as a valid separator. If this happens, the line containing that character will be recognized as two separate lines. The analysis program, however, has no function for recombining such divided lines.