1) Field of the Invention
The present invention relates to a technology for analyzing the layout of a document image corresponding to a form. More particularly, this invention relates to a technology for analyzing, with high precision, the layout of lines (character strings) and of paragraphs each including such lines, even when a line branches into a plurality of lines partway along its length or when lines are enclosed within parentheses in the document image.
2) Description of the Related Art
A conventional method of analyzing the layout of characters and lines in a document image has been disclosed in, for example, Japanese Patent Application Laid-Open No. 7-192083, entitled “Document picture layout analysis device”. According to this conventional method, in a document image in which different character sizes coexist, a plurality of circumscribed rectangles (characters) are classified into groups each having the same character size, based on the area of the circumscribed rectangle corresponding to each character. The results of analyzing the layouts of these classified groups are then combined using priorities of the layouts.
According to this method, projection patterns of the circumscribed rectangles are obtained, and the layout of the lines is analyzed by taking into account the periodicity of the layout of the lines.
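For illustration only, the following is a minimal sketch of how lines might be found from a projection pattern of circumscribed rectangles. It is not taken from the cited publication. The rectangle representation (left, top, right, bottom, with right and bottom exclusive), the function names `horizontal_projection` and `lines_from_projection`, and the sample coordinates are all assumptions made for this example, and the periodicity analysis mentioned above is omitted.

```python
from typing import List, Tuple

# Hypothetical rectangle format: (left, top, right, bottom), right/bottom exclusive.
Rect = Tuple[int, int, int, int]


def horizontal_projection(rects: List[Rect], height: int) -> List[int]:
    """For each pixel row, count how many rectangles cover that row."""
    profile = [0] * height
    for _, top, _, bottom in rects:
        for y in range(max(top, 0), min(bottom, height)):
            profile[y] += 1
    return profile


def lines_from_projection(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (top, bottom) bands where the projection is non-zero."""
    bands = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                    # a band of covered rows begins
        elif count == 0 and start is not None:
            bands.append((start, y))     # the band ends at the first empty row
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


# Two text lines of three characters each, separated by a blank gap.
rects = [(0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10),
         (0, 15, 10, 25), (12, 15, 22, 25), (24, 15, 34, 25)]
profile = horizontal_projection(rects, 30)
bands = lines_from_projection(profile)
print(bands)  # [(0, 10), (15, 25)]
```

In this sketch, each run of non-empty rows in the projection is treated as one line, so the two rows of characters are reported as two separate bands.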
With this conventional method of analyzing the layout, it is possible to discriminate between characters of different sizes. However, it is not possible to analyze the layout of a line with high precision when the line branches into a plurality of lines partway along its length, as shown at a portion 10a of a document image 10 in FIG. 41A. This is because the layout analysis is carried out using the projection pattern of the circumscribed rectangles, as described above.
Further, because the conventional method analyzes the line layout using the projection pattern of the circumscribed rectangles, it also cannot analyze the layout of lines with high precision when a plurality of lines exist within parentheses, as shown at a portion 20a of a document image 20 in FIG. 41B and at a portion 30a of a document image 30 in FIG. 41C.
These problems occur for the following reason. In FIG. 41B and FIG. 41C, a circumscribed rectangle formed from a parenthesis has a vertically elongated shape. The horizontally oriented lines within the parentheses are therefore hidden in the projection pattern, that is, they are not recognized as separate lines, and the parenthesized portion including those lines is analyzed as one line.
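The effect described above can be reproduced with a small sketch. The code below is an illustrative assumption, not the conventional device itself: the helper `line_bands`, the rectangle format (left, top, right, bottom, with right and bottom exclusive), and all coordinates are hypothetical. It projects the rectangles onto the vertical axis and reports each run of covered rows as one line, first without and then with tall parenthesis rectangles.

```python
def line_bands(rects, height):
    """Project rectangles onto the vertical axis and return covered row bands."""
    covered = [False] * height
    for _, top, _, bottom in rects:
        for y in range(max(top, 0), min(bottom, height)):
            covered[y] = True
    bands, start = [], None
    for y in range(height + 1):          # one extra step closes a trailing band
        on = y < height and covered[y]
        if on and start is None:
            start = y
        elif not on and start is not None:
            bands.append((start, y))
            start = None
    return bands


# Two short lines of characters that sit between a pair of parentheses.
inner_lines = [(20, 0, 30, 10), (32, 0, 42, 10),
               (20, 15, 30, 25), (32, 15, 42, 25)]
# Vertically elongated rectangles for the opening and closing parentheses.
open_paren = (10, 0, 16, 25)
close_paren = (46, 0, 52, 25)

without_parens = line_bands(inner_lines, 30)
with_parens = line_bands(inner_lines + [open_paren, close_paren], 30)
print(without_parens)  # [(0, 10), (15, 25)]
print(with_parens)     # [(0, 25)]
```

Once the parenthesis rectangles are included, they cover the gap between the two inner lines in the projection, so the sketch reports a single band where two lines actually exist.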