A conventional document image recognition algorithm is shown in the flowchart of FIG. 1A, and FIG. 1B shows an exemplary arrangement of a conventional document image recognition apparatus. First, in S101, means 112 divides a character image, input by the input means 111 (for example, by scanning), into character image lines. In S102, means 113 splits the characters in each image line from one another. Means 114 extracts the features of each split character, then matches and recognizes the characters. In S105, output means 115 outputs the results of recognition. In such a method for document image recognition, the accuracy with which the image is divided into lines directly influences the accuracy of the final results of character recognition.
The conventional character image line dividing algorithm is shown in the flowchart of FIG. 2. First, in step S201, an input document image is divided in the horizontal direction into several image segments of a certain width (for example, a width of 400 pixels). In step S202, the number of black pixels contained in each pixel-row of each image segment (each pixel-row being 400 pixels wide in this example) is calculated and recorded. In step S203, each image segment is divided, in the vertical direction, into a plurality of segment blocks at the blank pixel-rows (i.e., the pixel-rows in which the number of black pixels is 0) of the image segment, and information about the segment blocks, for example the width, height and position, is recorded. In step S204, the average height of the segment blocks and similar values are calculated as the standard for further dividing over-large segment blocks and merging over-small segment blocks. In step S205, the over-large segment blocks are further divided according to the average height of the segment blocks. In step S206, the segment blocks are checked, and the over-small segment blocks are merged into adjacent segment blocks. In step S207, the segment blocks are integrated into image lines according to the positions of the segment blocks.
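For illustration only, steps S201 to S204 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular apparatus: the image is assumed to be a list of pixel-rows in which 1 denotes a black pixel and 0 a white pixel, and the function names (split_into_segments, row_black_counts, split_into_blocks, average_height) are hypothetical names introduced here.

```python
from typing import List, Tuple

# Assumed representation: a list of pixel-rows, 1 = black pixel, 0 = white pixel.
Image = List[List[int]]


def split_into_segments(image: Image, seg_width: int) -> List[Image]:
    """S201: cut the image into vertical strips at most seg_width columns wide."""
    width = len(image[0]) if image else 0
    return [[row[x:x + seg_width] for row in image]
            for x in range(0, width, seg_width)]


def row_black_counts(segment: Image) -> List[int]:
    """S202: count the black pixels in each pixel-row of one segment."""
    return [sum(row) for row in segment]


def split_into_blocks(counts: List[int]) -> List[Tuple[int, int]]:
    """S203: return (top, height) for each run of non-blank pixel-rows."""
    blocks = []
    start = None  # top row of the block currently being scanned, if any
    for y, count in enumerate(counts):
        if count > 0 and start is None:
            start = y                       # a block begins
        elif count == 0 and start is not None:
            blocks.append((start, y - start))  # a blank row ends the block
            start = None
    if start is not None:                   # block reaching the last row
        blocks.append((start, len(counts) - start))
    return blocks


def average_height(blocks: List[Tuple[int, int]]) -> float:
    """S204: average height of the segment blocks (0.0 if there are none)."""
    return sum(h for _, h in blocks) / len(blocks) if blocks else 0.0


# Toy 6 x 4 image, divided into two segments 2 pixels wide.
image = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]
for segment in split_into_segments(image, seg_width=2):
    counts = row_black_counts(segment)
    print(counts, split_into_blocks(counts))
```

In this sketch only the top position and height of each block are recorded; the width and horizontal position mentioned in step S203 follow from the index of the segment and the chosen segment width.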
For example, the document image in FIG. 3 can be divided into two image segments in the direction of width. With respect to the first image segment, the distribution statistics of the black pixels in each pixel-row of the segment are shown in FIG. 4, wherein the abscissa represents the pixel-rows in the segment and the ordinate represents the number of black pixels in the respective pixel-row. As to the second image segment, the distribution statistics of the black pixels in each pixel-row are shown in FIG. 5.
If the character image in FIG. 3 is divided using the conventional algorithm (see FIG. 2), the two segments are first divided into a plurality of segment blocks at the blank pixel-rows, in which the number of black pixels is 0, by using the distribution statistics of the black pixels in each pixel-row (see FIGS. 4 and 5). Then the average height of the segment blocks is calculated and used as the standard for further dividing the segment blocks. The over-large segment blocks in each segment, whose heights exceed the average height of the segment blocks by a predetermined extent, are further divided according to the peak-valley relation in the graph of the distribution statistics of black pixels in the segment concerned. The segment blocks in each segment whose heights fall below the average height of the segment blocks by a predetermined extent are merged into adjacent segment blocks. However, the average height of the segment blocks is calculated only once, and is not re-calculated after an over-large segment block has been further divided, which is obviously unreasonable. As a result, segment blocks that actually need to be further divided, but whose heights do not reach the standard for division, are passed on to the later procedure (the procedure of splitting the image lines into characters) as reasonable segment blocks, and recognition errors thereby occur.
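The effect of calculating the average height only once can be illustrated with a small numerical sketch in Python. All values here are hypothetical and are not taken from FIG. 3: the block heights, the threshold ratio of 1.5 standing in for the "predetermined extent", and the assumption that the peak-valley step splits a 200-row block into ten 20-row blocks are chosen only to make the arithmetic easy to follow.

```python
def over_large(heights, ratio=1.5):
    """Return the heights exceeding ratio * average height.

    `ratio` is a hypothetical stand-in for the "predetermined extent".
    """
    average = sum(heights) / len(heights)
    return [h for h in heights if h > ratio * average]


# Hypothetical block heights: three single lines (20 rows each), one block of
# ten merged lines (200 rows) and one block of three merged lines (60 rows).
heights = [20, 20, 20, 200, 60]

# Single pass: average = 320 / 5 = 64, threshold = 96.
# Only the 200-row block is flagged; the 60-row block is treated as reasonable.
first_pass = over_large(heights)

# Assume the 200-row block has been divided into ten 20-row blocks.
after_split = [20, 20, 20] + [20] * 10 + [60]

# With the average re-calculated: 320 / 14 (about 22.9), threshold about 34.3.
# The 60-row block is now flagged as over-large as well.
second_pass = over_large(after_split)

print(first_pass)   # [200]
print(second_pass)  # [60]
```

With the single-pass average, the very tall block inflates the standard, so the moderately tall block escapes division; it is flagged only when the average is calculated again after the first division.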
By dividing the document image in FIG. 3 into image lines according to the flowchart shown in FIG. 2, the result of character recognition is as follows:
The original result: −,′i.,gl″# csa&!sli, tllgiertEwide,i& ..,′,;sild t Ab,.ff& ′W.
Thus it can be seen that, because of the errors in line dividing, the original 21 lines of effective text are divided into only 8 lines. Moreover, due to the errors in the positions and sizes of the image lines, the recognition result is very poor.