Prior to optical character recognition (OCR), an OCR unit recognizes frames in document image data that includes frames in certain forms. For example, the OCR unit generally recognizes the frames at their predetermined positions with respect to the document image data. Similarly, a mark recognition unit also relies upon predetermined positional information of the mark which indicates the frames in the document image data. Japanese Laid-Open Patent Publications Hei 11-66225 and Hei 9-138837 disclose recognition techniques that determine horizontal and vertical ruled lines by comparing black pixel runs to a predetermined threshold value and that extract an area enclosed by four such lines as a frame.
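The run-length comparison described above can be illustrated with a minimal sketch. It is not the method of the cited publications; the function name `horizontal_runs`, the parameter `min_len`, and the convention that a black pixel has the value 1 are all assumptions made for illustration.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (row, start column, length); hypothetical layout


def horizontal_runs(image: List[List[int]], min_len: int) -> List[Run]:
    """Return runs of consecutive black pixels (value 1) at least min_len long."""
    runs: List[Run] = []
    for y, row in enumerate(image):
        start = None
        # A trailing white pixel acts as a sentinel that closes a run at the row end.
        for x, pixel in enumerate(row + [0]):
            if pixel == 1 and start is None:
                start = x
            elif pixel != 1 and start is not None:
                if x - start >= min_len:
                    runs.append((y, start, x - start))
                start = None
    return runs


# A 6x5 frame that encloses a short two-pixel stroke.
image = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]

lines = horizontal_runs(image, min_len=5)

# Vertical ruled lines can be found the same way on the transposed image.
transposed = [list(column) for column in zip(*image)]
vertical_lines = horizontal_runs(transposed, min_len=5)
```

In this sketch only the top and bottom sides of the frame pass the threshold as horizontal candidates, and the short stroke inside the frame is rejected. A frame would then be taken as the area enclosed by two horizontal and two vertical candidates.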
Most of the frames recognized in OCR are arranged in two-dimensional arrays. FIG. 1 illustrates frames that are arranged in a table format. The frames extend in both the X and Y directions, and at least two of the four sides of each frame touch those of adjacent frames. FIG. 2 illustrates frames that are arranged in a ladder format. The frames extend only in the X direction, and one or two sides of each frame touch those of adjacent frames. FIG. 3 illustrates frames that are arranged in an independent format. The frames extend only in the X direction, and no sides touch those of adjacent frames. The frames shown in FIGS. 1 and 2 are defined as complex frames, while those shown in FIG. 3 are defined as simple frames.
In general, complex frames are more readily recognizable by OCR than simple frames. Even when the complex frames are small in size, the ruled lines of each frame are sufficiently longer than the corresponding character size, so the above prior art OCR techniques recognize the frames. On the other hand, since simple frames such as check boxes or single-character boxes are generally equal to or smaller than the corresponding character size, it is difficult to recognize these simple frames based upon prior art recognition techniques. In the prior art recognition techniques, the length of continuous black pixels is compared to a predetermined value in order to extract ruled lines. When the predetermined value is lowered in an attempt to accommodate smaller frames, ruled line candidates are erroneously extracted from character regions and the accuracy is undesirably decreased.
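The trade-off described above can be shown with a small sketch. It assumes a toy binary image in which a black pixel has the value 1; the helper names `run_lengths` and `count_candidates` and the threshold values are hypothetical and are not taken from the prior art.

```python
def run_lengths(row):
    """Return the lengths of consecutive black-pixel (value 1) runs in one row."""
    lengths, count = [], 0
    for pixel in row:
        if pixel == 1:
            count += 1
        elif count:
            lengths.append(count)
            count = 0
    if count:
        lengths.append(count)
    return lengths


def count_candidates(image, min_len):
    """Count horizontal runs whose length reaches the assumed threshold min_len."""
    return sum(1 for row in image for n in run_lengths(row) if n >= min_len)


# Left: a check box 4 pixels wide. Right: an "E"-like character whose
# horizontal strokes are also 4 pixels wide.
image = [
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
]

high_threshold_hits = count_candidates(image, min_len=6)
low_threshold_hits = count_candidates(image, min_len=4)
```

With the higher threshold, neither the check box nor the character yields a candidate, so the box is missed. With the lower threshold, the two sides of the box are found, but the three strokes of the character are extracted as candidates as well.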
Furthermore, the ruled lines used in the above formats are generally thin. When these formats are scanned by a scanner, the ruled lines are sometimes faded. In particular, when lines that are printed in a light color are scanned by a black-and-white scanner, the ruled lines are frequently faded. Setting the scanning sensitivity at a high level to compensate for the fading is not useful, since input characters tend to be incorrectly scanned when the scanner is adjusted to read the thin or light ruled lines in the above formats.
For the reasons described above, it is desirable to provide a frame recognition technique that recognizes with high precision a single-letter frame or a check box that is approximately equal to or smaller than the size of the corresponding characters. It is also desirable to provide a frame recognition technique that recognizes with high precision a frame with faded frame lines.