1. Field of the Invention
The present invention relates to an apparatus for recognizing printed characters.
2. Description of the Prior Art
To automate a typographic printing process in which the operator picks up type, it is necessary to employ a character recognition apparatus for recognizing characters on a typed or printed document and converting the recognized characters into character codes.
FIG. 1 of the accompanying drawings shows a conventional character recognition apparatus disclosed in Japanese Laid-Open Patent Publication No. 62-74181. As shown in FIG. 1, the conventional character recognition apparatus has a document reader 1 such as an image scanner which supplies an original character signal S1 that represents optical densities of one page of a document to a character line extractor 2. The optical densities of the document are expressed by dots spaced at a predetermined dot density, and the original character signal S1 is composed, for example, of values of "1" each indicating a black dot and values of "0" each indicating a white dot. Alternatively, the optical density of each of the dots may be expressed by a binary number composed of plural bits.
The character line extractor 2 comprises a first preprocessor 3, a second preprocessor 4, and a third preprocessor 5. The first preprocessor 3 processes the original character signal S1 to remove noise therefrom and also to correct the document as represented by the original character signal S1 out of any rotated condition. The second preprocessor 4 separates a character area AR from other areas that contain photographs, graphic patterns, etc. in the original character signal S1, and extracts only image data contained in the character area AR. The third preprocessor 5 extracts character line signals S4 corresponding respectively to character lines AR1, AR2, . . . contained in the separated character area AR.
The character line signals S4 are extracted as follows: As shown in FIG. 2, the positions of respective dots in the character area AR are expressed according to an X-Y coordinate system having a horizontal X-axis and a vertical Y-axis. The values of "1" or "0" of the respective dots are projected onto the Y-axis and added into sums representing Y-axis projected signals Sy. The Y-axis projected signals Sy are converted into respective binary signals using a predetermined threshold value, and the intervals having the value "1" according to the binary signals correspond to the respective character lines AR1, AR2, . . . .
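The Y-axis projection described above can be illustrated with a minimal Python sketch. The function names, the list-of-rows representation of the character area AR, and the "greater than or equal to" comparison against the threshold are assumptions made for illustration only and are not taken from the cited publication.

```python
def runs_of_ones(binary):
    """Return (start, end) index pairs, end exclusive, for each run of 1s."""
    runs, start = [], None
    for i, value in enumerate(binary):
        if value and start is None:
            start = i                      # a run of 1s begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ended at the previous index
            start = None
    if start is not None:                  # a run reaches the end of the signal
        runs.append((start, len(binary)))
    return runs


def extract_character_lines(area, threshold=1):
    """area: list of rows of 0/1 dots. Returns the row range of each character line."""
    # Y-axis projected signal Sy: number of black dots in each row.
    sy = [sum(row) for row in area]
    # Binarize Sy with the (assumed) threshold comparison.
    binary = [1 if s >= threshold else 0 for s in sy]
    return runs_of_ones(binary)


# Hypothetical 6-row character area containing two character lines.
area = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
lines = extract_character_lines(area)
print(lines)
```

In this toy area, rows 1 to 2 and row 4 contain black dots, so two intervals are reported, one for each character line.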
The character line signals S4 from the character line extractor 2 are then supplied to a character extractor 6. In the character extractor 6, the character line signal S4 of an ith character line ARi (see FIG. 3A), for example, is projected onto the X-axis, with the values of "1" or "0" thereof being added to form a sum which represents an X-axis projected signal Sx, as shown in FIG. 3B. Then, the character extractor 6 converts the X-axis projected signal Sx into a roughly extracted signal DT1 (see FIG. 3C) with a threshold value TH1 having a minimum level (i.e., a value of 1) as shown in FIG. 3B, and also converts the X-axis projected signal Sx into a finely extracted signal DT2 (see FIG. 3E) with a threshold value TH2 having a medium level as shown in FIG. 3D. An extracted signal in the Y-axis direction can be generated by generating a Y-axis projected signal Sy in each of the intervals of the roughly extracted signal DT1 that have the value "1".
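The two-threshold conversion of the X-axis projected signal Sx can be sketched as follows. The sample dot pattern, the function names, and the choice of 2 as the "medium level" for TH2 are assumptions for illustration; the text fixes only TH1 at the value 1.

```python
def project_onto_x(line):
    """X-axis projected signal Sx: number of black dots in each column."""
    return [sum(column) for column in zip(*line)]


def binarize(signal, threshold):
    """Return 1 where the projected value reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in signal]


# Hypothetical character line of 3 rows by 7 columns.
line = [
    [1, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1],
]

TH1 = 1  # minimum level, as stated in the text
TH2 = 2  # "medium level"; the value 2 is an assumption for this example

sx = project_onto_x(line)
dt1 = binarize(sx, TH1)   # roughly extracted signal DT1
dt2 = binarize(sx, TH2)   # finely extracted signal DT2
print(sx, dt1, dt2)
```

With these values, DT1 merges columns 4 to 6 into one interval, whereas DT2 drops the thinly populated column 5 and therefore splits the same span into finer intervals.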
Finally, the character extractor 6 produces a signal which has a value of "1" within a circumscribed frame 9A (FIG. 3A) of a Japanese hiragana character " ", for example, and which has a value of "1" within circumscribed frames 9B, 9C of the separated elements of a separate Japanese hiragana character " ", for example. The character extractor 6 produces a succession of such signal values of "1" from the character line signal S4 and outputs the same as a basic rectangular extracted character signal S7.
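A circumscribed frame of the kind described above can be derived by combining the X-axis intervals of the roughly extracted signal with a Y-axis projection taken inside each interval. The following is a minimal sketch under that reading; the function names and the (x0, y0, x1, y1) frame layout with exclusive end coordinates are assumptions for illustration.

```python
def runs_of_ones(binary):
    """Return (start, end) index pairs, end exclusive, for each run of 1s."""
    runs, start = [], None
    for i, value in enumerate(binary):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(binary)))
    return runs


def circumscribed_frames(line):
    """Return one (x0, y0, x1, y1) frame per horizontally separated element."""
    # X-axis projection binarized at the minimum level 1 (rough extraction).
    sx = [sum(column) for column in zip(*line)]
    dt1 = [1 if s >= 1 else 0 for s in sx]
    frames = []
    for x0, x1 in runs_of_ones(dt1):
        # Y-axis projection restricted to this X interval.
        sy = [sum(row[x0:x1]) for row in line]
        rows = [y for y, s in enumerate(sy) if s >= 1]
        frames.append((x0, rows[0], x1, rows[-1] + 1))
    return frames


# Hypothetical character line containing two separated elements.
line = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
frames = circumscribed_frames(line)
print(frames)
```

Two elements separated by a blank column receive two frames in this sketch, which mirrors the situation of the frames 9B and 9C obtained for a single separate character.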
The finely extracted signal DT2 shown in FIG. 3E is used when the structures of the respective characters are to be analyzed in detail. The separate character " " shown in FIG. 3A needs to be integrated in a subsequent character recognition process because the character has two circumscribed frames 9B, 9C of its elements. As shown in FIG. 1, the basic rectangular extracted character signal S7 is supplied from the character extractor 6 to a character recognition unit 7 which reads each of the circumscribed frames of the basic rectangular extracted character signal S7 for character recognition. More specifically, essential features of the dot patterns of respective characters are extracted to classify the dot patterns into groups. Then, pattern matching is effected on the dot patterns within the groups to determine characters most analogous to characters to be recognized, and the characters to be recognized are allotted the character codes of those characters that have been determined as most analogous.
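The two-stage recognition described above, grouping by extracted features followed by pattern matching within the group, can be illustrated by the following sketch. The text does not specify the features or the matching measure, so the coarse feature (which half of the pattern holds more black dots), the Hamming distance, the 3 by 3 templates, and the character codes used here are all hypothetical.

```python
def coarse_feature(pattern):
    """Hypothetical grouping feature: which half of the pattern is darker."""
    half = len(pattern) // 2
    top = sum(sum(row) for row in pattern[:half])
    bottom = sum(sum(row) for row in pattern[half:])
    if top > bottom:
        return "top"
    if bottom > top:
        return "bottom"
    return "even"


def hamming(a, b):
    """Number of dot positions at which two equally sized patterns differ."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def recognize(pattern, templates):
    """Return the code of the most analogous template in the same group, or None."""
    group = coarse_feature(pattern)
    candidates = [(code, t) for code, t in templates.items()
                  if coarse_feature(t) == group]
    if not candidates:
        return None
    return min(candidates, key=lambda item: hamming(pattern, item[1]))[0]


# Hypothetical template dictionary keyed by character code.
templates = {
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
    "7": [[1, 1, 1], [0, 0, 1], [0, 0, 1]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

# A "T" with one dot missing from its stem.
noisy = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]
print(recognize(noisy, templates))
```

Grouping first restricts the comparatively expensive dot-by-dot matching to the templates that share the coarse feature, which is the apparent purpose of the classification step.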
The character codes for one page of the document which are generated by the character recognition unit 7 are stored in a memory together with information representing the positions and sizes of the characters. The recognized characters are displayed in a format corresponding to the document on a display unit 8 so that the operator can determine whether the recognized result is correct or not.
It is difficult for the operator to place documents in the document reader 1 without the documents sometimes being inclined with respect to the document reader 1. Especially when a thick magazine or the like is placed in the document reader 1, the dot patterns produced by the document reader 1 tend to be inclined, and the character lines cannot be extracted accurately. When a document is inclined with respect to the document reader 1, as shown in FIG. 4, the values of dot patterns projected in the X-axis direction are substantially constant, and the character lines cannot be separated from each other.
A critical angle can be defined as an angle θ at which the character lines are inclined with respect to the X-axis, beyond which the character lines can no longer be extracted. The critical angle θ is given as follows:
$$W \tan\theta = d \qquad (1)$$
where W is the length of character lines on a document, and d is the distance between adjacent character lines. This equation (1) indicates that if the distance d is small or the length W is large due to some additional reference characters or marks, then the character lines cannot be extracted even when they are slightly inclined.
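As a numerical illustration of equation (1), the critical angle can be obtained by solving for the angle, i.e., as the arctangent of d/W. The function name and the sample dimensions below are assumptions chosen for the example.

```python
import math


def critical_angle_deg(line_length, line_spacing):
    """Solve W * tan(theta) = d for theta, returned in degrees.

    line_length is W and line_spacing is d, both in the same unit.
    """
    return math.degrees(math.atan(line_spacing / line_length))


# Hypothetical page: character lines 150 mm long, spaced 3 mm apart.
theta = critical_angle_deg(150.0, 3.0)
print(f"critical angle: {theta:.2f} degrees")

# Doubling the line length W lowers the critical angle.
theta_long = critical_angle_deg(300.0, 3.0)
print(f"critical angle for longer lines: {theta_long:.2f} degrees")
```

For these assumed dimensions the critical angle is only about 1.15 degrees, which shows how little inclination is needed before line extraction by projection fails.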
In this connection, the first preprocessor 3 shown in FIG. 1 corrects the document out of any rotated condition through calculations effected on an angle, detected by a sensor, through which the document is inclined. The correcting process, however, is time-consuming and employs a complex correcting mechanism. The article "Document Typing System (2)" written by Mariko Takenouchi et al. from collected preprints of 1986 General National Convention of Electronic Communication Society, pages 6-153 (1986), discloses a character extracting algorithm for dividing a document recognition area into subareas and extracting character lines in each of the subareas. However, the article fails to show an efficient method of integrating the extracted character lines in the subareas.