1. Field of the Invention
The present invention relates to a character recognition method and apparatus for recognizing a character image on a paper sheet.
2. Description of the Related Art
A character recognition apparatus is suited to information processing apparatuses that convert a huge volume of printed documents into electronic documents or databases, or that perform document processing, automatic translation, and the like on such documents, and has been extensively studied and developed.
A conventional character recognition system for recognizing a character image printed or handwritten on a paper sheet normally comprises (1) document image input processing, (2) character extraction processing, (3) pre-processing (smoothing, normalization, thin-line conversion, and the like), (4) feature extraction processing, (5) rough classification processing, (6) fine classification processing, (7) post processing, and the like.
In such a character recognition system, character images on a paper sheet are read as an optical image, and the optical image is converted into an electrical signal. The character images read into the system are segmented into predetermined recognition units, e.g., individual characters, on the basis of, e.g., the histogram of the marginal distribution. Thereafter, the extracted characters are subjected to the pre-processing to allow efficient recognition. In the feature extraction processing, features of the input characters, such as topological features and features obtained in units of pixels divided into a mesh pattern, are extracted so that the recognition processing can be performed using a structure analysis method, a pattern matching method, or the like. The rough classification processing is used especially for, e.g., Kanji characters, which have a large number of character categories, and narrows down the candidate categories by a simple method. In the fine classification processing, more detailed recognition processing is performed on the narrowed-down candidates. Furthermore, in the post processing, when candidates cannot be determined by recognizing the input characters individually, neighboring input characters are combined and discriminated as a character string with reference to, e.g., a pertinent grammar.
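The extraction of recognition units from the histogram of the marginal distribution can be illustrated with a minimal sketch. It assumes a binary image given as a list of rows (1 for a black pixel, 0 for white) and cuts the line wherever a column contains no black pixels; the function names `column_histogram` and `segment_columns` and the `threshold` parameter are hypothetical and are not taken from any particular system.

```python
def column_histogram(image):
    """Count the black pixels (1s) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def segment_columns(image, threshold=0):
    """Return (start, end) column ranges, end exclusive, in which the
    column histogram exceeds the threshold."""
    hist = column_histogram(image)
    spans = []
    start = None
    for x, count in enumerate(hist):
        if count > threshold and start is None:
            start = x                      # a run of inked columns begins
        elif count <= threshold and start is not None:
            spans.append((start, x))       # the run ends at a blank column
            start = None
    if start is not None:                  # run reaches the right edge
        spans.append((start, len(hist)))
    return spans


# Hypothetical 3x7 text line: ink in columns 0-1, 4, and 6.
line = [
    [1, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1],
]
print(segment_columns(line))  # [(0, 2), (4, 5), (6, 7)]
```

In this sketch every blank column is treated as a boundary, so the strokes in columns 4 and 6 are returned as separate units even if they belong to a single character.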
The conventional character recognition system suffers from the following problems.
In the character extraction processing, when a plurality of discrete characters successively appear, extraction errors tend to occur. This is one of the major factors that determine the precision of character recognition. As an effective countermeasure against this problem, a method of improving character extraction precision by performing the extraction in association with recognition is known. However, this method requires a long time for recognition.
In the feature extraction processing, a character image which is normalized to a predetermined size is scanned in units of pixels (bits), and the feature vector (obtained by numerically expressing the features) of a character is extracted in consideration of the relationship between the scanned pixel and its neighboring pixels. Therefore, the feature vector obtained in this way is easily influenced by noise. Such a feature amount expresses the overall or total feature of each character but does not always reveal the outstanding feature of each character.
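The noise sensitivity of a pixel-based feature vector can be seen in a small sketch. The mesh-density feature used here is only one simple stand-in for the pixel-level features described above, and the function name `mesh_features`, the `cells` parameter, and the 4x4 images are hypothetical. The image is assumed to be square, binary, and of a size divisible by the number of cells.

```python
def mesh_features(image, cells=2):
    """Split a square binary image into cells x cells blocks and return
    the fraction of black pixels in each block, row by row."""
    size = len(image)
    step = size // cells  # assumes size is divisible by cells
    features = []
    for by in range(cells):
        for bx in range(cells):
            black = sum(
                image[y][x]
                for y in range(by * step, (by + 1) * step)
                for x in range(bx * step, (bx + 1) * step)
            )
            features.append(black / (step * step))
    return features


clean = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
noisy = [row[:] for row in clean]
noisy[3][3] = 1  # a single stray pixel in the lower-right block

print(mesh_features(clean))  # [1.0, 0.0, 0.0, 0.0]
print(mesh_features(noisy))  # [1.0, 0.0, 0.0, 0.25]
```

One stray pixel changes a component of the feature vector from 0.0 to 0.25, although the shape of the character itself is unchanged.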
In the character recognition used in the conventional rough classification or fine classification, pattern matching (a kind of distance calculation) is performed to measure the total degree of similarity between an unknown input character expressed by the feature vector and each standard pattern in a dictionary, and a proper number of candidate character types are selected in ascending order of distance. In such a conventional method, since the distance scale is the only criterion for the classification, the outstanding structural features of each character cannot be flexibly utilized in the process of the classification. For this reason, in the conventional method, an unknown input pattern must be compared with the standard patterns of all the character types. For example, when the number of character types is 5,000, 5,000 distance calculations are required. This drawback is the most serious obstacle to high-speed character recognition. Furthermore, it is difficult to check the validity of the classification or recognition results in the process of character recognition, and the checking and correction of recognition errors are left to the post processing, e.g., collation with a huge word dictionary. This drawback is also a serious obstacle to high-speed processing.
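The cost of distance-only classification can be sketched as follows. The squared Euclidean distance, the three-entry dictionary, and the names `squared_distance` and `classify_exhaustive` are assumptions chosen for illustration; the point is only that one distance calculation is made for every character type in the dictionary.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_exhaustive(unknown, dictionary, n_candidates=3):
    """Compare the unknown vector with every standard pattern and return
    the closest labels together with the number of distance calculations."""
    scored = [
        (squared_distance(unknown, pattern), label)
        for label, pattern in dictionary.items()
    ]
    scored.sort()  # ascending order of distance
    labels = [label for _, label in scored[:n_candidates]]
    return labels, len(scored)


# Hypothetical dictionary of standard patterns (one per character type).
dictionary = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0, 0.0],
    "C": [0.0, 0.0, 1.0, 1.0],
}
candidates, n_calcs = classify_exhaustive(
    [0.9, 0.1, 0.0, 0.0], dictionary, n_candidates=2
)
print(candidates, n_calcs)  # ['A', 'B'] 3
```

With 5,000 entries in the dictionary, the same loop would perform 5,000 distance calculations for each input character.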
On the other hand, Jpn. Pat. Appln. KOKAI Publication Nos. 63-15383 (pattern collation apparatus), 63-118993 (character recognition method), and 63-131287 (character recognition system) attempt to realize high-speed processing by performing the rough classification at high speed by a method which involves no distance calculations, and by performing distance calculations for only the small number of candidate character types that remain. However, in these methods, since a character image is scanned, and features are extracted by checking neighboring pixels in units of pixels (one dot) as the basic means, the extracted features are easily influenced by noise, and it is very difficult to narrow down the candidate character types with high precision using such a feature vector.
As described above, owing to the nature of the recognition processing used, the conventional character recognition method and apparatus are not satisfactory for demanding practical applications in terms of noise resistance, recognition speed, and recognition precision.