1. Field of the Invention
This invention relates to a character recognition apparatus, more specifically to a character extraction apparatus for extracting characters from a text, a dictionary production apparatus for producing dictionaries used in character recognition, and a character recognition apparatus using both apparatuses.
2. Description of the Prior Art
Recently, demand for character recognition apparatuses has grown, and many such apparatuses have been brought onto the market. However, these conventional devices have been unable to meet the increasing demands for character recognition accuracy.
There are two major conventional methods for extracting characters from text images that have been input through photoelectric conversion: a method using histograms of character rows, and a method of extracting arrays of black pixels.
A first character extraction apparatus with the histogram method comprises a character row extracting means, a histogram extracting means, and a character extracting means. The character row extracting means extracts position data for each character row from a text image. The histogram extracting means vertically and horizontally (character row direction) scans an area specified by the position data and its surroundings, and counts the number of black pixels for each scanning line. Thus, data for a histogram of black pixels is obtained for each character row area. The character extracting means, based on the obtained histogram data and using an assumed average character width, extracts characters from the character row.
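The histogram step described above can be illustrated with a minimal Python sketch. It assumes a character row is given as a binary grid in which 1 is a black pixel, and it splits the row wherever a scanning line holds no black pixels. The function names are hypothetical, and the refinement using an assumed average character width is omitted.

```python
def column_histogram(image):
    """Count black pixels (1s) on each vertical scanning line of a row image."""
    return [sum(column) for column in zip(*image)]


def split_on_gaps(hist):
    """Return (start, end) column spans where the histogram is non-zero."""
    spans, start = [], None
    for x, count in enumerate(hist):
        if count and start is None:
            start = x                  # a run of black columns begins
        elif not count and start is not None:
            spans.append((start, x))   # the run ended at an empty column
            start = None
    if start is not None:
        spans.append((start, len(hist)))
    return spans


# Hypothetical 3 x 8 character row area containing three separated shapes.
row = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 0, 1],
]
hist = column_histogram(row)
spans = split_on_gaps(hist)
```

Because the split relies on empty scanning lines, two characters whose columns overlap would fall into a single span in this sketch.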
A second character extraction apparatus, which uses the method of extracting arrays of black pixels, comprises a character row extracting means, an array extracting means, and a character extracting means. The character row extracting means extracts position data of character rows from a text image. The array extracting means extracts arrays of black pixels by attaching labels to the arrays. More specifically, the array extracting means scans a character row area, extracts arrays of black pixels in which black pixels are connected to each other in the four vertical and horizontal directions, or in eight directions including the diagonal directions, and then assigns the same label to every pixel that belongs to the same array. The character extracting means assigns respective labels to the arrays (Hideyuki Tamura, "Guide to Image Processing on Computer", Soken Shuppan, page 75). The character extracting means segments the character row into characters, each identified by a label with its respective data.
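The labeling of connected arrays of black pixels can be sketched as follows. This is a generic breadth-first labeling written for illustration, not the procedure of the cited reference; the function name and the `eight_connected` parameter are assumptions.

```python
from collections import deque


def label_components(image, eight_connected=False):
    """Give one integer label to every connected array of black pixels (1s)."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]            # four directions
    if eight_connected:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]      # add diagonals
    next_label = 0
    for y in range(h):
        for x in range(w):
            if image[y][x] != 1 or labels[y][x]:
                continue
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    if image[ny][nx] == 1 and not labels[ny][nx]:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels, next_label


# An "i"-like shape: the dot and the stem are not connected.
glyph = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]
labels, count = label_components(glyph)
```

In this example the dot and the stem receive different labels, so a single character is reported as two arrays.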
In either of the above apparatuses, a character recognition means receives data for each character, and uses the data to recognize characters.
However, the first character extraction apparatus does not correctly extract characters if there are "gearing" characters or if character widths are not constant.
A "gearing" is an overlap of the circumscribed rectangles of adjacent characters. For example, in the word "modifying", horizontal gearing may occur between the characters "f" and "y". Such overlap may also occur with italic characters. Vertical gearing may occur with characters such as "g" and "p", which have long lower parts. As for differences in character width, characters such as "i" and "l" are narrower than "a", for example.
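As a small illustration of horizontal gearing, the following Python sketch tests whether the horizontal extents of two circumscribed rectangles overlap. The representation of an extent as a half-open (left, right) pair and the coordinate values are assumptions made for the example.

```python
def horizontally_geared(a, b):
    """True when the horizontal extents (left, right) of two rectangles overlap."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical extents: "f" reaches over the start of "y".
f_box = (10, 18)
y_box = (16, 24)
geared = horizontally_geared(f_box, y_box)
separate = horizontally_geared((0, 5), (5, 9))
```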
Although the second character extraction apparatus was devised to overcome the defects of the marginal histogram method, it sometimes fails to extract characters such as "i" and "j", German characters with umlaut marks, or "broken" characters that are missing some part because of an obscure text image, since these characters are divided into several parts.
Generally, a character recognition apparatus extracts a set of pieces of feature data, referred to hereinafter as a "feature", from the character data of each recognition object, and identifies a character by comparing the feature with the standard features of character models stored in a built-in dictionary and selecting the most similar one.
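The comparison against the dictionary can be sketched in Python as a nearest-neighbor search. The use of Euclidean distance as the similarity measure, the function name, and the three-element feature values are assumptions chosen for illustration.

```python
import math


def recognize(feature, dictionary):
    """Return the character model whose standard feature is nearest to the input."""
    return min(dictionary, key=lambda ch: math.dist(feature, dictionary[ch]))


# Hypothetical dictionary: one standard feature per character model.
dictionary = {"a": [0.8, 0.2, 0.5], "i": [0.1, 0.9, 0.3]}
result = recognize([0.7, 0.3, 0.4], dictionary)
```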
The dictionary is made by a dictionary production apparatus. The dictionary production apparatus obtains average values of features from a large number of handwritten characters and printed characters of different types for each character model, and stores the average values as standard features in the dictionary.
However, such a dictionary is not sufficient to recognize handwritten characters, which vary greatly from one writing to the next even for the same writer, or multi-font characters, for which a plurality of shapes and sizes are available in printing. That is, the distribution of features is too complicated to deal with.
A dictionary generating apparatus using cluster analysis has been proposed to overcome this defect. Cluster analysis is a known general method for classifying different target objects, whether physical or numerical, into clusters based on a defined similarity ("Methods of Multivariate Statistical Analysis", Yutaka Tanaka and Kazumasa Wakimoto, Gendai-sugaku-sha, pp. 230 to 244). The apparatus makes clusters of feature data values for each character. Average feature data values are obtained for each cluster and stored in a dictionary. As the number of clusters for a character increases, the recognition accuracy increases, but the dictionary capacity increases as well.
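A dictionary built from per-character clusters can be sketched as follows. Plain k-means is used here only as one well-known clustering procedure; it is not stated to be the method of the proposed apparatus. Seeding the clusters with the first k samples, the function name, and the sample values are assumptions.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the clusters (an assumption)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster received no points
                centers[i] = [sum(axis) / len(group) for axis in zip(*group)]
    return centers


# Hypothetical two-dimensional features for one character in two styles.
samples = {"a": [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]}

# The dictionary stores the average feature of each cluster per character.
dictionary = {ch: kmeans(feats, k=2) for ch, feats in samples.items()}
```

Each entry of the dictionary holds k average features, so both the stored data and the number of distance computations at recognition time grow with k.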
However, for efficient use of storage devices and shorter execution time in recognizing characters, the fewer the clusters, the better. This is because the character recognition apparatus computes distances between features and the average features of the clusters. Therefore, as the number of clusters stored in the dictionary increases, the time taken for character recognition increases.
The above conditions, being contradictory to each other, determine the optimum number of clusters, that is, the minimum number of clusters that realizes the maximum recognition ratio. However, to obtain the optimum number of clusters for a character, the distribution of all the features of all the characters must be grasped, because the number of clusters is affected by the relation between features of different characters as well as by the distribution of features of the same character.
However, grasping the distribution of all the features of all the characters and determining the optimum number of clusters for each character is not practical, because a vast number of calculations are required for the vast number of source characters used to make a dictionary.
A first prior-art apparatus with cluster analysis, therefore, sets the same number of clusters for each character.
A second prior-art apparatus with cluster analysis (Japanese Laid-Open Patent Application No.1-36388) makes a dictionary by setting the same number of clusters, then adjusts the number of clusters after experimental character recognition with the dictionary.
A third prior-art apparatus with cluster analysis (Japanese Patent Publication No.5-082628) discloses a feature obtained for a cluster. The apparatus obtains a circumscribed rectangle of each character image, and divides the circumscribed rectangle into blocks by dividing it into L pieces horizontally and M pieces vertically. The apparatus assigns a direction value to each boundary pixel of the character image based on a direction to an adjacent boundary pixel. The apparatus obtains an outline direction density of each block by counting points of direction values in the block. The apparatus divides the circumscribed rectangle into other blocks by dividing it into P pieces horizontally and Q pieces vertically. Then, the apparatus obtains a background value by scanning the circumscribed rectangle from one side to the opposite side incrementing a value for each encounter with a black pixel, and obtains background density of each block by counting pixels of background values in the block.
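The background density part of this feature can be sketched in Python under one possible reading of the description: each row is scanned from left to right, a counter is incremented at each new encounter with black pixels, every white pixel takes the current counter as its background value, and the pixels having a given background value are counted per block. This reading, the function names, and the `target` parameter are assumptions; the cited publication may define the values differently.

```python
def background_values(image):
    """Scan each row left to right; white pixels take the number of black runs met so far."""
    values = []
    for row in image:
        count, prev, out = 0, 0, []
        for pixel in row:
            if pixel == 1 and prev == 0:
                count += 1                      # a new encounter with black pixels
            out.append(count if pixel == 0 else None)
            prev = pixel
        values.append(out)
    return values


def block_density(values, p, q, target):
    """Count pixels with background value `target` in each of the p x q blocks."""
    h, w = len(values), len(values[0])
    density = [[0] * p for _ in range(q)]       # q block rows, p block columns
    for y in range(h):
        for x in range(w):
            if values[y][x] == target:
                density[y * q // h][x * p // w] += 1
    return density


# Hypothetical 2 x 4 circumscribed rectangle (1 is a black pixel).
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
]
values = background_values(image)
density_1 = block_density(values, p=2, q=1, target=1)
```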
A fourth prior-art apparatus with cluster analysis (Japanese Laid-Open Patent Application No.5-128307) provides a method for dealing with broken characters and connected characters. The apparatus recognizes extracted characters as independent characters. The apparatus also links independent characters in every combination as long as the width of the linked characters does not exceed a predetermined width. The apparatus then evaluates both the independent characters and the linked characters, compares them, selects the one with the highest evaluation, and identifies the character.
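The linking step can be sketched in Python as an enumeration of candidate groupings. The sketch assumes that the extracted pieces are sorted from left to right, that each is given as a (left, right) extent, and that only runs of neighboring pieces are linked; the function name and the values are hypothetical.

```python
def link_candidates(fragments, max_width):
    """List every run (i, j) of neighboring fragments whose linked width fits max_width."""
    candidates = []
    for i in range(len(fragments)):
        for j in range(i, len(fragments)):
            width = fragments[j][1] - fragments[i][0]
            if width > max_width:
                break                      # longer runs from i only get wider
            candidates.append((i, j))
    return candidates


# Hypothetical extents of three extracted pieces in one character row.
fragments = [(0, 4), (5, 9), (10, 20)]
candidates = link_candidates(fragments, max_width=10)
```

Every candidate in the list must then be evaluated by the recognizer, so the number of evaluations grows with the number of pieces that fit within the predetermined width.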
Furthermore, a first prior-art character recognition apparatus provides a method for discerning pictures and drawings. The apparatus judges whether each "fragment" of a picture or drawing is a character.
However, these prior-art apparatuses have problems to be solved. The first prior-art apparatus has a low character recognition ratio if the number of clusters per character is small. Therefore, a large number of clusters is required to increase the recognition ratio. However, the more clusters there are, the more unnecessary computations are performed. Furthermore, even if the number of clusters is increased, the recognition ratio does not reach a satisfactory level.
The second prior-art apparatus has a problem in discerning a character from another with almost the same shape, such as "0" and "O". Therefore, the recognition ratio does not increase even if the number of clusters is increased.
The problem of the third prior-art apparatus is that it recognizes characters with different shapes as the same character if they have the same pixel succession in the outlines, such as "·" and "|".
The fourth prior-art apparatus takes a lot of time to recognize characters because it links independent characters in every combination as long as the width of the linked characters does not exceed a predetermined width.
The first prior-art character recognition apparatus may recognize a fragment of a picture or a drawing as a character if the shape of the fragment happens to be the same as that of the character.
A second prior-art character recognition apparatus (Japanese Laid-Open Patent Application No.63-216189) provides a method for discerning a character from another with almost the same shape, such as "0" and "O". It is difficult to recognize such characters just by comparing them with patterns in a dictionary for matching. A recognizing unit recognizes extracted characters. Then, a recognition controlling unit obtains a threshold value for the character height from differences between the uppermost positions and the lowest positions of words. The recognition controlling unit attaches a label to every character by comparing the character height with the threshold value to indicate a capital letter or a small letter. The recognition controlling unit, based on the labels, corrects characters recognized by the recognizing unit. For example, if the recognizing unit recognizes a character as "O", and the recognition controlling unit judges it to be "o", the recognition controlling unit corrects the character from "O" to "o".
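The height-based correction can be sketched in Python as follows. The threshold is taken here as a fixed ratio of the word height, which is an assumption, as are the ratio of 0.8, the function names, and the single-entry table of similar shapes.

```python
# Hypothetical table of capital letters and their similarly shaped small letters.
SIMILAR = {"O": "o"}


def label_case(char_heights, word_height, ratio=0.8):
    """Label each character 'capital' or 'small' by comparing its height with a threshold."""
    threshold = word_height * ratio        # assumed rule for deriving the threshold
    return ["capital" if h >= threshold else "small" for h in char_heights]


def correct_case(recognized, labels):
    """Replace a recognized capital by its small form when the label says 'small'."""
    out = []
    for ch, label in zip(recognized, labels):
        if label == "small" and ch in SIMILAR:
            ch = SIMILAR[ch]
        out.append(ch)
    return "".join(out)


# Hypothetical recognizer output and measured character heights for one word.
recognized = "FOOd"
heights = [20, 13, 13, 20]
labels = label_case(heights, word_height=20)
fixed = correct_case(recognized, labels)
```

Because the label depends on height alone, two characters of the same height receive the same label whatever their shapes.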
However, the second prior-art character recognition apparatus still has a problem in differentiating characters with similar shapes, such as the small letter "l" and the capital letter "I", or the small letters "w" and "m", because the apparatus compares characters only with the threshold value for the character height. Also, if words appear bent when the text is read by a scanner, the uppermost positions and the lowest positions of the words are not correct, and incorrect labels are generated. As a result, the character recognition accuracy of the apparatus has not reached a satisfactory level.