This invention pertains to a data processing system which recognizes patterns such as characters by sorting the scores of the candidate patterns according to their feature vectors, using an associative matching method.
This invention aims first at determining feature vectors at high speed, even when a recognition device obtains dictionary data in dot units; second, at determining scores by applying the associative matching recognition method in a recognition device; and third, at sorting inputted data from the highest score at high speed.
With advances in computer systems, reading devices for receiving image data, for extracting characters from received image data and for recognizing respective characters in sentences of read documents are being put into practical use. These devices divide dot data read (e.g. by an image scanner) into predetermined areas and compare the characters within each divided area with preregistered characters, and output the most similar character as the result.
These preregistered data are generally stored in a dictionary memory which memorizes, for example, feature data of the respectively defined characters. When a character to be recognized is inputted, its features are extracted in the same manner, so that the distance (i.e., the difference) from the feature data stored in the dictionary memory is obtained. The character with the least distance is outputted as the recognition result.
Such conventional computer systems for character recognition use a feature vector method for extracting features of data stored in their dictionary memories and the image data of inputted characters. The feature vectors in this method are obtained by determining in dot units the directions of the strokes composing a character, dividing a character area into a plurality of areas, and summing the respective stroke directions in the respective divided areas. The feature vectors are used to improve the recognition rates of the inputted characters.
To determine the number of strokes in respective directions in the divided areas from the obtained stroke data, respective stroke data for the divided areas are read and the values of the registers provided beforehand for the respective directions are incremented. Further, to weight the respective divisions in a character area, the registers provided in correspondence with the stroke directions appropriately weight their values. Because all relevant stroke data must be processed in dot units, there is a problem that it takes too long to determine feature vectors.
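The conventional counting procedure described above can be illustrated by the following minimal sketch. It assumes that each dot has already been assigned a stroke direction code (0 for no stroke, 1 to `n_dirs` for a direction) and that the character area is divided into a square grid; the function name `feature_vector`, the parameters `grid` and `n_dirs`, and the encoding are hypothetical and are not taken from this specification. The weighting of the divisions is omitted for brevity.

```python
from typing import List


def feature_vector(directions: List[List[int]], grid: int = 2, n_dirs: int = 4) -> List[int]:
    """Count stroke directions per divided area, one dot at a time.

    directions[y][x] is 0 where there is no stroke, or a direction
    code in 1..n_dirs (an assumed encoding, for illustration only).
    """
    height = len(directions)
    width = len(directions[0])
    # One counter ("register") per divided area and per direction.
    counts = [0] * (grid * grid * n_dirs)
    for y, row in enumerate(directions):
        for x, d in enumerate(row):
            if d == 0:
                continue  # no stroke at this dot
            # Locate the divided area that contains this dot.
            cell = (y * grid // height) * grid + (x * grid // width)
            counts[cell * n_dirs + (d - 1)] += 1
    return counts
```

Every dot of the character area is visited once, so the work grows with the number of dots, which is the cost referred to above.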
To improve recognition rates, characters are currently read at a higher resolution, so that a large number of dots is required for recognition. This necessarily increases the amount of dot information for a character and further lengthens the processing time.
Such conventional computer systems also determine the distances between the obtained feature data and the feature data memorized beforehand for the respective characters. The feature data are expressed by numeric values representing the features of the respective parts of characters and are memorized in matrices of at least two dimensions. Distances are defined as the sum of the squares of the differences between the values obtained for a character to be recognized and the values memorized in the matrices at respective feature points. The code of the character with the least distance (i.e., the smallest sum of the squares of the differences) is outputted as the recognition result of the inputted character.
When the character preregistered in the dictionary having the smallest distance with the inputted character is determined, the respective distances between the features of all the characters prestored in the dictionary matrices and the features of an inputted character must be calculated. The features are expressed by a large number of values stored in the multi-dimensional matrices, not by single values. Thus, the number of operations for determining the distances by accumulating the squares of the differences between respective features is enormous and takes too much time.
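The distance calculation and the exhaustive dictionary search described above can be sketched as follows. This is an illustration only: the feature data are flattened into plain lists, and the names `squared_distance`, `nearest_character` and the dictionary layout (character code mapped to feature list) are assumptions made for the example.

```python
def squared_distance(a, b):
    """Sum of the squares of the differences at respective feature points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_character(features, dictionary):
    """Return the code of the dictionary character with the least distance.

    dictionary maps a character code to its memorized feature values
    (a hypothetical layout). Every entry is compared with the input.
    """
    return min(dictionary, key=lambda code: squared_distance(features, dictionary[code]))
```

The number of multiplications is the number of registered characters times the number of feature values per character, which is why the operation count becomes enormous for a large dictionary.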
At the same time, such conventional computer systems for character recognition need to rearrange the codes of the preregistered characters to determine the rank of the distances. Alternatively, character areas are divided into a plurality of sub-areas, scores are assigned to the respective sub-areas starting from the one with the least distance, and the scores are summed, so that the candidate characters are sequentially obtained from the candidate character with the highest score. In either case, in determining the candidate characters, the characters need to be rearranged from the one with the highest score (i.e., from the one with the least distance).
Conventionally, the distances or scores are sorted sequentially to determine the rank of the candidate characters, each time a new distance or score is inputted.
For instance, when the five upper-ranked candidate characters with the highest scores and least distances are obtained, memory for storing the five highest scores and the corresponding character codes is provided. The memory compares the highest memorized score with the inputted score.
When the inputted score is less than the highest memorized score, the inputted score is compared with the second highest memorized score. When the inputted score is less than the second highest memorized score, the inputted score is compared with the third highest memorized score. When the inputted score is less than the third highest memorized score, the inputted score is compared with the fourth highest memorized score. When the inputted score is less than the fourth highest memorized score, the inputted score is compared with the fifth highest memorized score.
If the inputted score is greater than or equal to the memorized score with which it is compared, the inputted score and the corresponding character code are stored at the original memory positions for the compared score and the corresponding character code, and the compared score and the corresponding character code, as well as the lower scores and their character codes, are sequentially shifted to the original memory positions for the next lower scores and the corresponding character codes, so that their ranks are sequentially lowered. Since this processing is performed each time a new score and the corresponding character code are inputted, it takes too much time.
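The sequential comparison and shifting described above may be sketched as follows, assuming that the memory holds at most five (score, character code) pairs ordered from the highest score. The function name `insert_candidate` and the list representation are hypothetical.

```python
def insert_candidate(ranking, score, code, size=5):
    """Insert (score, code) into a list kept sorted from the highest score.

    The inputted score is compared with the memorized scores in rank
    order. At the first memorized score that it equals or exceeds, it
    takes that position and the lower entries are shifted down by one
    rank. Entries beyond `size` are discarded.
    """
    for i, (kept_score, _) in enumerate(ranking):
        if score >= kept_score:
            ranking.insert(i, (score, code))  # shifts lower ranks down
            break
    else:
        ranking.append((score, code))  # lower than every memorized score
    del ranking[size:]
    return ranking
```

Each inputted score may require up to five comparisons and up to five shifts, and this is repeated for every score that is inputted.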
This method, in which the character codes with the shortest distances obtained from the feature data are outputted, is called the one-hundred-percent (100%) matching recognition method. According to this method, the more candidate characters there are, the more time their recognition takes.
The associative matching recognition method is devised to solve these problems.
According to this method, the character areas to be recognized (i.e., feature areas) are divided, and the representative features for the respective divisions are memorized as classes to which candidate characters belong. Classes similar to the inputted feature vectors for the respective division areas are ranked, and scores commensurate with the ranks are assigned to the classes. Then, the scores of the classes to which candidate characters belong are summed.
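The scoring described above can be illustrated by the following minimal sketch. The data layout, the function name `associative_scores`, and the rank-to-score table `rank_scores` are assumptions made for the example; the specification states only that the scores are commensurate with the ranks.

```python
def associative_scores(sub_features, class_reps, membership, rank_scores=(3, 2, 1)):
    """Sum, per candidate character, the scores of the classes it belongs to.

    sub_features[r]     feature vector of the input for division r
    class_reps[r]       dict: class id -> representative features for division r
    membership[code][r] class id to which character `code` belongs in division r
    rank_scores         score for the 1st, 2nd, ... ranked class (assumed values)
    """
    totals = {code: 0 for code in membership}
    for r, feat in enumerate(sub_features):
        reps = class_reps[r]
        # Rank the classes of this division by distance to the input features.
        ranked = sorted(
            reps,
            key=lambda c: sum((x - y) ** 2 for x, y in zip(feat, reps[c])),
        )
        points = {c: rank_scores[i] for i, c in enumerate(ranked[:len(rank_scores)])}
        # Add the score of the class to which each candidate belongs.
        for code, classes in membership.items():
            totals[code] += points.get(classes[r], 0)
    return totals
```

In this sketch the distance is computed only against the class representatives of each division rather than against every registered character, and the candidates are then ordered by their summed scores.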
However, even the associative matching recognition method cannot meet more severe demands for improved processing speed in character recognition, because the recognition processing speed becomes unsatisfactory when the number of characters registered in the dictionary, or the number of candidate characters selected for an inputted character to be recognized, increases.