Apparatus and methods of optical character recognition (OCR) have existed for over thirty years. During that time, improvements in OCR capability and reliability, and reductions in cost, have been primarily the result of improvements in equipment. Methods of processing data for OCR have traditionally recognized unknown characters of a known character set by considering all symbols as being equally important. They commonly utilize techniques which distinguish all possible symbols from all other symbols in a parallel, rather than serial, manner.
The present invention describes a new method and apparatus for processing data in which the most frequently occurring symbols are deemed the most important. An ordered sequence of recognition steps is applied to the data representing the unknown characters to first recognize only those characters of a group having a higher frequency of occurrence. Groups of characters having lower frequencies of occurrence are recognized subsequently. It is an ideal method for microprocessor implementation because one may accomplish the same recognition result with fewer steps by distinguishing the data in several small sets of characters instead of the complete set of characters. For instance, by first recognizing a small set of characters having the greatest likelihood or frequency of occurrence, the average number of recognition steps is substantially reduced. This method and apparatus is especially applicable for recognizing character fonts that contain a large number of symbols, such as Japanese Katakana or English text.
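The reduction in the average number of recognition steps can be illustrated with a short calculation. The following is a minimal sketch only: the frequency figures and the number of tests per stage are hypothetical values chosen for illustration and are not taken from this disclosure, and the function name `expected_steps` is likewise an assumption. It further assumes that every stage applies a fixed number of tests and that a character recognized at a given stage has also incurred the tests of every earlier stage.

```python
def expected_steps(stages):
    """Average number of tests per character for an ordered sequence of stages.

    `stages` is a list of (fraction_of_characters, tests_in_stage) pairs in the
    order the stages are applied. A character recognized at stage k is assumed
    to have paid for the tests of stages 1 through k.
    """
    total = 0.0
    cumulative_tests = 0
    for fraction, tests in stages:
        cumulative_tests += tests
        total += fraction * cumulative_tests
    return total


# Hypothetical figures: 70% of characters fall in the first (most frequent)
# group, 20% in the second, and 10% in the third.
staged = [(0.70, 4), (0.20, 6), (0.10, 10)]

# Hypothetical single logic that always applies 12 tests to every character.
single = [(1.0, 12)]

print(expected_steps(staged))  # roughly 6.8 tests on average
print(expected_steps(single))  # 12 tests on average
```

Under these assumed figures the staged arrangement averages fewer tests per character even though a rarely occurring character costs more (20 tests instead of 12), because the extra cost is incurred by only a small fraction of the data.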
Briefly, the method comprises the application of an ordered sequence of independent sets of discriminatory tests that operate and are arranged based on the frequency of occurrence of certain subsets of the character symbols. This is to be compared with the prior art type of tree-style decision logic illustrated in FIG. 2. The prior art tree logic attempts to recognize all symbols with a single recognition logic in a unified process, regardless of how frequently or infrequently each individual symbol occurs in normal language usage. Specifically, the character image data is tested progressively by a series of binary tests forming a decision tree, where each test is represented as a node and can be made on individual image bits or on collective image data known as measurements or features. The tests continue until a particular character of a defined set of characters has been unambiguously identified, as at node 27. Data not identified continues down the tree until it is identified or a reject code is generated signifying that the data was not recognized. Further, more comprehensive tests can then be applied, again attempting to distinguish among all possible characters. The design of such a decision tree logic system, the method of processing the data, and the equipment implementing such a method increase exponentially in cost and complexity with increases in the recognition rate and recognition accuracy. Further, it is an inherent characteristic of the tree-style decision logic that it is keyed to the characteristics of the entire population of characters, rather than discrete subsets thereof. Each succeeding test is dependent upon the preceding test in an attempt to recognize all characters in the set as opposed to recognizing only a subset of characters based upon their frequency of occurrence.
Thus, one may expend a disproportionate amount of computing time or utilize a disproportionate amount of equipment distinguishing among characters having a low frequency of occurrence, which seriously degrades the utilization efficiency of the machine without any significant improvement in the recognition rate, accuracy or performance. An example of such a tree-style logic is disclosed in IBM Technical Disclosure Bulletin Vol. 23, No. 8, January 1981.
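For comparison, the prior art tree-style decision logic described above may be sketched as follows. This is an illustrative toy only and is not the logic of FIG. 2 or of the cited bulletin: the feature names `closed_loop` and `vertical_stroke`, the four-outcome tree, and the names `Node`, `Leaf`, and `classify` are all hypothetical. Each node applies one binary test to the character data, and a leaf holds either an identified character or a reject (represented here as `None`).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Leaf:
    label: Optional[str]  # None stands for a reject code


@dataclass
class Node:
    test: Callable[[dict], bool]      # one binary test on the character data
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree, features):
    """Walk the tree from the root; return (label, number_of_tests_applied)."""
    node = tree
    steps = 0
    while isinstance(node, Node):
        steps += 1
        node = node.if_true if node.test(features) else node.if_false
    return node.label, steps


# Hypothetical toy tree over two made-up boolean features.
tree = Node(
    test=lambda f: f["closed_loop"],
    if_true=Node(lambda f: f["vertical_stroke"], Leaf("D"), Leaf("O")),
    if_false=Node(lambda f: f["vertical_stroke"], Leaf("L"), Leaf(None)),
)

print(classify(tree, {"closed_loop": True, "vertical_stroke": False}))
print(classify(tree, {"closed_loop": False, "vertical_stroke": False}))
```

In this sketch every character, frequent or rare, is routed through the same single tree, and each test a character reaches depends on the outcome of the test before it.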
The present invention utilizes a novel ordered sequence of recognition logic sets rather than the parallel arrangement found in the typical tree-style decision logic. Each logic set is uniquely tailored to recognize the characteristics of only a select population or subset of all of the characters. Thus, each set of tests is not dependent upon the preceding set of tests and is not required to take into account all of the characteristics of the entire population of characters. This optimizes both recognition performance and processing time.
In addition, for character sets having a large number of symbols, this method and apparatus provide a convenient means of decomposing the recognition problem into a set of smaller problems from which optimum benefit may be obtained. The smaller recognition sets also make the recognition problem a more manageable task for the designer of the recognition logic, and dramatically decrease the number of recognition steps. Thus, this method permits the application of a multiplicity of independent character recognition logic sets uniquely adapted to identify controlled subsets of the entire character set. One may accordingly optimize equipment use and processing time to obtain a higher recognition rate of the more frequently occurring characters in a shorter length of time. Freeing the recognition logic from the constraints imposed by the prior art tree-style decision logic minimizes the amount of logic while focusing the recognition effort on the most frequently occurring characters. Moreover, the recognition accuracy and throughput are increased without an increase in storage area. Further advantages include the use of simpler and discrete recognition logic sets rather than a seemingly endless and interrelated set of cascaded tests. Therefore, one may optimize the various logic sets to recognize differing subsets, with the bulk of the effort concentrated on those characters most frequently occurring. For instance, the recognition accuracy for the more frequently occurring characters may be improved by including additional tests that are not necessary or desirable for the less frequently occurring characters.
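The ordered sequence of independent recognition logic sets may be sketched in the same manner. The following is a minimal illustration under stated assumptions, not the logic of the preferred embodiment: each logic set is reduced to a simple lookup over made-up pattern names (`pat_E`, `pat_Z`, and so on), and the names `make_logic_set` and `recognize` are hypothetical. The point illustrated is only the control flow: each logic set knows about its own subset of characters, the sets are tried in order of frequency of occurrence, and data not recognized by one set is passed to the next.

```python
from typing import Callable, Optional, Sequence

# A logic set maps character data to a label, or None if it does not recognize it.
LogicSet = Callable[[dict], Optional[str]]


def make_logic_set(templates: dict) -> LogicSet:
    """Build a toy logic set that recognizes only the patterns in `templates`."""
    def logic(features: dict) -> Optional[str]:
        return templates.get(features["pattern"])
    return logic


def recognize(features: dict, logic_sets: Sequence[LogicSet]):
    """Apply the logic sets in order; return (label, stage_reached).

    A label of None means every set failed, i.e. the data is rejected.
    """
    for stage, logic in enumerate(logic_sets, start=1):
        label = logic(features)
        if label is not None:
            return label, stage
    return None, len(logic_sets)


# Hypothetical subsets: frequently occurring characters first, rarer ones second.
frequent = make_logic_set({"pat_E": "E", "pat_T": "T", "pat_A": "A"})
rare = make_logic_set({"pat_X": "X", "pat_Z": "Z"})
stages = [frequent, rare]

print(recognize({"pattern": "pat_E"}, stages))  # found by the first logic set
print(recognize({"pattern": "pat_Z"}, stages))  # falls through to the second
print(recognize({"pattern": "pat_?"}, stages))  # rejected by every set
```

Because the logic sets in this sketch share no state, one set can be given additional tests, or the order of the list can be changed, without redesigning the others.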
Further advantages include the ability to tailor the method of processing to specific needs. For instance, while the letters X, Y and Z may be among the least frequently occurring characters when the English language is used for everyday communication, those letters may be among the most frequently used in scientific or mathematical applications. The method of the present invention has the flexibility to accommodate such changes. To further enhance the method and apparatus of the present invention, the concatenation of the recognition logic sets may occur in either or both the horizontal and vertical directions.
Accordingly, it is an object of the present invention to provide a method and apparatus for processing data that improves the efficiency of optical character recognition through improvements in accuracy and speed of throughput.
It is a further object of the present invention to provide a method and apparatus for processing data that recognizes unknown characters of a known character set based in part upon the frequency of occurrence of the characters.
It is a further object of the present invention to provide a method and apparatus for processing data for recognizing unknown characters of a known character set by applying a series of discrete stages of discriminatory tests to recognize the image data.