One technique for character recognition is to establish a decision tree in which different pixels are successively examined, and branches are taken at each pixel examination based on whether the pixel is black or white. Such decision trees are known and described in the art.
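As a rough illustration of the pixel-by-pixel branching described above, the following minimal sketch walks a hand-built tree over a tiny binary image. The `Leaf` and `Node` types, the `classify` function, and the 3x3 toy glyphs are hypothetical names and data chosen only for this example; they are not taken from any particular OCR device.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    """Terminal node: the character (or class of characters) decided on."""
    label: str


@dataclass
class Node:
    """Internal node: examine one pixel, then branch on its color."""
    row: int
    col: int
    if_black: "Tree"
    if_white: "Tree"


Tree = Union[Leaf, Node]


def classify(tree: Tree, image: Sequence[Sequence[int]]) -> str:
    """Follow branches until a leaf is reached (1 = black, 0 = white)."""
    node = tree
    while isinstance(node, Node):
        pixel_is_black = bool(image[node.row][node.col])
        node = node.if_black if pixel_is_black else node.if_white
    return node.label


# Toy 3x3 glyphs, assumed for illustration only.
GLYPH_I = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
GLYPH_L = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
GLYPH_T = [[1, 1, 1], [0, 1, 0], [0, 1, 0]]

# Examine the top-left pixel first; if it is black, examine the bottom-right.
TOY_TREE = Node(
    row=0, col=0,
    if_black=Node(row=2, col=2, if_black=Leaf("L"), if_white=Leaf("T")),
    if_white=Leaf("I"),
)
```

In this toy tree, two pixel examinations are enough to separate the three glyphs; a tree for a real font would examine many more pixels and would have to tolerate noise in the scanned image.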
An optical character recognition device is generally only capable of recognizing characters printed in a font that the OCR device has been trained to recognize. Typically, an OCR device is programmed by its manufacturer to recognize a few common type fonts. But, with the proliferation of laser printers and the widespread ability to generate numerous type fonts, many users desire the ability to train their OCR devices to recognize characters in a new font, different from any included by the manufacturer.
Designing a decision tree that enables an OCR device to recognize characters in a particular font requires knowing, for each character, the probability that a given pixel in the block of pixels representing that character is black. These probabilities are used to identify the pixels that are useful in differentiating that character from others. An approach to decision tree design is described in Casey et al., "Decision Tree Design Using a Probabilistic Model," IEEE Transactions on Information Theory, Vol. IT-30, No. 1, pp. 93-99 (1983).
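One simple way to see why per-pixel probabilities of black help in choosing pixels is sketched below. The heuristic used here, picking the pixel where two characters' probabilities differ most, is an assumption made for illustration; it is not the design procedure of the cited Casey et al. paper. The function name `most_discriminating_pixel` and the example values are likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def most_discriminating_pixel(
    prob_a: Sequence[Sequence[float]],
    prob_b: Sequence[Sequence[float]],
) -> Tuple[Tuple[int, int], float]:
    """Return the (row, col) where the two probability maps differ most.

    prob_a and prob_b hold, per pixel, the probability that the pixel is
    black for character A and character B respectively. This is an
    illustrative heuristic only (an assumption, not a published method).
    """
    best_position = (0, 0)
    best_gap = -1.0
    for r, (row_a, row_b) in enumerate(zip(prob_a, prob_b)):
        for c, (p_a, p_b) in enumerate(zip(row_a, row_b)):
            gap = abs(p_a - p_b)
            if gap > best_gap:
                best_position, best_gap = (r, c), gap
    return best_position, best_gap


# Hypothetical 2x2 probability maps for two characters.
PROB_A: List[List[float]] = [[1.0, 0.25], [0.5, 1.0]]
PROB_B: List[List[float]] = [[1.0, 0.75], [0.5, 0.0]]
```

A pixel that is almost always black in one character and almost always white in another is a good candidate for a branch; a pixel with similar probabilities in both characters carries little information for telling them apart.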
("Character" is used herein to identify the printed or written symbol to be recognized by the OCR device. A character is typically a letter, a numeral, or some other symbol. It is to be recognized that "character" may also refer to a class of symbols, as when two symbols are very similar and the OCR decision logic does not necessarily distinguish between them, e.g., the numeral "1" and lower case "l" in some type fonts. Also, the following description refers to black and white pixels, although other combinations of distinguishable colors may also be used. Further, in the context of black and white pixels, methods and apparatus are described based on black pixels and probabilities of black. Decision tree logic can also be based on white pixels and probabilities of white.)
To obtain the probabilities of black for individual pixels in each character, hundreds of samples of each character in the font to be recognized must be examined. Collecting and identifying the samples is time-consuming and expensive. Often, such a large number of samples is not available.
Thus, a need exists for a technique for designing OCR decision trees with very few (ideally only one) training samples of each character.
A conventional decision tree generation process is shown in FIG. 1. A large plurality of training samples in the font to be recognized is printed. The samples are scanned to generate an array of pixels. For each character or class of characters to be identified, all the samples of that character or class of characters are superimposed, and the probability that a particular pixel is black is computed as the number of samples in which that pixel is black divided by the total number of samples. From such probabilities, certain pixels can be selected for use in generating a decision tree to use in recognizing characters in that font. This procedure usually requires an extremely large number of samples for each character or class of characters, typically on the order of 100 to 200 samples, for accurate character recognition.
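The superimpose-and-count step of this conventional process can be sketched as follows. The function name `black_probabilities` and the four 2x2 samples are hypothetical and serve only to make the arithmetic concrete; the sketch assumes every sample of a character has already been scanned, aligned, and reduced to the same pixel dimensions.

```python
from typing import List, Sequence


def black_probabilities(
    samples: Sequence[Sequence[Sequence[int]]],
) -> List[List[float]]:
    """Estimate, per pixel, the probability that the pixel is black.

    Each sample is a 2-D array of pixels for the same character
    (1 = black, 0 = white). The estimate for a pixel is the number of
    samples in which it is black divided by the number of samples.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    rows, cols = len(samples[0]), len(samples[0][0])
    counts = [[0] * cols for _ in range(rows)]
    for sample in samples:
        for r in range(rows):
            for c in range(cols):
                if sample[r][c]:
                    counts[r][c] += 1
    n = len(samples)
    return [[counts[r][c] / n for c in range(cols)] for r in range(rows)]


# Four hypothetical 2x2 scans of the same character.
SAMPLES = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 1]],
]
PROBS = black_probabilities(SAMPLES)
```

With only four samples, each estimate can take just five distinct values (0, 0.25, 0.5, 0.75, 1), which suggests why a conventional process calls for on the order of 100 to 200 samples per character.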