The present application is directed to data classification, and more particularly to use of data classification in data recognition systems, including, but not limited to, optical character recognition (OCR) systems.
Two approaches to data classification that have been extensively studied in the machine learning literature are generative models and discriminative models. Generative models learn a joint probability density function, p(x, y), between data x and their labels y, or equivalently the likelihood p(x|y) and the prior p(y). In the latter case, Bayes' rule can be applied to obtain the posterior distribution p(y|x), which is maximized to predict the class labels of new data. On the other hand, discriminative models either estimate the posterior p(y|x) directly, or they compute decision boundaries between different classes.
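The generative route described above can be illustrated with a minimal sketch. It assumes a hypothetical two-class problem in one dimension, with Gaussian class-conditional likelihoods and equal priors; the class names, parameters, and function names are illustrative only and are not taken from the present application.

```python
import math


def gaussian_pdf(x, mean, std):
    """Class-conditional likelihood p(x|y) for a 1-D Gaussian class model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(likelihoods, priors):
    """Bayes' rule: p(y|x) = p(x|y) p(y) / sum over y' of p(x|y') p(y')."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical class models (mean, standard deviation) and priors.
CLASS_MODELS = {"a": (0.0, 1.0), "b": (3.0, 1.0)}
PRIORS = {"a": 0.5, "b": 0.5}


def classify(x):
    """Return the label maximizing the posterior, plus the full posterior."""
    likelihoods = {
        y: gaussian_pdf(x, mean, std) for y, (mean, std) in CLASS_MODELS.items()
    }
    post = posterior(likelihoods, PRIORS)
    return max(post, key=post.get), post


if __name__ == "__main__":
    label, post = classify(0.4)
    print(label, post)
```

A discriminative model would skip the likelihood and prior entirely and fit the posterior, or a decision boundary, directly from labeled data.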
It is considered that, in many applications, discriminative models are easier to train than generative models and can achieve higher classification accuracy. However, generative models have their own advantages, which include: (1) establishing an intrinsic linkage to hidden variables or missing data in an Expectation Maximization (EM) learning and inference framework; (2) being flexible and robust enough to handle complicated scenarios, such as detecting visual objects in cluttered backgrounds with occlusion; (3) often having superior performance to discriminative models when training data sets are small; and (4) being suited to incremental learning, in that whenever a new class emerges, or an existing class model needs updating, training can be conducted only on the relevant portion of the data. By contrast, discriminative models have to be re-trained with all of the data to adapt to the changes.
The above suggests that it is advantageous to combine the two complementary models into a hybrid framework which is not only flexible in learning, but also has high performance in terms of prediction accuracy and computational efficiency. There are several examples of approaches which employ discriminative and generative methods together. For example, T. S. Jaakkola and D. Haussler describe such use for classifiers in the article, “Exploiting Generative Models In Discriminative Classifiers”, Neural Information Processing Systems (NIPS) 11, 487-493, 1998. In K. Tsuda, M. Kawanabe and K-R Muller's article, “Clustering with the Fisher score”, Neural Information Processing Systems (NIPS), 2002, Fisher scores are obtained from a learned generative model and are used for classification and clustering purposes. The article by S. Tong and D. Koller, “Restricted Bayes Optimal Classifiers”, National Conference on Artificial Intelligence (AAAI), 2000, proposed a notion of restricted Bayes optimal classifiers in which the Bayes optimal classifiers are adjusted according to maximum margin classification criteria, and the article by R. Raina, Y. Shen, A. Y. Ng and A. McCallum, entitled “Classification With Hybrid Generative/Discriminative Models”, Neural Information Processing Systems (NIPS) 2003, employed naive Bayes and logistic regression as a “generative-discriminative” pair and applied it to document classification. However, the above concepts are not particularly applicable to improving the accuracy and ease of use of optical character recognition, to which the present application is directed.
Most commercial optical character recognition (OCR) tools focus on general character shapes and are not flexible enough to adapt to specific application settings, especially on images with noise and clutter such as shown in FIG. 1. Re-trainable font-specific approaches appear to provide the greatest accuracy when the font is known. However, preparing training examples usually requires a highly skilled technician, and even then it is a tedious and often prohibitively expensive manual effort. Therefore, recent research in the field has focused on ease of training preparation, especially on noisy images. P. Sarkar and H. S. Baird, in “Decoder Banks: Versatility, Automation, And High Accuracy Without Supervised Training”, Int'l Conf. on Pattern Recognition (ICPR), volume 2, 646-649, 2004, used a decoder bank that was composed of an array of pre-trained fonts, to avoid supervised training. To remove the need for ground truth in training, H. Ma and D. Doermann, in “Adaptive OCR With Limited User Feedback”, Int'l Conf. on Document Analysis and Recognition (ICDAR), 814-818, 2005, proposed a methodology to cluster images of the same glyph, while J. Edwards and D. Forsyth, in an article entitled, “Searching for Character Models”, Neural Information Processing Systems (NIPS), 2005, iteratively improved a character model by gathering new training data from high confidence regions. Ground truth is intended herein to refer to training data (examples) which are correctly labeled according to the categories they fall into.
In studying a commercial setting where an OCR solution is needed for images with printed characters that vary only slightly in font shape, but include severe degradation, as shown in FIG. 1, it has been found that template-based techniques have superior robustness to noise and clutter. Nevertheless, template-based techniques have their own limitations. A font template solution that is based on the independent bit flip model was proposed by G. E. Kopec in “Multilevel Character Templates For Document Image Decoding”, Document Recognition IV, SPIE 3027, 1997. This solution is, however, found to be too sensitive to variations in font shapes and degradations. If one attempts to address this drawback with a decoder bank of templates that are trained for all possible font variations and degradations, then the tedious and difficult tasks of categorizing glyph images according to fonts and degradations need to be undertaken, which lowers the practicability of such a system.
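To make the idea of an independent bit flip model concrete, the following is a simplified sketch and not a reproduction of the multilevel template method of the cited article. It assumes binary images, a single binary template per character, and two hypothetical parameters, alpha1 and alpha0, giving the probability that a foreground or background template pixel is observed unflipped. The tiny 3x3 templates and all function names are illustrative only.

```python
import math


def bit_flip_log_likelihood(image, template, alpha0=0.9, alpha1=0.9):
    """Log-probability of a binary image given a binary template.

    Each pixel is treated as independent (an assumption of this sketch):
    alpha1 is the probability that a foreground (1) template pixel is
    observed as 1, and alpha0 the probability that a background (0)
    template pixel is observed as 0.
    """
    total = 0.0
    for image_row, template_row in zip(image, template):
        for observed, expected in zip(image_row, template_row):
            if expected == 1:
                p = alpha1 if observed == 1 else 1.0 - alpha1
            else:
                p = alpha0 if observed == 0 else 1.0 - alpha0
            total += math.log(p)
    return total


def best_template(image, templates, alpha0=0.9, alpha1=0.9):
    """Return the name of the template with the highest log-likelihood."""
    return max(
        templates,
        key=lambda name: bit_flip_log_likelihood(
            image, templates[name], alpha0, alpha1
        ),
    )


# Hypothetical 3x3 templates for two characters.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

# An "I" with one background pixel flipped by noise.
NOISY_I = [[0, 1, 0], [0, 1, 1], [0, 1, 0]]

if __name__ == "__main__":
    print(best_template(NOISY_I, TEMPLATES))
```

Because every pixel mismatch is penalized in the same way regardless of its cause, a glyph rendered in a slightly different font shape accumulates penalties just as a noisy glyph does, which is one way to see the sensitivity noted above.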
The above considerations have therefore made it appear useful to search for a better scheme that is less sensitive to variation in fonts and degradations.