1. Field of the Invention
The present invention relates to handwriting recognition systems. More particularly, the present invention relates to a handwriting recognition system and method for unconstrained handwritten cursive words, either created on-line (words written directly on a touch computer screen) or off-line (words written on other media and the word image is then captured for use in the recognition system).
2. Background Art
Unconstrained handwritten words including, in full or in part, cursively written characters present a challenge for handwritten word recognition (HWR) systems because of the variation in writing styles and the difficulty in determining the boundaries between the individual characters. HWR systems generally include an image processing stage, a feature extraction stage, and a classification stage. The image processing stage may generally include image capture for on-line and off-line written words, noise reduction, skew correction, slant normalization and/or segmentation. Generally, there are two information sources associated with the written text: shape information and linguistic information. The goal of the feature extraction stage is to extract or compute representative features of individual characters, a group of consecutive characters, and/or the word. The classification stage usually occurs after the HWR system has been trained on sets of known data (e.g., a database of known handwritten words/characters coupled with linguistic or dictionary knowledge) in order to develop the parameters which are used in classifying the imaged word.
Many HWR systems follow a segmentation procedure for segmenting the given word image in order to determine the individual characters using mathematical modeling in the classification stage. An over-segmentation procedure based on morphology and linguistic information is used to translate a 2-D word image into a 1-D sequence of sub-character symbols. The imaged word is segmented generously to ensure that genuine segmentation points (i.e., correct segmentation points of cursive character handwritten words) are captured as a subset of all segmentation points. For example, and for illustration purposes only, FIG. 6A shows an image of an Arabic handwritten word, and FIG. 6B illustrates a segmentation of this imaged word having a set of twelve segmentation points. Four of these are true segmentation points while the other eight are false segmentation points. In this segmentation, small segments or diacritics belonging to a main character body were lumped together as one segment. Accordingly, for illustration in FIG. 6B, similar hatch marks fill these small segments. The combination of true and false segmentation points results in “over-segmentation” of the imaged word.
Features are selected to represent the shape information of character and sub-character symbols, and the sequence of symbols is modeled by a recognition algorithm. Recognition algorithms include the Hidden Markov Model (HMM), which has been used in many HWR systems to recognize hand-printed words, but many of these systems have not made any connection between modeling the over-segmentation of characters and the variable duration states of an HMM. Instead, these systems have followed the traditional path of modeling each character by an HMM, and then concatenating character HMMs into a word. By contrast, in a Variable Duration Hidden Markov Model (VDHMM), different images of like characters may be divided into one or more segments, and a probability is obtained that a particular number of combined consecutive segments correctly defines the character. In this context, the variable duration state is used to account for the segmentation ambiguity among the consecutive characters. Consequently, determining the order of the segments becomes important because, if the order is incorrect, no combination of consecutive segments will yield the true character.
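For illustration only, the variable duration idea described above can be sketched as a Viterbi-style search in which each character state consumes between one and max_dur consecutive segments. This is a minimal sketch under stated assumptions and is not the procedure of any particular HWR system: the function name vdhmm_decode, the probability tables init, trans and dur, the likelihood callback obs, and the toy values are all hypothetical.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    # Log of a probability, with log(0) mapped to -inf.
    return math.log(p) if p > 0.0 else NEG_INF

def vdhmm_decode(n_segments, chars, init, trans, dur, obs, max_dur=4):
    """Best character sequence over n_segments ordered segments.

    init[c], trans[p][c] and dur[c][d] are probabilities (hypothetical tables);
    obs(c, start, end) is the likelihood that segments start..end-1, merged,
    form character c. Returns (log_score, [(char, start, end), ...]).
    """
    # best[t][c]: best log score of a path whose last character c ends at boundary t.
    best = [dict.fromkeys(chars, NEG_INF) for _ in range(n_segments + 1)]
    back = [dict.fromkeys(chars) for _ in range(n_segments + 1)]
    for end in range(1, n_segments + 1):
        for c in chars:
            for d in range(1, min(max_dur, end) + 1):
                start = end - d
                emit = _log(dur[c].get(d, 0.0)) + _log(obs(c, start, end))
                if emit == NEG_INF:
                    continue
                if start == 0:
                    candidates = [(_log(init[c]) + emit, None)]
                else:
                    candidates = [(best[start][p] + _log(trans[p][c]) + emit, p)
                                  for p in chars]
                for score, prev in candidates:
                    if score > best[end][c]:
                        best[end][c] = score
                        back[end][c] = (start, prev)
    last = max(chars, key=lambda c: best[n_segments][c])
    score = best[n_segments][last]
    if score == NEG_INF:
        return NEG_INF, []
    # Walk the back-pointers from the final boundary to the first segment.
    path, end, c = [], n_segments, last
    while c is not None:
        start, prev = back[end][c]
        path.append((c, start, end))
        end, c = start, prev
    path.reverse()
    return score, path

# Toy example (all values assumed): three segments, two characters.
chars = ["a", "b"]
init = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
dur = {"a": {1: 0.2, 2: 0.8}, "b": {1: 0.9, 2: 0.1}}

def obs(c, start, end):
    # Pretend the shape classifier strongly prefers these two groupings.
    return 1.0 if (c, start, end) in {("a", 0, 2), ("b", 2, 3)} else 0.1

score, path = vdhmm_decode(3, chars, init, trans, dur, obs)
```

In this toy setting the search groups the first two segments into one character and assigns the third segment to another. Because only consecutive segments are ever merged, a segment that is out of order cannot be recovered by the search, which is why the segment order matters.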
In a given language, diacritics are part of characters and provide features that may be taken into account for character and word recognition. Certain languages, such as Arabic, for example, are not only written as script, but also employ many diacritics. These diacritics should be considered in order to improve the accuracy of the character and word recognition. Moreover, existing HWR systems developed for one language may not be applied in a straightforward manner to another language having different characters in the alphabet and, therefore, needing different representative features of the characters. For example, Arabic presents a particular challenge for HWR systems since character shape is context sensitive and all orthography is cursive. Up to four forms of the same character (isolated, initial, middle and final) exist in Arabic. Further, because of the difficulty associated with segmentation of handwritten text, direct word recognition systems which bypass segmentation and individual character recognition have been considered; these systems, however, work only for a fixed lexicon and cannot be ported to changing applications.
A VDHMM-based HWR system has been proposed in Chen, M.-Y., et al., “Variable duration hidden Markov model and morphological segmentation for handwritten word recognition,” IEEE Trans. on Image Processing 4(12):1675-1687 (1995) (referred to hereinafter as “Ref1”), incorporated herein by reference in its entirety. However, the HWR system developed in Ref1, which was proposed for English handwriting, does not employ a segmentation procedure which recognizes diacritics and other accents or small segments as being associated with their consecutive main character body. Rather, the segmentation procedure may either eliminate the diacritics and small segments or label them as being at the beginning or end of a segment sequence of the word or of a group of cursively connected characters of the word (such as when the handwritten word includes a break between cursively connected characters). As a result, it becomes less likely that combining a given number of consecutive segments will yield a correct segment sequence including both the character main body and its associated diacritic/small segment. Furthermore, in many systems, the number of features can vary from 20 to 60; many of these features are discrete with 2 to 4 possible values. In Ref1, all features, irrespective of their discrete nature, are modeled with continuous density.
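For illustration only, the ordering problem described above can be sketched as follows: small segments are attached to the horizontally nearest main body, and each main body is emitted immediately followed by its attached small segments. This is a minimal sketch under stated assumptions and is not the segmentation procedure of Ref1 or of any particular system: the Segment fields, the area threshold small_area, and the nearest-center rule are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    x_min: float
    x_max: float
    area: float

def center(seg):
    # Horizontal center of the segment's bounding box.
    return (seg.x_min + seg.x_max) / 2.0

def order_segments(segments, small_area=20.0, right_to_left=True):
    """Return segments ordered so each small segment follows its main body."""
    mains = [s for s in segments if s.area >= small_area]
    smalls = [s for s in segments if s.area < small_area]
    if not mains:
        return sorted(segments, key=center, reverse=right_to_left)
    # Main bodies follow the writing direction (right to left for Arabic).
    mains.sort(key=center, reverse=right_to_left)
    attached = {i: [] for i in range(len(mains))}
    for small in smalls:
        # Assumed rule: attach to the main body with the nearest center.
        nearest = min(range(len(mains)),
                      key=lambda i: abs(center(mains[i]) - center(small)))
        attached[nearest].append(small)
    ordered = []
    for i, main in enumerate(mains):
        ordered.append(main)
        ordered.extend(sorted(attached[i], key=center, reverse=right_to_left))
    return ordered

# Toy input (assumed values) with the small segments listed at the end.
segments = [
    Segment("body1", 80, 100, 200),
    Segment("body2", 40, 70, 250),
    Segment("body3", 0, 30, 180),
    Segment("dot_a", 50, 55, 5),
    Segment("dot_b", 88, 92, 4),
]
ordered_names = [s.name for s in order_segments(segments)]
```

With an ordering of this kind, a main body and its diacritic occupy consecutive positions, so a combination of consecutive segments can contain both.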
What is needed, therefore, is an HWR system and method for cursive character handwriting that employs an algorithm that determines a segment sequence in which diacritics and small segments are placed next to their respective main character bodies. Additionally, what is needed is a proper distribution modeling of individual features, with continuous distribution modeling of continuous features, and discrete modeling of discrete features. The present invention satisfies these and other needs, such as the need for a comprehensive string matching post-processing algorithm based on empirically derived character confusion information and linguistic constraints, and provides further related advantages, as will be made apparent by the description of the present invention that follows.
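For illustration only, modeling discrete features with discrete distributions and continuous features with continuous densities can be sketched as a per-feature log-likelihood sum. This is a minimal sketch under stated assumptions: the features are treated as independent, continuous features are given a single Gaussian density, and the feature names loops and aspect, the model layout, and all numeric values are hypothetical.

```python
import math

def log_likelihood(features, model):
    """Sum per-feature log-likelihoods, assuming independent features."""
    total = 0.0
    for name, value in features.items():
        spec = model[name]
        if spec["kind"] == "discrete":
            # Discrete feature: look up a probability mass for the value.
            p = spec["pmf"].get(value, 0.0)
            if p <= 0.0:
                return float("-inf")
            total += math.log(p)
        else:
            # Continuous feature: Gaussian log-density (assumed form).
            mean, std = spec["mean"], spec["std"]
            z = (value - mean) / std
            total += -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))
    return total

# Hypothetical model for one character class.
model = {
    "loops": {"kind": "discrete", "pmf": {0: 0.7, 1: 0.2, 2: 0.1}},
    "aspect": {"kind": "continuous", "mean": 1.0, "std": 0.5},
}
ll = log_likelihood({"loops": 1, "aspect": 1.0}, model)
```

A feature taking only 2 to 4 values is scored here by a table lookup rather than by a density fitted to a handful of points.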