Computers accept human user input in various ways. One of the most common input devices is the keyboard. Additional types of input mechanisms include mice and other pointing devices. Although useful for many purposes, keyboards and mice (as well as other pointing devices) sometimes lack flexibility. For example, many persons find it easier to write, take notes, etc. with a pen and paper instead of a keyboard. Mice and other types of pointing devices do not generally provide a true substitute for pen and paper. Traditional input device limitations are even more acute with regard to East Asian languages. As used herein, “East Asian” includes, but is not limited to, written languages such as Japanese, Chinese and Korean. Written forms of these languages contain thousands of characters, and specialized keyboards for these languages can be cumbersome and require specialized training to properly use.
Electronic tablets or other types of electronic writing devices offer an attractive alternative to keyboards and mice. These devices typically include a stylus with which a user can write upon a display screen in a manner similar to using a pen and paper. A digitizer nested within the display converts movement of the stylus across the display into an “electronic ink” representation of the user's writing. The electronic ink is stored as coordinate values for a collection of points along the line(s) drawn by the user. Software may then be used to analyze the electronic ink to recognize characters, and then convert the electronic ink to Unicode, ASCII or other code values for what the user has written.
There are many handwriting recognition systems in use employing various algorithms to map handwritten data to characters. One such system is described in commonly-owned U.S. Pat. No. 5,729,629 ('629 patent), titled “Handwritten Symbol Recognizer,” which patent is incorporated by reference herein. The described recognizer is useful for, e.g., recognition of East Asian language characters. The recognizer implements template matching for characters written in multiple strokes, to map an input character to a Unicode or other value for the ink character. Each input stroke of a character is described by a five-dimensional feature vector representing the x and y coordinates of the stroke start and end points, together with a feature code corresponding to the overall shape of the stroke (e.g., vertical line, horizontal line, counterclockwise arc, etc.). The recognizer measures a Euclidian Vector Distance between each input stroke and a stroke of a stored reference character (or “prototype”). The database of prototypes is divided into multiple groupings (or “spaces”) based on the number of features in the prototype. For example, a 5-space contains prototypes having five features.
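The stroke representation and the grouping of prototypes into spaces can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the '629 patent: the names `Stroke`, `coord_distance` and `build_spaces` are hypothetical, the feature code is assumed to be a small integer, each stroke record is assumed to contribute one feature, and the distance shown is taken over the four coordinate components only, since how the feature code enters the distance is not described here.

```python
import math
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Stroke(NamedTuple):
    """Five-component stroke description (hypothetical layout)."""
    shape: int   # assumed integer feature code, e.g. 0 = horizontal line
    x1: float    # start point
    y1: float
    x2: float    # end point
    y2: float


def coord_distance(a: Stroke, b: Stroke) -> float:
    """Euclidean distance over the start/end coordinates only (assumption)."""
    return math.sqrt(
        (a.x1 - b.x1) ** 2 + (a.y1 - b.y1) ** 2
        + (a.x2 - b.x2) ** 2 + (a.y2 - b.y2) ** 2
    )


Prototype = Tuple[str, List[Stroke]]  # (character label, stroke list)


def build_spaces(prototypes: List[Prototype]) -> Dict[int, List[Prototype]]:
    """Group prototypes by feature count, so an n-space holds n-feature prototypes."""
    spaces: Dict[int, List[Prototype]] = defaultdict(list)
    for label, strokes in prototypes:
        spaces[len(strokes)].append((label, strokes))
    return spaces
```

With this grouping, an input character with five features would only need to be compared against the entries of `spaces[5]`.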
Another recognizer, which is similar in many respects to the recognizer described in the '629 patent, is described in commonly-owned U.S. Pat. No. 6,094,506 ('506 patent), titled “Automatic Generation of Probability Tables for Handwriting Recognition Systems,” which patent is also incorporated by reference herein. In that recognizer, each stroke of an input character is also described by a five-dimensional vector representing a feature code for the stroke and the x and y coordinates of stroke start and end points. The input character is then compared against every prototype in a database having the same number of strokes as the input character. To perform this comparison, a Shape Feature Probability Matrix (SFPM) is created in which each possible shape feature corresponds to a row and to a column. Each entry in the SFPM represents a probability that, for any two characters having $s$ strokes and having shape features $f_i$ and $f_j$ at position $p$ (where $f_i$ is the feature code for the input stroke, $f_j$ is the feature code for the prototype stroke and $p = 1, 2, \ldots, s$), the characters are the same. A Position Feature Probability Table (PFPT) is also generated. The PFPT is a one-dimensional array that contains one entry for each possible feature distance and is indexed by feature distance. The feature distance is calculated as $D = (x_j^{p1} - x_i^{p1})^2 + (x_j^{p2} - x_i^{p2})^2 + (y_j^{p1} - y_i^{p1})^2 + (y_j^{p2} - y_i^{p2})^2$, where $(x_i^{p1}, y_i^{p1})$ and $(x_i^{p2}, y_i^{p2})$ are the start and end points for stroke $p$ of the input ink, and the $j$-subscripted coordinates are the corresponding points of the prototype stroke. Each entry in the PFPT represents a probability that, for any two characters having $s$ strokes and a feature distance $D$ between strokes at the same position $p$, the characters are the same.
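As a worked illustration of the feature distance, the following sketch computes the sum of squared differences between the start and end points of one input stroke and one prototype stroke. The function name `feature_distance` and the tuple layout `(x1, y1, x2, y2)` are assumptions made for this example and are not taken from the '506 patent.

```python
from typing import Tuple

# Hypothetical layout: (x_start, y_start, x_end, y_end)
Endpoints = Tuple[int, int, int, int]


def feature_distance(ink: Endpoints, proto: Endpoints) -> int:
    """Sum of squared coordinate differences between two strokes."""
    xi1, yi1, xi2, yi2 = ink
    xj1, yj1, xj2, yj2 = proto
    return (
        (xj1 - xi1) ** 2 + (xj2 - xi2) ** 2
        + (yj1 - yi1) ** 2 + (yj2 - yi2) ** 2
    )


# Input stroke from (0, 0) to (1, 1); prototype stroke from (1, 2) to (3, 1).
# Squared differences: 1 (x start) + 4 (x end) + 4 (y start) + 0 (y end).
d = feature_distance((0, 0, 1, 1), (1, 2, 3, 1))  # d == 9
```

With integer (quantized) coordinates, the result is itself an integer and could serve directly as an index into a one-dimensional table such as the PFPT.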
During recognition, each input character is compared to each prototype by comparing the strokes of the input character to respective strokes of the prototype (i.e., the first stroke of the input character is compared against the first stroke of the prototype, then the second stroke of the input character is compared against the second stroke of the prototype, and so on). Using the SFPM, a first number is computed by summing the values obtained by indexing the matrix with the feature codes of the first input and prototype strokes, of the second input and prototype strokes, etc. Using the PFPT, a second number is computed by summing values indexed by the feature distances between the first input and prototype strokes, between the second input and prototype strokes, etc. A Match Probability value equals the sum of these first and second numbers. The prototype for which a comparison against the input character results in the highest probability of a match is considered the best match. Because, as described in the '506 patent, the SFPM and PFPT values are based on a negative logarithmic function of the probability, the lowest Match Probability value corresponds to the highest probability of a match.
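The stroke-by-stroke comparison described above might be sketched as follows. This is a simplified illustration under stated assumptions rather than the method of the '506 patent: the function names, the tuple layout `(feature code, x1, y1, x2, y2)`, the two-code feature set, and the toy `SFPM` and `PFPT` values are all made up for the example, and coordinates are assumed to be quantized to the range 0 to 3 so that every feature distance is a valid table index.

```python
from typing import List, Optional, Sequence, Tuple

Stroke = Tuple[int, int, int, int, int]  # (feature code, x1, y1, x2, y2)


def feature_distance(a: Stroke, b: Stroke) -> int:
    """Sum of squared differences of start and end coordinates."""
    _, ax1, ay1, ax2, ay2 = a
    _, bx1, by1, bx2, by2 = b
    return (bx1 - ax1) ** 2 + (bx2 - ax2) ** 2 + (by1 - ay1) ** 2 + (by2 - ay2) ** 2


def match_score(ink: Sequence[Stroke], proto: Sequence[Stroke],
                sfpm: Sequence[Sequence[float]], pfpt: Sequence[float]) -> float:
    """Shape cost plus position cost, paired stroke by stroke (lower is better)."""
    shape_cost = sum(sfpm[i[0]][p[0]] for i, p in zip(ink, proto))
    position_cost = sum(pfpt[feature_distance(i, p)] for i, p in zip(ink, proto))
    return shape_cost + position_cost


def best_match(ink: Sequence[Stroke],
               prototypes: List[Tuple[str, List[Stroke]]],
               sfpm: Sequence[Sequence[float]],
               pfpt: Sequence[float]) -> Optional[str]:
    """Return the label with the lowest score among same-stroke-count prototypes."""
    candidates = [(label, s) for label, s in prototypes if len(s) == len(ink)]
    if not candidates:
        return None
    return min(candidates, key=lambda c: match_score(ink, c[1], sfpm, pfpt))[0]


# Toy tables holding made-up negative-log-style costs (not real trained values).
SFPM = [[0.1, 2.0], [2.0, 0.1]]            # indexed [input code][prototype code]
PFPT = [0.05 * d for d in range(37)]       # max distance for coords in 0..3 is 36

ink = [(0, 0, 0, 3, 0), (1, 1, 0, 1, 3)]
prototypes = [
    ("P1", [(0, 0, 0, 3, 0), (1, 2, 0, 2, 3)]),  # same shapes, slightly shifted
    ("P2", [(1, 0, 0, 0, 3), (0, 0, 3, 3, 3)]),  # different shapes and positions
    ("P3", [(0, 0, 0, 3, 0)]),                   # different stroke count, skipped
]
result = best_match(ink, prototypes, SFPM, PFPT)
```

Because the tables store negative-log-style costs, the sketch selects the minimum score; a prototype with a different stroke count is never compared, which is the limitation that matters once strokes are joined in cursive writing.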
The recognizers described by the '629 and '506 patents work by matching an input ink character having s strokes only against prototypes having the same number of strokes, i.e., in the s space. For characters written in print form, this presents few problems. As used herein, “print” refers to a writing style in which a user attempts to create a character so as to mimic a standardized format, and is distinguished from machine-printed characters (e.g., typed, computer generated font, etc.). Although there are variations in the relative position and shape of strokes for a given handwritten printed character, different users generally print the character using the same number of strokes.
Challenges arise in connection with recognizing cursive handwriting. Often, a cursive representation of a particular character will connect two or more strokes into a single stroke. Strokes may also be skipped and/or rounded in cursive handwriting. In theory, a character written in s strokes in print form can be written in 1 to s strokes in cursive form. This is illustrated in FIG. 1, which shows the Simplified Chinese character having Unicode code point U+9752 (phonetically “qing,” meaning “green”). The character is shown in standard form on the left side of the figure, and in progressively more cursive variations toward the right side of the figure.
East Asian languages generally contain many characters that, when written in print, contain a large number of strokes. In cursive writing, these strokes are often joined together. But there are many different patterns in which strokes may be joined. To accurately recognize these various patterns, a large number of training samples would need to be collected. The number of stroke-connection patterns, however, is prohibitively large, such that data collection would be too expensive and training a recognizer would be too time consuming. Accordingly, techniques for segmenting cursive handwriting to more closely resemble print handwriting so that a handwriting recognizer, which has been trained mainly on print data, can process the segmented cursive writing would be desirable.