The present invention relates to the automatic recognition of handwritten Chinese characters, and in particular to a method and system for automatically segmenting and recognizing Chinese character strings continuously written by a user.
Current information processing systems that accept a user's handwritten character input normally comprise a pen-based input means composed of a writing pen and a writing pad. Such a pen-based input means requires that, when a user finishes writing a Chinese character, he click a button on the writing pen or writing pad so as to manually segment the handwritten Chinese character string. An automatic handwritten Chinese character recognition device can then directly recognize the manually segmented Chinese character string. However, the manual segmentation process interrupts the continuity of the user's handwriting. This handwriting mode therefore does not suit the user's handwriting habits.
IBM's ThinkScribe is a device integrating a handwriting digitizer with a traditional paper-based recording system. This device records a user's handwriting input as strokes with their associated time sequence and can reproduce the user's handwriting input according to the original time sequence. When users write Chinese characters on ThinkScribe, they usually write continuously, with little or no space between characters. Sometimes users even overlap strokes of adjacent characters or connect the last stroke of the preceding character with the first stroke of the following character. As a result, the characters must be segmented before they can be recognized, and this segmentation is itself a problem.
At present, there are no effective character segmentation methods. Existing handwritten Chinese character recognition techniques can only recognize isolated Chinese characters or handwritten Chinese character strings with large spaces between characters. The difficulties of automatically segmenting handwritten Chinese character strings lie in the following:
1) Many Chinese characters have separable components lined up from left to right. When a user writes quickly in a horizontal line from left to right, the distance between such components may be similar to the distance between two characters. In addition to this spatial confusion, the left and right parts of those characters are often themselves single characters, or may resemble other characters. The same holds for Chinese characters written in a vertical line, since many Chinese characters have separable components stacked from top to bottom.
2) In cursive writing, the end stroke of one character and the beginning stroke of the next character may not be cleanly separated from each other.
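The spatial confusion described in 1) can be illustrated with a minimal sketch. It assumes, purely for illustration, that each stroke is reduced to its horizontal extent and that a candidate boundary is any gap at least as wide as a threshold; the names Stroke, horizontal_gaps, candidate_boundaries and min_gap are hypothetical and do not come from the invention.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    """Horizontal extent of one stroke (hypothetical simplification)."""
    x_min: float
    x_max: float


def horizontal_gaps(strokes):
    """Gap between the right edge of everything written so far and the next stroke.

    A negative gap means the next stroke overlaps earlier strokes horizontally.
    """
    gaps = []
    right_edge = strokes[0].x_max
    for nxt in strokes[1:]:
        gaps.append(nxt.x_min - right_edge)
        right_edge = max(right_edge, nxt.x_max)
    return gaps


def candidate_boundaries(strokes, min_gap):
    """Indices i such that a character boundary may lie before stroke i."""
    return [i + 1 for i, g in enumerate(horizontal_gaps(strokes)) if g >= min_gap]


# Made-up data: strokes 0-3 form one left-right character, stroke 4 starts the next.
example = [Stroke(0, 10), Stroke(2, 9), Stroke(13, 20), Stroke(14, 21), Stroke(24, 30)]
print(horizontal_gaps(example))          # [-8, 3, -6, 3]
print(candidate_boundaries(example, 2))  # [2, 4]
```

In this made-up data the gap inside the first character (before stroke 2) equals the gap between the two characters (before stroke 4), so a gap threshold alone cannot tell them apart; both positions have to be kept as candidates.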
Thus, overcoming the above difficulties and providing a method for automatically segmenting Chinese character strings continuously written by a user is the basis for realizing the automatic recognition of continuously handwritten Chinese character strings.
The method according to the present invention for automatically segmenting and recognizing handwritten Chinese character strings takes advantage of information derived from different sources, such as writing habits, geometric characteristics of Chinese character strings, time sequence information and language models at different levels, to realize the automatic segmentation and recognition of continuously handwritten Chinese characters.
The method according to the present invention for automatically segmenting and recognizing handwritten Chinese character strings comprises the following steps:
creating a geometry model which describes the geometric characteristics of stroke sequences of handwritten Chinese character strings and a language model which describes the dependency among Chinese characters or words;
finding all of the potential segmentation schemes in Chinese character strings continuously written by a user, based on the timing information associated with the strokes and said geometry model;
recognizing the groups of strokes as defined by each of the potential segmentation schemes and computing the probability characterizing the exactness of the recognition result;
correcting the probability characterizing the exactness of the recognition result by said language model; and
selecting the recognition result having the maximum probability value and the corresponding segmentation scheme as the segmentation and recognition result of the Chinese character strings continuously written by a user.
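The steps above can be pictured with a minimal sketch. It makes several assumptions that the text does not state: candidate boundaries are already given as stroke indices, the recognizer is a callable returning a character and a probability for a stroke range, the language model is a bigram table, and the "correction" is done by adding log-probabilities. All function names and the toy tables are hypothetical.

```python
import math
from itertools import combinations


def segmentation_schemes(n_strokes, boundaries):
    """Yield every scheme as a list of (start, end) stroke index ranges."""
    for k in range(len(boundaries) + 1):
        for chosen in combinations(boundaries, k):
            cuts = [0, *chosen, n_strokes]
            yield list(zip(cuts, cuts[1:]))


def best_reading(n_strokes, boundaries, recognize, bigram_logp):
    """Return (score, text, scheme) with the highest combined log-probability."""
    best = None
    for scheme in segmentation_schemes(n_strokes, boundaries):
        chars, score, prev = [], 0.0, "<s>"
        for start, end in scheme:
            char, prob = recognize(start, end)
            if prob <= 0.0:
                break  # this stroke group is unreadable; drop the scheme
            # Recognition score corrected by the language model (assumed form).
            score += math.log(prob) + bigram_logp(prev, char)
            chars.append(char)
            prev = char
        else:
            if best is None or score > best[0]:
                best = (score, "".join(chars), scheme)
    return best


# Toy tables, invented for illustration only.
RECOGNIZER = {(0, 4): ("好", 0.6), (0, 2): ("女", 0.7), (2, 4): ("子", 0.7)}
BIGRAMS = {("<s>", "好"): 0.5, ("<s>", "女"): 0.2, ("女", "子"): 0.1}


def toy_recognize(start, end):
    return RECOGNIZER.get((start, end), ("?", 0.0))


def toy_bigram_logp(prev, char):
    return math.log(BIGRAMS.get((prev, char), 1e-6))


result = best_reading(4, [2], toy_recognize, toy_bigram_logp)
print(result[1], result[2])  # 好 [(0, 4)]
```

Enumerating every subset of boundaries grows exponentially with the number of candidates; the sketch does so only to keep the correspondence with the listed steps visible, and a practical implementation would more likely search the schemes with dynamic programming.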
The system according to the present invention for automatically segmenting and recognizing handwritten Chinese character strings comprises:
input means, for accepting Chinese character strings continuously written by a user, and recording the user input in strokes and the associated timing information;
model storage means, for storing a geometry model which describes the geometric characteristics of stroke sequences in handwritten Chinese character strings and a language model which describes the dependency among Chinese characters or words;
segmenting means, for finding all of the potential segmentation schemes in the Chinese character strings continuously written by a user based on said associated timing information and said geometry model;
recognizing means, for recognizing the groups of strokes as defined by each of the potential segmentation schemes and computing the probability characterizing the exactness of the recognition result; and
arbitrating means, for correcting the probability characterizing the exactness of the recognition result by said language model; and selecting the recognition result and the corresponding segmentation scheme having the maximum probability value as the segmentation and recognition result of the Chinese character strings continuously written by a user.
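As an illustration of the kind of record the input means listed above would hand to the segmenting means, the following minimal sketch stores each stroke with its pen-down and pen-up times. The class names, field names and the pen_up callback are assumptions made for this example and do not describe any actual device format.

```python
from dataclasses import dataclass, field


@dataclass
class Stroke:
    points: list    # (x, y) pen samples, in the order they were captured
    t_start: float  # pen-down time in seconds
    t_end: float    # pen-up time in seconds


@dataclass
class StrokeRecorder:
    """Hypothetical input means: keeps strokes together with their timing."""
    strokes: list = field(default_factory=list)

    def pen_up(self, points, t_start, t_end):
        """Store one finished stroke."""
        self.strokes.append(Stroke(list(points), t_start, t_end))

    def in_writing_order(self):
        """Strokes sorted by pen-down time, i.e. the original time sequence."""
        return sorted(self.strokes, key=lambda s: s.t_start)

    def pauses(self):
        """Time between the end of each stroke and the start of the next."""
        ordered = self.in_writing_order()
        return [b.t_start - a.t_end for a, b in zip(ordered, ordered[1:])]


recorder = StrokeRecorder()
recorder.pen_up([(0, 0), (1, 1)], 0.0, 0.5)
recorder.pen_up([(5, 0), (6, 1)], 2.0, 2.5)   # stored out of order on purpose
recorder.pen_up([(2, 0), (3, 1)], 0.75, 1.0)
print(recorder.pauses())  # [0.25, 1.0]
```

A longer pause between strokes is one possible timing cue for a character boundary, but how timing is actually weighed against the geometry model is not specified in this summary.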