The present invention relates generally to handwriting processing. More particularly, the invention relates to an improved ink matching technique for cursive Chinese characters. The ink matching technique addresses the fact that there is a very large number of characters in the written Chinese language, and that conventional matching techniques, such as would be used to match letters in the English alphabet, are simply not powerful enough for retrieval purposes.
Ink matching works on the basic assumption that electronic ink can serve as first class data that can be matched without attempting to recognize it. In other words, a user may draw a cursive character and the electronic ink data comprising the temporal and spatial information associated with the cursive strokes is used in a searching or matching operation, without the need to interpret what characters those strokes represent.
The improvement described here lies in a powerful radical extraction technique, with associated improvements in dynamic programming matching. The invention is valuable for all Chinese languages, as well as other languages in which Chinese characters are used. Chinese characters are in popular use today throughout much of Asia, including China, Taiwan, Japan, Korea, and elsewhere.
Chinese characters trace their roots to one of humanity's earliest forms of writing: the pictograph. A single character, which may be made up of one or more sub-characters known as radicals, represents a single word, thought or concept. Thus, unlike the alphabet letters used to spell words in the English language, Chinese characters are more compact. Each represents an entire word, thought or concept in a single character.
While the Chinese character represents a compact and highly efficient form of communication, it has posed problems in automation systems. Entering Chinese characters into the computer is a bottleneck, which some have proposed to solve by using the digitizing tablet instead of a keyboard. The tablet allows a user to enter his or her handwriting into the computer, and handwriting recognition techniques convert the user's handwriting into recognized text which may then be input into the computer. In the field of handwritten Chinese character recognition, on-line recognition has the advantage of capturing the temporal order of the input points and strokes as they are drawn. Thus, the on-line recognition system can distinguish between a line drawn from top down and a line drawn from bottom up, by the temporal data generated as the lines are drawn.
While great strides have been made in handwritten Chinese character recognition, the complexity of the problem has proven greater than originally anticipated. This is in no small part due to the large number of characters that are needed in a lexicon designed for normal business use. For example, in the written Chinese language, there are, on average, three thousand to five thousand characters in daily use. In the written Japanese language, the number of Chinese characters, or "kanji" characters, is somewhat smaller. Nevertheless, one still needs to understand several thousand characters in order to read a Japanese newspaper. A further difficulty of handwritten character recognition is that handwriting style varies greatly from writer to writer, and people more often than not write cursively.
Recently, the ink matching technique has shown promise in certain applications, such as searching handwritten annotations for document retrieval purposes. For further background, see U.S. Pat. No. 5,832,474, "Document Search and Retrieval System with Partial Match Searching of User-Drawn Annotations." In these applications, ink matching can be advantageous over traditional character recognition techniques because it handles various styles of handwriting, particularly cursive handwriting. More recently, semantic matching techniques were introduced, where the character is subdivided into its component radicals and those radicals are then pre-searched to reduce the character matching problem to one of a smaller size. By performing a partial match based on the extracted radical, the subsequent character matching process does not need to examine those characters that do not include the matched radical.
While semantic matching based on reference radicals is a very promising technique, the success of the technique hinges upon the system's ability to extract the radical from the complete character. Radical extraction is a most challenging problem. Because the hand drawn Chinese character is usually drawn free form, without any grid lines, the location of the radical within the character can vary considerably each time the character is drawn. Also, the radical may inadvertently intersect with other parts of the character, making the radical difficult to identify as a separate element within the character. Moreover, when writing quickly, the user may allow the pen to draw inadvertent connecting strokes that join the ending stroke of one part of the character with the beginning stroke of the next part. Although the connecting strokes are not actually part of the character, they are nevertheless present in the stroke data, simply because the user may not have lifted the pen or stylus between drawing individual parts of the character.
Apart from the above issues there is the additional issue of dynamic programming. Dynamic programming is a pattern matching technique that assesses insertion costs, deletion costs and substitution costs in determining whether one pattern (such as a stroke within a radical) matches another pattern. Conventionally, stroke insertion costs and stroke deletion costs are computed in direct proportion to the length of the strokes. Stroke substitution costs are more complicated because they involve measuring the shape similarity between two strokes. Using conventional techniques, a dynamic programming procedure is used to obtain a set of point pairs between the two strokes being considered for substitution. The substitution cost is measured as the sum of the Euclidean distances between the two points of each pair. Before the substitution cost is computed, each stroke is temporarily displaced so that the top left corner of the bounding box of that stroke coincides with the origin (0,0).
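The conventional cost scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular prior system: the function names (`stroke_length`, `shift_to_origin`, `substitution_cost`, `match_cost`) are hypothetical, strokes are assumed to be lists of (x, y) points in screen coordinates with y increasing downward, the proportionality constant for insertion and deletion is taken to be 1, and the point pairing is assumed to be a standard dynamic time warping alignment.

```python
import math

def stroke_length(stroke):
    """Arc length of a stroke given as a list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(stroke, stroke[1:]))

def shift_to_origin(stroke):
    """Translate a stroke so the top left corner of its bounding box is (0, 0).

    Assumes screen coordinates, so the top left corner is (min x, min y).
    """
    min_x = min(x for x, _ in stroke)
    min_y = min(y for _, y in stroke)
    return [(x - min_x, y - min_y) for x, y in stroke]

def substitution_cost(a, b):
    """Sum of Euclidean distances over point pairs chosen by an inner DP.

    The pairing here is a plain dynamic time warping alignment (an assumption).
    """
    a, b = shift_to_origin(a), shift_to_origin(b)
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def match_cost(test, ref):
    """Stroke-level edit distance between two stroke sequences.

    Insertion and deletion cost the length of the stroke involved;
    substitution costs the point-wise Euclidean measure above.
    """
    n, m = len(test), len(ref)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + stroke_length(test[i - 1])
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + stroke_length(ref[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + stroke_length(test[i - 1]),   # extra test stroke
                cost[i][j - 1] + stroke_length(ref[j - 1]),    # missing ref stroke
                cost[i - 1][j - 1]
                + substitution_cost(test[i - 1], ref[j - 1]),  # stroke vs stroke
            )
    return cost[n][m]

horizontal = [(0.0, 0.0), (10.0, 0.0)]
vertical = [(5.0, 5.0), (5.0, 15.0)]
print(match_cost([horizontal], [horizontal]))
print(match_cost([horizontal, vertical], [horizontal]))
```

Note that the inner procedure runs once for every stroke pair the outer procedure considers, which is one reason this style of matching is time consuming.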
The conventional Euclidean-based method for calculating substitution costs has two disadvantages. First, the dynamic programming procedure is time consuming. Second, errors induced using the Euclidean distance technique are cumulative. After the reference stroke and stroke being tested are each displaced so their top left corners overlap the origin, a point by point distance analysis is performed. If the basic shapes of the respective strokes are relatively the same, the Euclidean distance technique can give a useful figure of merit. However, if the respective strokes are essentially the same in shape except for their top left corners, the Euclidean distance technique fails. By aligning the top left corners of the two strokes, the remainder of the strokes find themselves separated at a large Euclidean distance, resulting in an incorrect assessment that the strokes are not similar to one another in shape.
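The alignment problem can be illustrated with a small numeric example. This is only a sketch with made-up coordinates: the two strokes are identical except for their first point, and for simplicity the points are paired by index rather than by a dynamic programming alignment. The helper names `shift_to_origin` and `pointwise_cost` are hypothetical.

```python
import math

def shift_to_origin(stroke):
    """Translate a stroke so the top left corner of its bounding box is (0, 0)."""
    min_x = min(x for x, _ in stroke)
    min_y = min(y for _, y in stroke)
    return [(x - min_x, y - min_y) for x, y in stroke]

def pointwise_cost(a, b):
    """Sum of Euclidean distances between points paired by index (a simplification)."""
    return sum(math.dist(p, q) for p, q in zip(a, b))

# A straight downward stroke sampled at 11 points.
reference = [(0.0, 10.0 * i) for i in range(11)]
# The same stroke, except that it begins with a small hook to the left.
hooked = [(-30.0, 0.0)] + reference[1:]

# Actual discrepancy: only the first point differs.
raw = pointwise_cost(hooked, reference)
# Conventional measure: both strokes are shifted to the origin first.
aligned = pointwise_cost(shift_to_origin(hooked), shift_to_origin(reference))

print(raw)      # 30.0
print(aligned)  # 300.0
```

The hook moves the bounding box corner of the test stroke, so after alignment every one of the ten matching points is offset by the width of the hook. A single local difference is thereby counted ten times over.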
The present invention provides a greatly improved radical extraction technique that will significantly improve the semantic matching performance. As will be more fully discussed herein, the technique employs a character normalization procedure and an improved stroke segmentation procedure that removes a number of problematic artifacts that can confound the dynamic programming procedure.
The technique further employs a new dynamic programming method in which substitution costs are more robustly assessed using tangent vectors, center relationship assessments and starting point-ending point assessments. The dynamic programming procedure further uses a radical categorization data structure that tags the starting and ending radicals for better matching. In addition, the dynamic programming procedure employs a technique whereby dynamic programming costs are normalized using the length of the reference radical, and these normalized costs are then interactively adjusted to remove location dissimilarities based on a ranking of the dynamic programming costs.
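To give a sense of why tangent vectors can be more robust than point positions, the following sketch compares two strokes by the directions of their successive segments. It is an illustration only: this summary does not give the actual cost formula, so the functions `tangent_angles`, `angle_diff` and `tangent_cost` are hypothetical, the strokes are assumed to be sampled with the same number of points, and the center relationship and starting point-ending point assessments are not modeled.

```python
import math

def tangent_angles(stroke):
    """Direction (in radians) of each segment between consecutive points."""
    return [math.atan2(y1 - y0, x1 - x0)
            for (x0, y0), (x1, y1) in zip(stroke, stroke[1:])]

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def tangent_cost(a, b):
    """Mean angular difference between corresponding segments (hypothetical measure)."""
    ta, tb = tangent_angles(a), tangent_angles(b)
    n = min(len(ta), len(tb))
    if n == 0:
        return 0.0
    return sum(angle_diff(x, y) for x, y in zip(ta, tb)) / n

reference = [(0.0, 10.0 * i) for i in range(11)]          # straight downward stroke
moved = [(x + 40.0, y + 25.0) for x, y in reference]      # same stroke, elsewhere
hooked = [(-30.0, 0.0)] + reference[1:]                   # same stroke with a hook
sideways = [(10.0 * i, 0.0) for i in range(11)]           # a genuinely different stroke

print(tangent_cost(reference, moved))
print(tangent_cost(reference, hooked))
print(tangent_cost(reference, sideways))
```

Because segment directions do not depend on where a stroke is drawn, no displacement to the origin is needed, and a local deviation such as the hook affects only the segments where it occurs.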
The result of the improved dynamic programming procedure is an output of the best or n-best extracted radicals. The best extracted radical (or the n-best extracted radicals) may then be used to reduce the size of the lexicon for a subsequent full character match. While the improvements in dynamic programming are particularly useful in radical extraction, they may also be used in the subsequent full character extraction or matching operations.
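The lexicon reduction step can be pictured with a short sketch. The data structure and the function name `prune_lexicon` are assumptions made for illustration: the lexicon is modeled as a mapping from each character to the set of radicals it contains, and only a handful of characters are listed.

```python
def prune_lexicon(lexicon, n_best_radicals):
    """Keep only the characters that contain at least one of the extracted radicals."""
    wanted = set(n_best_radicals)
    return {char for char, radicals in lexicon.items() if wanted & radicals}

# Toy lexicon: character -> radicals it contains (illustrative subset only).
lexicon = {
    "河": {"氵"},
    "海": {"氵"},
    "湖": {"氵"},
    "林": {"木"},
    "松": {"木"},
}

# Suppose radical extraction returned the water radical as its best candidate.
candidates = prune_lexicon(lexicon, ["氵"])
print(sorted(candidates))
```

Only the surviving candidates need to be passed to the full character match, so the cost of that later stage scales with the size of the pruned set rather than with the full lexicon.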