The present invention relates generally to a method for processing handwritten data for training a handwriting recognizer, and more particularly to an automatic method for scoring and clustering prototypes of stroke-based handwriting data.
A handwriting recognizer is a combination of computer hardware and software that accepts handwritten data as input and attempts to match handwritten symbols with known letters and words. Before this can occur, however, the handwriting must be transformed into data that the recognizer can understand and manipulate. This is known as front-end processing of the handwriting data.
In front-end processing, a user must first write on a digitizing tablet or similar hardware device using a special stylus or pen so that the handwriting may be electronically recorded. The handwriting may be described as a time sequence of strokes, where a stroke is the writing from the time the pen is placed down on the tablet until the pen is lifted. Each stroke is recorded as a time series of x- and y-coordinates, called sample points, that represent the path of the pen across the tablet. FIG. 1 is a graphical example of sample points generated from a digitized handwriting sample. The stars in FIG. 1 indicate sample points of the pen taken at uniform time intervals.
Digitized handwriting samples can be characterized as a type of signal that has observable properties. The strokes of a particular sample may vary in both their static and dynamic properties. Static variation occurs in stroke size and shape, while dynamic variation occurs in the number of strokes in a sample and the order in which they are recorded. Handwriting variability stems from the fact that different people, and even the same person at different times, write any given character, symbol, or letter in a variety of ways. The degree of variation depends on the style and speed of writing, with hasty writing usually showing greater variation. It is this variation in handwriting which must be taken into account by a handwriting recognition system.
A common handwriting recognition method analyzes the variation in handwriting by partitioning handwritten words into segments, where a segment is a portion of a stroke. Sequences of segments are then used to identify letters by analyzing the static and dynamic properties of the segments. Application Ser. No. 08/204,031 in the name of the same inventors and the same assignee as the present application, discloses a front-end processing method for extracting both static and dynamic properties from a handwriting sample using non-uniform segmentation and feature extraction.
As disclosed in application Ser. No. 08/204,031, the segmentation process partitions strokes into a series of separate segments by defining a segment as the trajectory resulting from a complete upstroke or downstroke of the pen. Stated another way, segment endpoints occur where the pen touches down on the writing surface, leaves the writing surface, or changes vertical direction during writing. FIG. 2 is a graphical example of segmentation and feature extraction performed on the word "act" which is written as a single stroke. Points 20, 22, and 24 shown on the letter "a" are examples of segment endpoints in the stroke forming the word "act". Segment endpoint 20 is the initial starting point of the pen during the first upstroke formed in the letter "a"; segment endpoint 22 is the transition point between the first upstroke and a downstroke; and segment endpoint 24 is the transition point between the downstroke and a second upstroke.
A segment is thus defined as a set of coordinates falling between a pair of segment endpoints. An example of one segment comprising the letter "a" is segment 30, which is described by a list of those coordinates in the letter "a" located between segment endpoint 22 and segment endpoint 24.
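The segmentation rule described above can be illustrated with a minimal Python sketch. It is not the implementation of application Ser. No. 08/204,031; the function names, the representation of a stroke as a list of (x, y) tuples, and the handling of flat steps (they keep the previous vertical direction) are assumptions made only for illustration.

```python
def segment_endpoints(stroke):
    """Return indices of the sample points that start or end a segment.

    `stroke` is a list of (x, y) sample points for one pen-down to pen-up
    stroke. Endpoints are the first point, the last point, and every point
    at which the vertical direction of travel reverses.
    """
    if len(stroke) < 2:
        return list(range(len(stroke)))
    endpoints = [0]  # pen touches down
    prev_dir = 0     # +1 = y increasing, -1 = y decreasing, 0 = not yet known
    for i in range(1, len(stroke)):
        dy = stroke[i][1] - stroke[i - 1][1]
        direction = (dy > 0) - (dy < 0)
        if direction == 0:
            continue  # assumption: a flat step keeps the previous direction
        if prev_dir != 0 and direction != prev_dir:
            endpoints.append(i - 1)  # the previous sample is the turning point
        prev_dir = direction
    endpoints.append(len(stroke) - 1)  # pen leaves the surface
    return endpoints


def split_into_segments(stroke):
    """Split a stroke into segments; adjacent segments share an endpoint."""
    ends = segment_endpoints(stroke)
    return [stroke[a:b + 1] for a, b in zip(ends, ends[1:])]


# Hypothetical stroke: y rises, falls, then rises again.
stroke = [(0, 0), (1, 2), (2, 4), (3, 3), (4, 1), (5, 2), (6, 5)]
print(segment_endpoints(stroke))         # [0, 2, 4, 6]
print(len(split_into_segments(stroke)))  # 3
```

In this toy stroke the three segments play the roles of the first upstroke, the downstroke, and the second upstroke described for the letter "a".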
Feature extraction refers to the process where static properties, called features, are extracted from the coordinate data of each segment in the stroke. Examples of features extracted from a segment include the net distance between the endpoints of the segment in the x-direction, and the net distance between the endpoints of the segment in the y-direction, shown by ΔX and ΔY in FIG. 2. Other features extracted from the segment 30 include the coefficients of a third-order polynomial fitted separately to the x- and y-sample points contained in the segment 30. These coefficients provide information regarding the curvature of the segment 30.
The value of each feature extracted from the segment 30 is stored in a vector, called a feature vector. A feature vector is mathematically represented as F(i) = [f_i1, f_i2, ..., f_ip], where F stands for feature vector, f is a feature value, i is the number of the current segment, and p is the number of features extracted per segment. The number of features extracted per segment is termed the "dimensionality" of the feature vector. For example, if six features are extracted from the segment, the resulting feature vector exists in a six-dimensional feature space. The output of the feature extraction process for a given handwriting sample is a set of feature vectors, where each feature vector corresponds to a segment in the sample. Each set of feature vectors is then used as input to a handwriting recognizer for recognition.
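As a rough illustration of how one feature vector per segment might be produced, the following Python sketch computes only the two endpoint features named above (the net x- and y-distances). The polynomial-coefficient features are deliberately omitted to keep the sketch short, so here p = 2; the function names and the sample coordinates are hypothetical.

```python
def extract_features(segment):
    """Return [delta_x, delta_y] for one segment.

    `segment` is a list of (x, y) sample points. Only the two net-distance
    features are computed; the polynomial coefficients are left out.
    """
    (x_start, y_start), (x_end, y_end) = segment[0], segment[-1]
    return [x_end - x_start, y_end - y_start]


def feature_vectors(segments):
    """Return one feature vector F(i) per segment, in writing order."""
    return [extract_features(segment) for segment in segments]


# Hypothetical segments: an upstroke, a downstroke, and a second upstroke.
segments = [
    [(0, 0), (1, 2), (2, 4)],
    [(2, 4), (3, 3), (4, 1)],
    [(4, 1), (5, 2), (6, 5)],
]
vectors = feature_vectors(segments)
print(vectors)          # [[2, 4], [2, -3], [2, 4]]
print(len(vectors[0]))  # dimensionality p = 2 in this sketch
```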
Given an observed set of feature values, the goal of a handwriting recognizer is to determine the most likely character string or letter corresponding to those feature values. One approach to achieve this goal is the use of probabilistic models to characterize statistical properties of a particular signal. The most popular stochastic approach today in handwriting recognition is Hidden Markov Modelling.
In Hidden Markov Modelling, each letter in the alphabet is modeled statistically by a single Hidden Markov Model (HMM). During recognition, the observed set of feature vectors of the letter to be recognized is input to the set of HMMs. Each HMM then calculates the probability that a handwritten version of its corresponding letter could have produced the sequence of feature vectors generated. The letter identified during the recognition process is the letter whose HMM yields the highest probability of having produced the observed sequence of feature vectors.
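The selection step described above, in which the letter with the highest-scoring model is chosen, can be sketched as follows. The scoring functions are toy stand-ins and not real HMMs: each one simply prefers a particular number of feature vectors. All names and values are assumptions chosen for illustration.

```python
def recognize(letter_models, observed_vectors):
    """Return (best_letter, scores), where best_letter has the highest score.

    `letter_models` maps a letter to a callable that returns the probability
    (here: any non-negative score) of the observed feature-vector sequence.
    """
    scores = {letter: model(observed_vectors)
              for letter, model in letter_models.items()}
    best_letter = max(scores, key=scores.get)
    return best_letter, scores


def make_toy_model(expected_segments):
    """Hypothetical stand-in for a letter HMM (not a real likelihood)."""
    def score(observed_vectors):
        return 1.0 / (1 + abs(len(observed_vectors) - expected_segments))
    return score


letter_models = {
    "a": make_toy_model(4),
    "c": make_toy_model(2),
    "t": make_toy_model(3),
}
observed = [[2, 4], [2, -3], [2, 4], [1, -2]]  # four hypothetical feature vectors
letter, scores = recognize(letter_models, observed)
print(letter)  # a
```

In a real recognizer each callable would be replaced by the probability computed by the corresponding letter HMM.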
FIG. 3 graphically shows the form of a letter-specific HMM 32. The HMM 32 may be described at any time as being in one of a set of x distinct states, S1, S2, . . . , Sx (where x=3 in FIG. 3). Each state of the HMM 32 is associated with one or more input feature vectors of a particular handwriting sample. Since the observable properties of handwritten data vary over time as a writer forms strokes, the HMM 32 also moves from one state to another over specified time intervals. The HMM 32 is called a left-right model because the underlying state sequence associated with the model has the property that as time increases the state index increases or stays the same, or, graphically, the states proceed from left to right.
The changes of state in the HMM 32 are determined by a set of transition probabilities associated with each state. Transition probabilities are the probabilities that, for a given time interval, the model will: stay in the same state, shown by the loop arrows 34a, 34b, and 34c; transition from one state to the next, shown by the arrows 36a and 36b; or by-pass the next state in sequence, shown by the arrows 38a and 38b. The total of all probabilities of the transitions from a particular state equals one. The probability of a given model producing the observed sequence of feature vectors, F1 to Fn, is obtained by multiplying the probabilities of the transitions associated with the trajectory. In addition to transition probabilities, each state S1, S2, and S3 has associated with it statistical information relating to the distribution of feature vectors.
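The transition structure of a three-state left-right model can be sketched in Python as below. The probability values are invented for illustration, the "end" marker standing for leaving the letter is an assumption, and the sketch covers only the transition part of the calculation; the per-state feature statistics mentioned above are not modeled.

```python
import math

# Hypothetical transition probabilities for a three-state left-right model.
# Each state may loop, advance to the next state, or skip one state ahead.
TRANSITIONS = {
    "S1": {"S1": 0.5, "S2": 0.25, "S3": 0.25},
    "S2": {"S2": 0.5, "S3": 0.25, "end": 0.25},
    "S3": {"S3": 0.5, "end": 0.5},
}


def rows_sum_to_one(transitions):
    """Check that the outgoing probabilities of every state total one."""
    return all(math.isclose(sum(row.values()), 1.0)
               for row in transitions.values())


def path_probability(transitions, path):
    """Multiply the transition probabilities along a state path.

    Transitions that the model does not define (for example, moving
    backwards from S2 to S1) contribute a probability of zero.
    """
    probability = 1.0
    for source, target in zip(path, path[1:]):
        probability *= transitions.get(source, {}).get(target, 0.0)
    return probability


print(rows_sum_to_one(TRANSITIONS))                                  # True
print(path_probability(TRANSITIONS, ["S1", "S1", "S2", "S3", "end"]))  # 0.015625
print(path_probability(TRANSITIONS, ["S1", "S2", "S1"]))             # 0.0
```

The zero result for the last path reflects the left-right property: no transition leads back to an earlier state.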
As stated above, each state in the HMM 32 is associated with one or more observed feature vectors. The number of feature vectors mapped to each state equals the total number of feature vectors observed divided by the number of states in the HMM. For example, if six segments were generated from a handwritten letter, and the three-state HMM 32 is used to model the letter, then each state S1, S2, and S3 would correspond to two feature vectors (6 feature vectors / 3 states). The first two feature vectors from the occurrence would map to the first state S1; the second two feature vectors would map to the second state S2; and the last two feature vectors would map to the third state S3.
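The even mapping of feature vectors to states in the example above can be written as a short Python sketch. The function name is hypothetical, and because the text does not say what happens when the count is not evenly divisible, the sketch simply rejects that case rather than guessing.

```python
def map_vectors_to_states(feature_vectors, num_states):
    """Assign consecutive, equally sized runs of feature vectors to states.

    Returns a list with one entry per state; entry k holds the feature
    vectors mapped to state S(k+1).
    """
    per_state, remainder = divmod(len(feature_vectors), num_states)
    if remainder:
        # Assumption: uneven splits are outside the scope of this sketch.
        raise ValueError("feature vectors do not divide evenly among states")
    return [feature_vectors[k * per_state:(k + 1) * per_state]
            for k in range(num_states)]


# Six feature vectors (represented by labels) and a three-state model.
mapping = map_vectors_to_states(["F1", "F2", "F3", "F4", "F5", "F6"], 3)
print(mapping)  # [['F1', 'F2'], ['F3', 'F4'], ['F5', 'F6']]
```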
A beginning state i and an end state f of the HMM 32 are not associated with any feature vectors. Arrow 31 represents the transition from the start of the letter, state i, to the first state S1, and arrow 39 represents the transition from the last state S3 to the end of the letter, state f. For more information on the mathematical algorithms behind HMMs, see Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, IEEE, 1989.
For an HMM to become useful in the recognition process, the statistics used in the probability calculations must first be compiled. This is an iterative process called training in which the statistical properties of feature vectors obtained from samples of handwriting, called training data, are analyzed. To obtain reliable statistics for the feature data, the training data must contain a large vocabulary of words written by a large pool of writers to ensure a representative set of feature vectors.
Before feature data is extracted from the training set, the handwriting must be visually inspected to ensure the quality of the training set. Only accurate representations of a particular letter are allowed to remain in the training set, while sloppy or non-representative letters are discarded. Otherwise the statistics generated from the samples would become distorted, resulting in poor recognition. The process of inspecting hundreds or thousands of samples of handwriting is time-consuming, tedious and prone to error when done by visual inspection.
In addition, the current practice of generating one HMM for each letter in the alphabet may be insufficient for proper recognition because different occurrences of a given letter vary widely, for instance as a result of different styles of writing the letter. A given file of stroke-based handwriting data will contain multiple occurrences of a particular letter, such as the letter "a". Since each occurrence of the letter has different properties, each occurrence will generate a different set of feature vectors. As disclosed in application Ser. No. 08/204,031, multiple vector quantization may then be performed on each feature vector to determine the distribution of the feature vectors in the multidimensional feature space. Each feature vector, corresponding to each segment, is characterized as occupying a single point in the feature space, as shown in FIG. 4A.
FIG. 4A is a graphical representation of a three-dimensional feature space where the four feature vectors P1, P2, P3, and P4, correspond to four segments of a letter. In multiple vector quantization, the distribution of the feature vectors in space corresponding to all occurrences of a letter is then used to generate an HMM for that letter, based on statistical properties of the distribution.
One disadvantage of characterizing feature vectors as fixed points in space for the generation of HMMs is that the feature vectors corresponding to all occurrences of a particular letter may not occupy the feature space in a uniform manner. Certain regions of the space corresponding to a commonly written style of the letter will be highly occupied, and other regions may be sparsely occupied. Consequently, the statistics relating to the sparsely occupied regions of the letter space will not be adequately represented by the HMM. Since the resulting HMM represents an average between the sparse regions and the dense regions, a loss in resolution of the data will occur. Therefore, the use of only one HMM to statistically model all possible occurrences of a letter may be inadequate for accurate recognition.
A possible solution may be to partition the data space of a letter into subcategories based on upper case occurrences of a letter, and lower case occurrences of the letter. In this intuitive approach, the feature vectors corresponding to upper case letters would be clustered into one group, while the feature vectors corresponding to lower case letters would be clustered into a second group. A separate HMM could then be generated to model the behavior of the respective groups. However, the partitioning of the data space to derive the two separate subcategories would be done based on human concepts of how the data space should be partitioned, rather than based on properties inherent in the feature data.
By definition, a subcategory of a letter denotes similarly written occurrences of the letter which produce similar feature vectors. If every occurrence of a letter generated the same number of segments, then computing the similarity between the occurrences is straightforward, as the following example illustrates.
Assume an occurrence of the letter "a" generated four segments and six features are extracted from each segment. This would result in four feature vectors, each containing six values. Referring back to FIG. 4A, these four feature vectors, shown as P1, P2, P3, and P4, could then together be characterized as a single fixed point in a 24-dimensional space (4 segments × 6 features). Assuming a second occurrence of the letter "a" also generated four segments and corresponding feature vectors, the similarity between the two occurrences can be calculated by measuring the Euclidean distance between the two sets of vectors in the 24-dimensional space.
However, based on differences in writing styles, writing speeds, etc., it is known that occurrences of the same letter produce varying numbers of segments. When occurrences of a letter do not have the same number of segments, the distance between the corresponding feature vectors cannot easily be computed due to the different dimensionalities of the data. For example, assume a third occurrence of the letter "a" produces seven segments with corresponding feature vectors. This occurrence of the letter "a" occupies a 42-dimensional space (7 segments × 6 features), in contrast with the first two occurrences of the letter "a", described above, which occupy a 24-dimensional space. Given occurrences of a letter having differing numbers of feature vectors, a difficulty lies in determining their similarity.
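The fixed-point comparison and its limitation can be illustrated with a small Python sketch. The occurrences use made-up feature values, and the function names are hypothetical. Each occurrence is flattened into one long vector, so four segments with six features give 24 values and seven segments give 42.

```python
import math


def flatten(occurrence):
    """Concatenate the feature vectors of one letter occurrence."""
    return [value for vector in occurrence for value in vector]


def occurrence_distance(first, second):
    """Euclidean distance between two occurrences treated as fixed points."""
    a, b = flatten(first), flatten(second)
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical occurrences with four segments and six features each.
first = [[0.0] * 6 for _ in range(4)]
second = [[0.0] * 6 for _ in range(4)]
second[0][0] = 3.0
second[2][5] = 4.0
print(occurrence_distance(first, second))  # 5.0

# A third hypothetical occurrence with seven segments cannot be compared.
third = [[0.0] * 6 for _ in range(7)]
try:
    occurrence_distance(first, third)
except ValueError as error:
    print(error)  # dimension mismatch: 24 vs 42
```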
To solve this problem, rather than characterizing a letter occurrence as a single feature vector, the letter occurrence may better be characterized as a sequence of points in space, called a trajectory. It should be noted that in the present specification a "trajectory" of a letter does not refer to the shape of the letter as drawn by the path of a pen. Rather, in the present specification, a "trajectory" is the path in feature space formed by a sequence of feature vectors corresponding to the segments of a letter.
FIG. 4B is a diagram illustrating the trajectory of two occurrences of a letter in three-dimensional feature space, where each trajectory is formed from a sequence of four feature vectors. Although a three-dimensional feature space is shown in FIGS. 4A-4B, in general the feature space will have a higher dimensionality. The first letter occurrence is represented by trajectory T1 which contains four feature vectors, P1, P2, P3, and P4. The second letter occurrence is represented by trajectory T2 which also contains four feature vectors, S1, S2, S3, and S4. The feature vectors for T1 correspond to the feature vectors shown in FIG. 4A. Unlike in FIG. 4A, however, the feature vectors P1, P2, P3, and P4 are not combined to form a single vector, but rather they are combined to form a trajectory through feature space.
The problem in determining the similarity between the two letter occurrences is that the trajectory for the second occurrence T2 is shorter in overall vector length than the trajectory for the first occurrence T1 even though both T1 and T2 depict the same letter. Also compounding the problem is the fact that a trajectory of a first letter occurrence may contain a different number of feature vectors than the trajectory of a second letter occurrence. If the feature vectors of letter occurrences are to be characterized as variable length trajectories, then the similarity of two such trajectories cannot be determined by simply measuring the distance between vectors.
The foregoing discussion illustrates why it has been difficult to determine the existence of subcategories of similar letter occurrences in feature space. The difficulty in determining the existence of subcategories is also the reason why in general only one HMM has been used to model all occurrences of a particular letter.
Accordingly, it is an object of the present invention to provide a method for calculating the similarity between occurrences of a particular letter, and more particularly, it is an object of the present invention to provide an automatic method for discovering, from handwriting training data, subcategories of a particular letter in feature space that correspond to similar letter occurrences.
It is a further object of the present invention to provide a plurality of interacting HMMs for each letter of the alphabet, where each HMM models a subcategory of letter occurrences for the particular letter.
It is another object of the present invention to provide an unsupervised automatic method for screening handwriting data.
Additional objects and advantages of the invention will be set forth in part in the description which follows, and in part become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the claims.