Contemporary computing devices allow users to enter handwritten words (e.g., in cursive handwriting and/or printed characters), characters and symbols (e.g., characters in Far East languages). The words, characters and symbols can be used as is, such as to function as readable notes and so forth, or can be converted to text or similar computer codes for more conventional computer uses. To convert to text, for example, as a user writes strokes representing words or other symbols (chirographs) onto a touch-sensitive computer screen or the like, a handwriting recognizer (e.g., trained with millions of samples, employing a dictionary, context and/or other rules) is able to convert the handwriting data into separate characters, dictionary words or symbols. In this way, users are able to enter textual data and/or other computer symbols without necessarily needing a keyboard. Speech recognizers may be arranged to operate in a similar manner.
One type of recognizer returns a list of recognition candidates, each with an associated score corresponding to the probability, between zero and one hundred percent, that the candidate is correct. For purposes of programming and mathematical convenience, the probability score may be returned as a negative natural log of the probability percentage, so that the highest probability candidate has the lowest associated value. Because in this instance a smaller score corresponds to a better match, the score is sometimes referred to as a cost, with the lowest cost indicating the best match.
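The relationship between probability and cost can be illustrated with a minimal sketch. It assumes the probability is expressed as a fraction in (0, 1] rather than as a percentage; the function name `probability_to_cost` and the candidate values are hypothetical and are not taken from any particular recognizer.

```python
import math


def probability_to_cost(probability: float) -> float:
    """Convert a probability in (0, 1] to a negative-natural-log cost."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)


# Hypothetical candidate list for one handwritten character (assumed values).
candidates = {"S": 0.70, "5": 0.25, "s": 0.05}

# The most probable candidate receives the smallest cost.
costs = {label: probability_to_cost(p) for label, p in candidates.items()}
best = min(costs, key=costs.get)
```

Under these assumed values, the candidate with the highest probability is also the one with the lowest cost, which is why the best match is found with `min` rather than `max`.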
One such recognizer comprises multiple recognition components, each referred to as an expert. Multiple experts can improve recognition accuracy by having each expert compute various input features and provide a result set of candidates and scores, with a final result set of candidates and scores produced by mathematically combining the result sets of each expert. For example, in a negative natural log configuration, scores from each expert are added together to produce a final result set. In this way, user input is analyzed by multiple experts, which may have very different ways of analyzing (e.g., featurizing) the input to produce their respective alternatives, which can significantly increase recognition accuracy.
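Combining by addition in a negative natural log configuration might be sketched as follows. Adding negative logs is equivalent to multiplying the underlying probabilities. The function name `combine_costs`, the cost values, and the `missing_cost` penalty applied when an expert does not propose a candidate are all assumptions made for illustration; the text above does not specify how missing candidates are handled.

```python
def combine_costs(expert_results, missing_cost=10.0):
    """Sum per-candidate costs across experts and sort best (lowest) first.

    missing_cost is a hypothetical penalty for a candidate that an
    expert did not return at all.
    """
    labels = set()
    for result in expert_results:
        labels.update(result)
    combined = {
        label: sum(result.get(label, missing_cost) for result in expert_results)
        for label in labels
    }
    return dict(sorted(combined.items(), key=lambda item: item[1]))


# Assumed costs from two hypothetical experts for the same input.
expert_a = {"S": 0.4, "5": 1.6, "s": 2.5}
expert_b = {"S": 0.9, "5": 1.1}

final = combine_costs([expert_a, expert_b])
top_candidate = next(iter(final))
```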
One problem with this approach is that instead of improving the overall recognition accuracy relative to one expert's result, the other expert or experts can reduce accuracy. For example, consider handwriting input intended to represent the letter “S” and correctly recognized (i.e., given the lowest cost score) by one expert. Another expert may recognize the input as most likely being the number “5”, with a cost that is low enough relative to its cost for the “S” that, when the result sets are combined, the first expert's correct result is overturned.
In order to improve overall recognition results, the weight of each expert can be tuned relative to each other expert. A straightforward way to do this is to multiply each expert's result set by a weight constant determined for it, which may be a fraction. Then, when mathematically combining one expert's scores with the scores of one or more other experts, certain of the experts will have less influence on the result. For example, in a two-expert recognizer, one expert can be considered more influential and given a weight of one (no multiplier needed), while the other expert's results can be halved, i.e., the first expert's score can be summed with half the secondary expert's score to produce the final recognition result set.
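The effect of weighting can be sketched with the “S” versus “5” case. The cost values below are invented for illustration, and `combine_weighted` and `best_candidate` are hypothetical helper names; the sketch only considers candidates returned by both experts.

```python
def combine_weighted(primary, secondary, secondary_weight=1.0):
    """Add the primary costs to the secondary costs scaled by a weight."""
    shared = primary.keys() & secondary.keys()
    return {
        label: primary[label] + secondary_weight * secondary[label]
        for label in shared
    }


def best_candidate(costs):
    """Return the label with the lowest combined cost."""
    return min(costs, key=costs.get)


# Assumed costs: the primary expert correctly prefers "S",
# while the secondary expert prefers "5".
primary = {"S": 0.5, "5": 1.2}
secondary = {"S": 1.6, "5": 0.6}

# Unweighted sum: S = 2.1, 5 = 1.8, so the wrong candidate wins.
unweighted_best = best_candidate(combine_weighted(primary, secondary, 1.0))

# Secondary halved: S = 1.3, 5 = 1.5, so the correct candidate wins.
halved_best = best_candidate(combine_weighted(primary, secondary, 0.5))
```

With these assumed numbers, halving the secondary expert's costs is enough to restore the primary expert's correct result; other inputs may of course call for a different weight.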
A problem with this approach is determining the optimum constant to use as a weight factor, which may need to be redetermined fairly often, as additional samples are obtained or as recognizer technology evolves into new types of experts. While this may seem to be a straightforward empirical experiment (e.g., try each possible value and see which one best improves overall accuracy on a set of sample data), it is computationally expensive: with millions of samples, a single test run can take many hours, even on relatively powerful computing devices, and many such parameter values need to be evaluated to find an optimum one. For example, consider tuning a secondary expert (with the other expert not multiplied) by taking every possible multiplying constant (e.g., from 0.001 to 1.000) for that secondary expert, and trying each one against a sample set of millions of chirographs to see which constant provides the best overall recognition accuracy. Such a thousand-pass trial may take days or weeks to run, and may have to be repeated each time new samples are obtained or an expert is modified. Moreover, such a trial-and-error solution becomes exponentially more costly with three or more experts.
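The brute-force trial described above might look like the following sketch. The two-sample data set, the function names `accuracy` and `sweep_weights`, and the grid of 1,000 evenly spaced weights are assumptions for illustration. Each candidate weight requires a full pass over the samples, so the work is proportional to the number of weights times the number of samples; with k independently tuned experts on the same grid, the number of weight combinations grows as the grid size raised to the power k.

```python
def accuracy(samples, weight):
    """Fraction of samples whose lowest combined cost is the true label."""
    correct = 0
    for truth, primary, secondary in samples:
        shared = primary.keys() & secondary.keys()
        combined = {
            label: primary[label] + weight * secondary[label]
            for label in shared
        }
        if min(combined, key=combined.get) == truth:
            correct += 1
    return correct / len(samples)


def sweep_weights(samples, steps=1000):
    """Try weights 1/steps, 2/steps, ..., 1.0; one full pass per weight."""
    best_weight, best_accuracy = None, -1.0
    for i in range(1, steps + 1):
        weight = i / steps
        current = accuracy(samples, weight)
        if current > best_accuracy:
            best_weight, best_accuracy = weight, current
    return best_weight, best_accuracy


# Tiny hypothetical sample set: (true label, primary costs, secondary costs).
samples = [
    ("S", {"S": 0.5, "5": 1.2}, {"S": 1.6, "5": 0.6}),
    ("5", {"S": 0.9, "5": 1.0}, {"S": 1.5, "5": 0.5}),
]

best_weight, best_accuracy = sweep_weights(samples)
```

On two samples this runs instantly; the same loop over millions of chirographs, with recognition itself performed inside each pass, is what makes the exhaustive trial so costly.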