The high computational power that is now widely available in desktop and even handheld devices has opened up the possibility of widespread use of statistical pattern recognition for a variety of applications. Examples include speech recognition, handwriting recognition, and face recognition. Beyond these familiar uses, there are numerous specialized applications.
In broad overview, an archetypical statistical pattern recognition system works as follows. First, raw data is collected from the thing to be recognized. For example, for face recognition a photograph is taken of the face to be recognized, and for speech recognition the spoken words to be recognized are input through a microphone. The raw data has a relatively high byte size, includes noise, and includes a great deal of minor statistical variation. For example, in examining audio waveforms of different people, or even of the same person, saying the same word many times, it will be seen that no two waveforms are identical, even though the underlying pattern (e.g., the word to be recognized) is the same. Such variability makes pattern recognition challenging.
Once the raw data has been collected, it is subjected to feature extraction. Typically, the role of feature extraction is to extract essential information from the raw data by projecting the relatively high-byte-size raw data onto a finite-dimensional orthogonal basis. The details of the feature extraction process are outside the focus of the present description. The byte size of the extracted feature vector(s) is generally lower than that of the raw data.
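By way of illustration only, the projection described above can be sketched as follows. The sketch assumes a discrete cosine (DCT-II) basis and a synthetic 64-sample frame; neither is specified in the present description, and the names `dct_basis` and `extract_features` are hypothetical.

```python
import math

def dct_basis(n_samples, n_features):
    """Return the first n_features orthonormal DCT-II basis vectors (assumed basis)."""
    basis = []
    for k in range(n_features):
        # Scale factors that make the basis vectors unit length.
        scale = math.sqrt(1.0 / n_samples) if k == 0 else math.sqrt(2.0 / n_samples)
        basis.append([scale * math.cos(math.pi * (i + 0.5) * k / n_samples)
                      for i in range(n_samples)])
    return basis

def extract_features(raw, basis):
    """Project the raw samples onto each basis vector: one dot product per feature."""
    return [sum(b * x for b, x in zip(vec, raw)) for vec in basis]

# Synthetic raw frame standing in for collected data (illustrative values only).
raw_frame = [math.sin(2 * math.pi * 3 * i / 64) + 0.1 * math.cos(2 * math.pi * 20 * i / 64)
             for i in range(64)]
basis = dct_basis(len(raw_frame), 8)
features = extract_features(raw_frame, basis)
print(len(raw_frame), "raw samples ->", len(features), "features")
```

In this sketch, 64 raw samples are reduced to an 8-element feature vector, which mirrors the reduction in byte size noted above.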
Next, the feature vectors are plugged into a number of statistical pattern recognition models, each representing a different pattern (e.g., face or pronounced word), in order to determine the statistical pattern recognition model that yields the highest probability score, and thereby identify the pattern (e.g., word, face) in the raw data. One example of a statistical pattern recognition model is a mixture (weighted sum) of several multidimensional probability density functions (e.g., Gaussians). This is a kind of generalization of the familiar one-dimensional Gaussian distribution. A mixture is used because a given pattern (e.g., the letter v written in script) may appear in two or more variants. For example, words may be written or pronounced in two or more different ways. Note, however, that the use of mixtures to handle multiple variants of each pattern to be recognized also increases the danger that a variant of one pattern (e.g., the written letter v) might be mistaken for a variant of another pattern (e.g., the written letter u), leading to a recognition error.
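As a minimal sketch of the scoring step described above, the following assumes Gaussian mixture components with diagonal covariance and works with log probabilities for numerical stability. The model parameters, the two-dimensional feature vectors, and the function names are hypothetical and chosen only for illustration.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a multidimensional Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_mixture_score(x, model):
    """Log of the weighted sum of component densities; model is a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in model]
    peak = max(terms)
    # Log-sum-exp: factor out the largest term to avoid underflow.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def recognize(x, models):
    """Return the name of the model that yields the highest score for x."""
    return max(models, key=lambda name: log_mixture_score(x, models[name]))

# Two hypothetical models, each with two components (two variants of each pattern).
models = {
    "u": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, -1.0], [1.0, 1.0])],
    "v": [(0.6, [4.0, 4.0], [1.0, 1.0]), (0.4, [5.0, 3.0], [1.0, 1.0])],
}
print(recognize([4.2, 3.9], models))
```

If the components of two different models lie close together in feature space, their scores for a given feature vector become similar, which is the source of the confusion between variants noted above.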
Prior to using the statistical pattern recognition models in performing recognition, the models are ‘trained.’ The object of training is to determine the parameters (e.g., vector means, variances, and mixture component weights) of each particular statistical pattern recognition model, so that each particular model yields the highest probability score compared to other models when evaluated using feature vectors extracted from raw data including the pattern that the model is intended to recognize. Typically, a set of training data samples is used to train each statistical pattern recognition model. The set includes many different versions of the same pattern, for example the word "seven" spoken by 100 different people.
Typically, each model's parameters are adjusted using an iterative optimization scheme to maximize a summed probability score over the set of training data samples for the model. In order to use training data, the identity of the pattern or patterns present in the training data needs to be known so that the data can be used to train the correct model. The identity of each pattern is typically transcribed by a human transcriber. Because large amounts of training data are often used, human error leads to a certain percentage of error in the transcribed identities. Such errors can lead to poor training and degraded recognition performance.
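The present description does not name a particular optimization scheme. Purely as an illustration, the sketch below uses expectation-maximization (EM), one common iterative scheme, to fit a two-component, one-dimensional Gaussian mixture by increasing the summed log probability score of its training samples. The training values, initial parameters, and variance floor are assumptions made for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def total_log_score(samples, weights, means, variances):
    """Summed log probability score of all training samples under the mixture."""
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in samples)

def em_step(samples, weights, means, variances, var_floor=1e-3):
    """One EM iteration: soft-assign samples to components, then re-estimate parameters."""
    # E-step: responsibility of each component for each sample.
    resp = []
    for x in samples:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: update weights, means, and variances from the responsibilities.
    new_w, new_m, new_v = [], [], []
    for c in range(len(weights)):
        mass = sum(r[c] for r in resp)
        mean = sum(r[c] * x for r, x in zip(resp, samples)) / mass
        var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, samples)) / mass
        new_w.append(mass / len(samples))
        new_m.append(mean)
        new_v.append(max(var, var_floor))  # assumed floor to keep variances positive
    return new_w, new_m, new_v

# Hypothetical one-dimensional training features containing two variants of one pattern.
samples = [0.9, 1.1, 1.0, 1.2, 0.8, 4.9, 5.1, 5.0, 5.2, 4.8]
weights, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
history = [total_log_score(samples, weights, means, variances)]
for _ in range(20):
    weights, means, variances = em_step(samples, weights, means, variances)
    history.append(total_log_score(samples, weights, means, variances))
print("means:", means, "weights:", weights)
```

Because the samples passed to a model's training are selected by their transcribed identity, a mislabeled sample in this sketch would be pulled into the wrong model's estimates, which is how the transcription errors noted above degrade training.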
In pattern recognition systems, there is a tradeoff between the computational cost of running a system, which depends on factors such as the dimension of the feature vectors and the number of mixture components in each statistical pattern model, and the accuracy of the system (i.e., the percent of correct recognitions achieved). In handheld devices in particular, it is desirable to control the computational cost, because high computational cost implies quicker battery depletion. In all pattern recognition systems, it is desirable to improve accuracy. Thus, there is, in general, a desire to improve accuracy and, for handheld devices, to do so without increasing computational cost.
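To make the cost dependence concrete, the following rough sketch counts the per-dimension Gaussian terms that must be evaluated to score one feature vector against every model, assuming diagonal-covariance components and the same number of components in every model. The function name and the figures used are hypothetical; actual cost also depends on implementation details not covered here.

```python
def per_vector_cost(num_models, num_components, feature_dim):
    """Per-dimension Gaussian terms evaluated to score one feature vector against all models."""
    return num_models * num_components * feature_dim

# Illustrative figures only: doubling the components per model doubles the count.
baseline = per_vector_cost(num_models=100, num_components=8, feature_dim=39)
richer = per_vector_cost(num_models=100, num_components=16, feature_dim=39)
print(baseline, richer)
```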