1. Field of the Invention
This invention relates to automatic pattern recognition in which an unknown input is compared to reference data representative of allowed patterns and the unknown input is identified as the most likely reference pattern.
2. Description of Related Art
Reference data for each member of a set of allowed patterns is stored, and a test input is compared with the reference data to recognise the input pattern. An important factor to consider in automatic pattern recognition is undesired variation in characteristics, for instance in speech or handwriting, due to time-localised anomalous events. The anomalies can arise from various sources, such as the communication channel, environmental noise, uncharacteristic sounds from speakers, or unmodelled writing conditions. The resultant variations cause a mismatch between the corresponding test and reference patterns, which in turn can lead to a significant reduction in recognition accuracy.
The invention has particular, although not exclusive, application to automatic speaker recognition. Speaker recognition covers both the task of speaker identification and speaker verification. In the former case, the task is to identify an unknown speaker as one from a pre-determined set of speakers; in the latter case, the task is to verify that a person is the person they claim to be, again from a pre-determined set of speakers. Hereinafter reference will be made to the field of speaker recognition but the technique is applicable to other fields of pattern recognition.
To improve robustness in automatic speaker recognition, a reference model is usually based on a number of repetitions of the training utterance recorded in multiple sessions. The aim is to increase the possibility that at least one of the utterance repetitions in the training set captures recording conditions and speaking behaviour close to those of the test utterance. The enrolled speaker may then be represented using a single reference model formed by combining the given training utterance repetitions. A potential disadvantage of this approach is that a training utterance repetition which is very different from the test utterance may corrupt the combined model and hence seriously affect the verification performance. An alternative method is to represent each registered speaker using multiple reference models. However, since the level of mismatch normally varies across the utterance, the improvement achieved in this way may not be significant.
The methods developed previously for introducing robustness into the speaker verification operation have been mainly based on the normalisation of verification scores. The development of these methods has been a direct result of the probabilistic modelling of speakers as described in the article by M. J. Carey and E. S. Parris, "Speaker Verification", Proceedings of the Institute of Acoustics (UK), vol. 18, pp. 99-106, 1996 and an article by N. S. Jayant, "A Study of Statistical Pattern Verification", IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-2, pp. 238-246, 1972. By adopting this method of modelling and using Bayes' theorem, the verification score can be expressed as a likelihood ratio, i.e.

\[
\text{Verification Score} = \frac{\text{likelihood (score) for the target speaker}}{\text{likelihood (score) for any speaker}}
\]
The above expression can be viewed as obtaining the verification score by normalising the score for the target speaker.
A well-known normalisation method is that based on the use of a general (speaker-independent) reference model formed by using utterances from a large population of speakers (M. J. Carey and E. S. Parris, "Speaker Verification Using Connected Words", Proceedings of the Institute of Acoustics (UK), vol. 14, pp. 95-100, 1992). In this method, the score for the general model is used for normalising the score for the target speaker. Another effective method in this category involves calculating a statistic of scores for a cohort of speakers, and using this to normalise the score for the target speaker, as described in A. E. Rosenberg, J. Delong, C. H. Lee, B. H. Huang, and F. K. Soong, "The Use of Cohort Normalised Scores for Speaker Verification", Proc. ICSLP, pp. 599-602, 1992 and an article by T. Matsui and S. Furui, "Concatenated Phoneme Models for Text-Variable Speaker Recognition", Proc. ICASSP, pp. 391-394, 1993. The normalisation methods essentially operate on the assumption that the mismatch is uniform across the given utterance. Based on this assumption, the score for the target speaker is first calculated using the complete utterance. This score is then scaled by a certain factor depending on the particular method used.
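The two normalisation schemes described above can be illustrated with a minimal sketch. It assumes the scores are log-likelihoods, so that the likelihood ratio becomes a difference, and it assumes the cohort statistic is selectable (maximum by default); the function names and the choice of statistic are illustrative and are not taken from the cited articles.

```python
from statistics import mean


def normalise_with_general_model(target_loglik, general_loglik):
    """Log-domain likelihood ratio: target score minus general-model score."""
    return target_loglik - general_loglik


def normalise_with_cohort(target_loglik, cohort_logliks, statistic=max):
    """Normalise the target score by a statistic of the cohort scores.

    `statistic` is an assumption of this sketch: any callable that maps a
    list of cohort log-likelihoods to one number (e.g. max or mean).
    """
    if not cohort_logliks:
        raise ValueError("cohort_logliks must not be empty")
    return target_loglik - statistic(cohort_logliks)


# Hypothetical scores for one complete test utterance.
target = -120.0
general = -130.0
cohort = [-140.0, -125.0, -135.0]

score_general = normalise_with_general_model(target, general)
score_cohort_max = normalise_with_cohort(target, cohort)
score_cohort_mean = normalise_with_cohort(target, cohort, statistic=mean)
```

Note that both functions apply a single correction to a score computed over the whole utterance, which mirrors the uniform-mismatch assumption stated above.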
The invention seeks to reduce the adverse effects of variation in patterns.
In accordance with the invention there is provided a method of pattern recognition.
Thus the invention relies on representing allowed patterns using segmented multiple reference models and minimising the mismatch between the test and reference patterns. This is achieved by using the best segments from the collection of models for each pattern to form a complete reference template.
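The selection of best segments can be sketched as follows. This is a simplified illustration only: it assumes the test pattern and every reference model have already been divided into the same number of aligned segments, that each segment is a fixed-length feature vector, and that Euclidean distance measures the mismatch. None of these choices is specified in the text above, and all names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_segment_template(test_segments, reference_models, distance=euclidean):
    """Build a reference template from the closest segment of each model.

    test_segments:    list of K segment vectors for the test pattern.
    reference_models: list of models, each a list of K segment vectors.
    Returns (template, distances), both of length K.
    """
    if not reference_models:
        raise ValueError("at least one reference model is required")
    template, distances = [], []
    for k, test_seg in enumerate(test_segments):
        # Compare segment k of the test input against segment k of every model.
        candidates = [model[k] for model in reference_models]
        dists = [distance(test_seg, c) for c in candidates]
        best = min(range(len(dists)), key=dists.__getitem__)
        template.append(candidates[best])
        distances.append(dists[best])
    return template, distances


# Two hypothetical reference models for the same pattern, two segments each.
test = [[0.0, 0.0], [1.0, 1.0]]
models = [
    [[0.0, 0.0], [5.0, 5.0]],  # model A: good match on segment 0
    [[3.0, 4.0], [1.0, 2.0]],  # model B: good match on segment 1
]
template, segment_distances = best_segment_template(test, models)
```

In this toy case the template takes segment 0 from model A and segment 1 from model B, so a poor segment in one model does not penalise the whole pattern.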
Preferably, the mismatch associated with each individual segment is then estimated, and this information is used to compute a weighting factor for correcting each segmental distance prior to the calculation of the final distance.
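A minimal sketch of segmental weighting is given below. The text above does not state how the mismatch is estimated or how the weighting factor is derived from it, so this sketch assumes a per-segment mismatch estimate is already available as a positive number, takes the weight to be its reciprocal, and averages the weighted segmental distances to obtain the final distance. These are assumptions made for illustration, and the names are hypothetical.

```python
def weighted_final_distance(segment_distances, mismatch_estimates, eps=1e-12):
    """Average of segmental distances, each scaled by 1 / estimated mismatch.

    The reciprocal weighting and the averaging are assumptions of this
    sketch; `eps` guards against division by zero.
    """
    if len(segment_distances) != len(mismatch_estimates):
        raise ValueError("one mismatch estimate is required per segment")
    if not segment_distances:
        raise ValueError("at least one segment is required")
    weighted = [
        d / max(m, eps) for d, m in zip(segment_distances, mismatch_estimates)
    ]
    return sum(weighted) / len(weighted)


# Hypothetical values: the second segment is heavily mismatched, so its
# large raw distance is scaled down before the final distance is computed.
final = weighted_final_distance([2.0, 6.0], [1.0, 3.0])
```

Under these assumptions, a segment affected by a time-localised anomaly contributes less to the final distance than it would under a uniform scaling of the whole utterance.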