1. Field of the Invention (Technical Field)
The present invention relates to an algorithm for improving speaker identification over lossy channels. In particular, the present invention is preferably directed to an algorithm that trains a Gaussian Mixture Model (GMM) for each known speaker under several packet loss rate models, and that identifies the best speaker match over all of the loss model sets.
2. Description of Related Art
Note that the following discussion refers to a number of publications by author(s) and year of publication, and that due to recent publication dates certain publications are not to be considered as prior art vis-a-vis the present invention. Discussion of such publications herein is given for more complete background and is not to be construed as an admission that such publications are prior art for patentability determination purposes.
An objective of speaker identification algorithms is to determine which voice sample from a set of known voice samples best matches the characteristics of an unknown input voice sample. This involves extraction of speaker dependent features from the known voice samples, model building for each known sample, and eventual matching of the features extracted from the unknown voice sample.
Speaker identification systems typically work as follows. Prior to speaker identification, the system must first be trained, i.e., it creates a table associating each individual speaker with a distinguishing set of parameters based on the individual's speech signal. Afterward, a new speech signal from an unknown user is acquired and a parameter set is determined. Finally, the unknown individual's parameter set is compared with the entries in the table in order to determine the closest “match” and thereby identify the speaker.
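The train, acquire, and compare steps described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the mean and standard deviation of the signal stand in for a real speaker-dependent parameter set, which the discussion above does not specify.

```python
import math

def extract_parameters(signal):
    """Hypothetical feature extractor: (mean, standard deviation) of the signal."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal) / n
    return (mean, math.sqrt(variance))

def train(known_signals):
    """Build the table associating each known speaker with a parameter set."""
    return {speaker: extract_parameters(sig) for speaker, sig in known_signals.items()}

def identify(table, unknown_signal):
    """Return the speaker whose table entry is closest to the unknown parameter set."""
    params = extract_parameters(unknown_signal)
    return min(table, key=lambda speaker: math.dist(table[speaker], params))

# Toy signals, assumed purely for illustration.
known = {
    "alice": [0.1, 0.2, 0.15, 0.25],
    "bob": [0.9, 1.1, 1.0, 0.8],
}
table = train(known)
result = identify(table, [0.95, 1.05, 0.9, 1.0])
```

Here the closest match is found by Euclidean distance between parameter sets; a practical system would substitute its own features and matching criterion.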
Of the various speaker identification techniques, the Gaussian mixture model (GMM)-based speaker identification algorithm has been shown to be remarkably successful in identifying speakers from a large population. The GMM approach provides a probabilistic model in which an implicit segmentation of the speech into phonetic sound classes takes place prior to speaker model training. It is further known that the performance of the GMM-based method is near 100% up to a population size of 630 speakers using the TIMIT speech database (clean speech) with about 24 seconds of training and 6 seconds of test utterances (see D. Reynolds and R. Rose, “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models,” IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995). However, the performance degrades significantly for telephone-quality speech and is only about 60% for a population of similar size.
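As a rough illustration of GMM-based matching, the sketch below scores a sequence of feature frames against each speaker's mixture model and selects the speaker with the highest total log-likelihood. It assumes diagonal covariances and hand-set toy parameters; the function names, the model layout (a list of weight, mean, variance triples), and the numbers are all hypothetical, and no model training is shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Sum over frames of the log of the weighted mixture density.

    gmm is a list of (weight, mean, variance) triples (assumed layout).
    """
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total

def best_speaker(frames, models):
    """Return the speaker whose GMM gives the highest log-likelihood."""
    return max(models, key=lambda s: gmm_log_likelihood(frames, models[s]))

# Toy two-component models with hand-set parameters (illustration only).
models = {
    "speaker_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "speaker_b": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}
frames = [[5.1, 4.9], [5.8, 6.2], [5.5, 5.4]]
best = best_speaker(frames, models)
```

Summing per-frame log-likelihoods treats the frames as independent, which is a common simplifying assumption for this kind of scoring.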
Recently, there has been interest in studying the performance of speaker identification algorithms in the context of mobile wireless channels. It is well known that, in order to achieve high transmission efficiency, speech signals in such systems pass through speech coders and decoders, which modify the original voice signal. In addition, the uncertain connection strength of wireless channels can cause data packet loss during deep fading periods. Each data packet contains a fixed number of speech samples, and the loss of a packet results in the loss of the speech samples contained in that packet. For small packet sizes, these losses can degrade the accuracy of the speaker identification system.
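The effect of packet loss on a speech signal can be illustrated with a small simulation. The sketch below splits a sample sequence into fixed-size packets and drops each packet independently at a given loss rate. Independent loss is an assumption made here for simplicity (losses during deep fades tend to occur in bursts), and the function name and parameters are hypothetical.

```python
import random

def drop_packets(samples, packet_size, loss_rate, seed=0):
    """Return the samples that survive when packets are lost at random.

    Each packet holds packet_size consecutive samples and is dropped
    independently with probability loss_rate (simplifying assumption).
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    received = []
    for start in range(0, len(samples), packet_size):
        if rng.random() >= loss_rate:
            received.extend(samples[start:start + packet_size])
    return received

speech = list(range(100))  # placeholder for real speech samples
survivors = drop_packets(speech, packet_size=10, loss_rate=0.3)
```

Every lost packet removes a whole run of consecutive samples, so the surviving signal is shorter than the original and contains discontinuities at the packet boundaries.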
The effect of GSM (Global System for Mobile Communication) coders on speaker recognition has previously been investigated (see L. Besacier, S. Grassi, A. Dufaux, M. Ansorge, and F. Pellandini, “GSM speech coding and speaker recognition,” in Proc. IEEE ICASSP'00, June 2000). It has been shown that the use of GSM coding significantly degrades performance. By extracting features directly from the encoded bit stream, Besacier et al. were able to improve the performance of the system. However, the effects of packet loss due to the mobile wireless channel have a significant impact on such systems.
U.S. Pat. No. 6,389,392, to Pawlewski et al., discloses a speaker recognition system which makes use of an algorithm which itself relies on Mel Frequency Cepstrum Coefficients, overlapping Hamming windows, Fast Fourier Transforms, and logarithmically spaced triangular band pass filters. The prior art, including that disclosed by Pawlewski et al., fails to teach a system which can be trained with several packet loss models. Further, Pawlewski et al. rely on pattern recognition rather than on statistical analysis for identification. There is thus a need for an invention which improves speaker identification in the presence of packet losses, particularly those losses associated with wireless channels and Voice over IP (VoIP) internet environments.