The present invention relates to methods and systems for machine-based recognition of the source of acoustic phenomena from the phenomena themselves. More particularly, the present invention relates to methods and systems for machine-based recognition in which there may be a mismatch between the acoustic input devices (e.g., telephone handsets) used during testing and those used during training. A particularly appropriate application of the present invention is speaker recognition, i.e., recognition of the identity of a speaker by the speaker's voice.
In speaker recognition, including telephone-based speaker recognition, it has been widely recognized that classification performance degrades due to corruptions of the signal in the transmission channel. Furthermore, it has been shown that one of the most significant contributors to performance degradation of speaker recognition systems is a mismatch in acoustic input device types between training and testing (e.g., training on carbon-microphone telephone handsets but testing on electret-microphone telephone handsets). See D. A. Reynolds, "The Effects of Handset Variability on Speaker Recognition Performance: Experiments on the Switchboard Corpus," Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing ("ICASSP") (1996), hereinafter referred to as "Reynolds '96." In the present specification, the word "handset" will frequently be used for convenience of expression to mean any type of acoustic input device, including those that are not actually hand-held.
Speaker recognition systems have, in the past, made use of well-established techniques to compensate for channel distortions. Some of these techniques are described, for example, in the following references: B. Atal, "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification," J. Acoust. Soc. of Am., pp. 1304-1312 (1974) [hereinafter referred to as "Atal '74"]; F. Soong and A. Rosenberg, "On The Use Of Instantaneous And Transitional Spectral Information In Speaker Recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing ("ASSP"), vol. ASSP-36, pp. 871-879 (June 1988) [hereinafter referred to as "Soong '88"]; and H. Hermansky et al., "RASTA-PLP Speech Analysis Technique," Proc. of IEEE ICASSP (1992) [hereinafter referred to as "Hermansky '92"].
Compensation techniques that have been used for speaker recognition include cepstral mean subtraction (Atal '74), the use of delta coefficients as acoustic features (Soong '88), and RASTA filtering of speech data (Hermansky '92). While systems using such techniques can effectively compensate for linear channel distortions in the frequency domain, they are generally less effective in treating the handset mismatch problem. The reason for this lack of effectiveness is that handsets tend to have transfer characteristics which introduce distortions other than mere linear distortions in the frequency domain.
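By way of illustration only, the following minimal sketch suggests why cepstral mean subtraction and delta coefficients can compensate for a fixed linear channel. It assumes, for the sake of the example, that such a channel appears as a constant additive offset on every frame of cepstral features. The function names, the array shapes, and the randomly generated data are hypothetical and are not taken from the cited references.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)


def delta_coefficients(cepstra):
    """Simple first-order frame-to-frame differences (an assumed delta form)."""
    return np.diff(cepstra, axis=0)


rng = np.random.default_rng(0)

# Hypothetical utterance: 200 frames of 12 cepstral coefficients.
clean = rng.normal(size=(200, 12))

# Assumption: a fixed linear channel adds the same offset to every frame.
channel_offset = rng.normal(size=(1, 12))
observed = clean + channel_offset

# Both features are unchanged by the constant offset.
cms_matches = np.allclose(
    cepstral_mean_subtraction(observed), cepstral_mean_subtraction(clean)
)
delta_matches = np.allclose(
    delta_coefficients(observed), delta_coefficients(clean)
)
```

Under the stated assumption, both comparisons hold because a per-frame constant cancels in the mean subtraction and in the frame differences. A distortion that does not reduce to such a constant offset would not cancel in this way, which is consistent with the limitation described above.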
What is needed in the field of speaker recognition is a system that can maintain robust discrimination even in the presence of mismatch among the types of handsets used to process speech for recognition and for training speech models.