1. Field of the Invention
The present invention relates to a speaker recognition system or similar apparatus which applies adaptive weighting to the spectral components in each frame of speech in order to normalize the speech spectrum, thereby reducing irrelevant variations such as communication channel effects. It also introduces a frame selection scheme by which only speaker-dependent frames with a relatively high segmental signal-to-noise ratio are selected.
2. Description of the Related Art
The objective of a speaker identification system is to determine which speaker is present from an utterance. By contrast, the objective of a speaker verification system is to verify the speaker's claimed identity from an utterance. Speaker identification and speaker verification systems both fall within the general category of speaker recognition.
It is known that typical telephone switching systems often route calls between the same starting and ending locations on different channels. A spectrum of speech determined on each of the channels can have a different shape due to the effects of the channel. Recognition of voices on different channels is therefore difficult because of the variations in the spectrum of speech caused by non-formant spectral components.
Conventional methods have attempted to normalize the spectrum of speech to correct for the spectral shape. U.S. Pat. No. 5,001,761 describes a device for normalizing speech around a certain frequency which has a noise effect. A spectrum of speech is divided at the predetermined frequency. A linear approximation line is determined for each of the divided portions of the spectrum, and the approximation lines are joined at the predetermined frequency to normalize the spectrum. This device has the drawback that each frame of speech is normalized only for the predetermined frequency having the noise effect; the frame of speech is not normalized to reduce the effects of non-formant components, which can occur over a range of frequencies in the spectrum.
U.S. Pat. No. 4,926,488 describes a method for normalizing speech to enhance spoken input in order to account for noise which accompanies the speech signal. This method generates feature vectors of the speech. A feature vector is normalized by an operator function which includes a number of parameters. A closest prototype vector is determined for the normalized vector, and the operator function is altered to move the normalized vector closer to the closest prototype. The altered operator function is then applied to the next feature vector to transform it into a normalized vector. This patent has the limitation that it does not account for the effects of non-formant components, which might occur over more than one frequency.
Speech has conventionally been modeled in a manner that mimics the human vocal tract. Linear predictive coding (LPC) has been used for describing short segments of speech using parameters which can be transformed into a spectrum of positions (frequencies) and shapes (bandwidths) of peaks in the spectral envelope of the speech segments. LPC cepstral coefficients represent the inverse z-transform of the logarithm of the LPC spectrum of a signal. Cepstral coefficients can be derived from the frequency spectrum or from linear predictive (LP) coefficients. Cepstral coefficients can be used as dominant features for speaker recognition.
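The derivation of cepstral coefficients from LP coefficients mentioned above can be illustrated with the standard recursion for an all-pole model. The following is a minimal sketch, not the method of any patent cited here. It assumes the sign convention A(z) = 1 - sum of a_k z^-k for the prediction filter and omits the gain term c_0; the function name `lpc_to_cepstrum` and its parameters are hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients to LPC cepstral coefficients c_1..c_n_ceps.

    Assumed convention: a[k-1] holds a_k for the all-pole model
    H(z) = 1 / (1 - sum_{k=1..p} a_k z^-k). The gain term c_0 is omitted.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is left unused
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n is within the model order.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_(n-k),
        # restricted to indices where a_(n-k) is defined.
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


# Single-pole example: for a_1 = r the cepstrum is known to be c_n = r**n / n.
ceps = lpc_to_cepstrum([0.5], 4)
```

For the single-pole case the closed form c_n = r^n / n gives a convenient check on the recursion.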
It has been found that a reduced set of cepstral coefficients can be used for synthesizing or recognizing speech. U.S. Pat. No. 5,165,008 describes a method for synthesizing speech in which five cepstral coefficients are used for each segment of speaker-independent data. The set of five cepstral coefficients is determined by linear predictive analysis in order to determine a coefficient weighting factor. The coefficient weighting factor minimizes a non-squared prediction error of each element of a vector in the vocal tract resource space. The same coefficient weighting factors are applied to every frame of speech and therefore do not account for the spectral variations resulting from the effects of non-formant components.
It is desirable to provide a set of spectral features that emphasize the vocal-tract (formant) information while reducing the non-vocal-tract effects by applying an adaptive weighting scheme to the components of the short-time spectrum of the speech signal. For this adaptive weighting scheme, it is preferable to provide a frame selection criterion that chooses only frames that exhibit specific spectral characteristics. Such frames correspond to voiced sounds, which carry most of the speaker-characteristic information.
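The idea of keeping only frames with a relatively high segmental signal-to-noise ratio can be sketched as follows. This is an illustration under stated assumptions only: the text above does not specify the selection criterion, so the non-overlapping framing, the externally supplied noise energy estimate, the 10 dB threshold, and the name `select_frames` are all hypothetical choices made for the example.

```python
import math


def select_frames(signal, frame_len, noise_energy, min_snr_db=10.0):
    """Return indices of frames whose segmental SNR meets the threshold.

    Assumptions (not taken from the text): frames do not overlap,
    noise_energy is a mean-square noise estimate obtained elsewhere,
    and min_snr_db is an arbitrary illustrative threshold.
    """
    selected = []
    n_frames = len(signal) // frame_len
    for idx in range(n_frames):
        frame = signal[idx * frame_len:(idx + 1) * frame_len]
        # Mean-square energy of the frame, floored to avoid log of zero.
        energy = max(sum(x * x for x in frame) / frame_len, 1e-12)
        snr_db = 10.0 * math.log10(energy / noise_energy)
        if snr_db >= min_snr_db:
            selected.append(idx)
    return selected


# Toy signal: a quiet frame, a loud frame, then another quiet frame.
toy = [0.01] * 4 + [1.0] * 4 + [0.01] * 4
kept = select_frames(toy, frame_len=4, noise_energy=1e-4)
```

In this toy signal only the middle frame lies well above the assumed noise floor, so it is the only frame retained. A practical system would also need a test for the spectral characteristics of voiced sounds, which this sketch does not attempt.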