1. Field of the Invention
The present invention relates to computer-assisted speaker (as opposed to speech) recognition technology. A method and apparatus are disclosed for robust, text-independent (or text-dependent) speaker recognition in which identification of a speaker is based on selected spectral information from the speaker's voice.
2. Description of the Prior Art
Human speech is propagated through the air by speech waves. A speech wave is the response of a speaker's vocal tract system to the glottal source (as in /a/), the friction source (as in /s/), or combinations of the two sources (as in /z/). Acoustically, the vocal tract system is a resonator, whose resonance frequencies are governed by the size and shape of the vocal tract and the positions of its active articulators, such as the tongue.
A speech wave contains phonetic/linguistic information that conveys a particular message, as well as identity information that is uniquely characteristic of a speaker. In general, the pattern of the resonance frequencies determines the phonetic/linguistic content of speech, while speaker identity is correlated with physiological (and also behavioral) characteristics of a speaker.
In the frequency domain, phonetic/linguistic information is mainly confined to the frequency range of approximately 0 to 5 kHz. For example, conventional telephone speech is band-limited from 300 to 3200 Hz, and highly intelligible speech can be synthesized from formants at frequencies below 5 kHz. Speaker identity information, on the other hand, is spread over the entire frequency axis. For example, speaker-specific attributes based on glottal sources are mainly confined to the low frequency range, while speaker-specific attributes based on friction sources are carried mainly in the high frequency range. High frequency spectra also contain information about cross-modes of the vocal tract, the size of the larynx tube, and the overall length of the vocal tract.
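By way of illustration only, the division of a speech spectrum into a low band and a high band can be sketched as follows. The function name `band_energy_fractions`, the 16 kHz sampling rate, and the two-tone test signal are assumptions made for this example and are not taken from the disclosure; the 5 kHz split point follows the approximate upper limit of phonetic/linguistic information noted above. The 200 Hz tone is merely a stand-in for low-frequency (glottal) content and the 6 kHz tone a stand-in for high-frequency (friction) content.

```python
import cmath
import math


def band_energy_fractions(samples, sample_rate, split_hz):
    """Return (low, high): fractions of spectral energy below / at-or-above split_hz.

    Uses a naive DFT over the non-negative frequency bins, which is
    adequate for a short illustrative signal.
    """
    n = len(samples)
    low = 0.0
    high = 0.0
    for k in range(n // 2 + 1):
        # DFT coefficient for bin k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        freq = k * sample_rate / n  # centre frequency of bin k in Hz
        if freq < split_hz:
            low += energy
        else:
            high += energy
    total = low + high
    if total == 0.0:
        return 0.0, 0.0
    return low / total, high / total


# Hypothetical test signal (assumption): a 200 Hz tone plus a weaker 6 kHz tone.
sample_rate = 16000
n = 320  # 50 Hz bin spacing, so both tones fall exactly on a bin
signal = [
    math.sin(2 * math.pi * 200 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 6000 * t / sample_rate)
    for t in range(n)
]

low, high = band_energy_fractions(signal, sample_rate, split_hz=5000.0)
print(f"energy below 5 kHz: {low:.2f}, at or above 5 kHz: {high:.2f}")
```

In this synthetic case, a system that discards everything above the split point would lose the entire contribution of the high-frequency component, which is the kind of loss at issue when speaker-specific attributes are carried in the high frequency range.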
Yet, despite the above-noted differences in the spectral content of speech, the same spectral information has traditionally been used in both speaker recognition and speech recognition technologies. In general, both types of systems have typically relied on spectral information in the frequency range of 0 to 4 kHz.
Thus, while advancements have been made in computer-assisted speaker recognition during the past two decades, contemporary speaker recognition systems are nevertheless subject to a variety of problems. These problems include: (i) sample variability due to inter-session variations in a speaker's voice; (ii) sample degradation due to environmental interference, such as room reverberation and room noise; and (iii) imposter attack by persons who can effectively mimic a particular speaker's voice.