Speaker recognition has certain advantages over other forms of identification or verification. Speaker recognition is a general term for voice identification and voice verification. Voice identification identifies a particular individual from a group of individuals based upon a voice input. Voice verification verifies that a voice input belongs to a particular individual. In either process, an individual generally first registers his or her voice by uttering words such as his or her name. To later identify or verify the individual, the individual utters the same or other words, and the voice input is compared to the registered voice to determine a match. Since speaker recognition generally does not require a speaker to memorize any code such as a personal identification number (PIN), the speaker recognition process is user-friendly. Furthermore, conventional identification information such as a PIN is more susceptible to theft and fraud. For example, a criminal can learn a PIN for telephone calls by looking over the shoulder of a bona fide user. Similarly, a wrongdoer can steal the PIN for an automatic teller machine used for banking. Because it substantially reduces such theft and fraud, the above-described user-friendly speaker recognition process is advantageous over the conventional PIN-based identification process.
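The distinction between voice identification and voice verification can be illustrated with a minimal sketch. It assumes that each registered voice has already been reduced to a fixed-length list of numbers and that similarity is measured by Euclidean distance; the function names, the distance measure and the threshold are hypothetical choices made for illustration and are not taken from any particular system.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (an assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(registered, features):
    """Identification: pick the registered individual whose voice is closest to the input."""
    return min(registered, key=lambda name: distance(registered[name], features))

def verify(registered, claimed_name, features, threshold):
    """Verification: accept only if the input is close enough to the claimed individual."""
    return distance(registered[claimed_name], features) <= threshold

# Hypothetical registered voices, reduced to two numbers each for brevity.
registered = {"alice": [1.0, 2.0], "bob": [5.0, 5.0]}
voice_input = [1.1, 2.1]

who = identify(registered, voice_input)
accepted = verify(registered, "alice", voice_input, threshold=0.5)
rejected = verify(registered, "bob", voice_input, threshold=0.5)
```

Identification searches the whole group of registered individuals, whereas verification compares the input against a single claimed individual, so only verification needs an acceptance threshold in this sketch.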
Despite the above-described advantages, implementing a reliable speaker recognition system faces a number of difficulties. The difficulties generally originate from the nature of the voice data. First, the voice data requires a large amount of memory. Furthermore, the voice input is susceptible to changes that occur over time, that depend on certain physical conditions, or that are caused by input devices. Human voice is not constant over a long period of time. Human voice is also affected by a speaker's physical condition such as a cold. Lastly, during the digital conversion, input devices such as a microphone and an analog-to-digital converter affect the digital voice data.
To overcome these problems, a series of voice parameters is extracted as intermediate voice characteristic information from the digitally converted voice input data, and then final voice characteristic information such as a voice characteristic pattern is generated from the extracted voice parameter values. The voice parameters include a spectrum, a cepstrum and an LPC (linear prediction coefficient). Briefly, the spectrum is used to obtain a formant frequency of the vocal tract based upon frequency-transformed voice data. The cepstrum is the result of applying a Fourier transform or an inverse Fourier transform to the logarithm of the spectrum, and the formant frequency reflects both the vocal tract and the vocal cords. An LPC is a coefficient obtained using the linear prediction method based upon the assumption that the vocal tract does not generate any antiresonance. Regardless of which of the above-described parameters is used, five to fifteen voice parameters are determined for each time period, and the time period generally ranges from 10 to 20 milliseconds. Based upon the above-described voice parameters, the voice characteristic pattern is generated as final voice characteristic information to save storage space as well as to improve the reliability of the speaker recognition process. For a more detailed description of these voice parameters, "Digital Voice Processing" (in Japanese) by Sadaoki Furui (1985) is incorporated herein by external reference.
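As an illustration of extracting a small set of voice parameters for each time period, the following sketch splits digitized voice data into 20 millisecond frames and computes ten linear prediction coefficients per frame using the autocorrelation method with the Levinson-Durbin recursion. The sampling rate of 8000 Hz, the non-overlapping frames, the omission of windowing and pre-emphasis, the prediction order of ten and all function names are assumptions made for illustration only.

```python
import math

def split_frames(samples, sample_rate, frame_ms=20):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    size = int(sample_rate * frame_ms / 1000)
    if size <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
            for lag in range(max_lag + 1)]

def lpc(frame, order=10):
    """Levinson-Durbin recursion: coefficients a[1..order] with x[n] ~ sum_k a[k] * x[n-k]."""
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order          # silent frame: nothing to predict
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
        if error <= 0.0:              # numerically exhausted; leave higher orders at zero
            break
    return a[1:]

def extract_features(samples, sample_rate, order=10):
    """One vector of `order` linear prediction coefficients per 20 ms frame."""
    return [lpc(frame, order) for frame in split_frames(samples, sample_rate)]

# Hypothetical input: 50 ms of a synthetic two-tone signal sampled at 8000 Hz.
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(400)]
features = extract_features(signal, 8000)
```

Under these assumptions, one second of voice data sampled at 8000 Hz is reduced from 8000 samples to 50 frames of ten values each, which is the kind of storage saving that motivates working with voice parameters rather than the raw digital voice data.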
In using the above-described speaker recognition process in a computer or telephone network, additional concerns need to be addressed. A computer network generally includes a large number of independent processing units such as personal computers. In a commercial application, the speaker recognition provider must be able to communicate with these independent processing units at remote sites. For example, in a banking transaction system, a central computer must be able to communicate with automatic teller machines and identify or verify a user based upon a voice input in a reliable and speedy manner. However, none of the relevant prior art is directed to the network application of speaker recognition. For example, Japanese Patent 1-302297 is directed to a certain security feature of speaker recognition, while Japanese Patent 57-104193 is directed to updating the previously registered voice data.
The current invention is thus directed to a method and system for recognizing a speaker based upon a voice input over a network in a reliable and efficient manner.