The invention relates to a speaker recognition system suitable for identifying or verifying the speaker of an input voice at an on-line terminal or the like. More particularly, it is directed to a speaker recognition system using a neural network.
The term "speaker recognition" means the recognition of a speaker from an input voice and comes in two forms: speaker identification and speaker verification.
The term "speaker identification" means a judgment on who an input voice represents among registered speakers, while the term "speaker verification" means a judgment on whether or not the input voice can be recognized as the voice of a registered speaker.
Conventional speaker recognition systems are proposed in, e.g., Japanese Patent Examined Publication No. 13956/1981 and the Transactions of the Institute of Electronics and Communication Engineers of Japan, Nov. 1973, Vol. 56-A No. 11 (Reference 1).
The result of supplementary tests conducted on the conventional speaker recognition system disclosed in Reference 1 will be described with reference to FIG. 1.
The high-frequency components of an input voice are removed by a 4.2 kHz low-pass filter (LPF) (Step 101), and the filtered signal is sampled at a rate of 10 kHz and quantized to 16 bits (Step 102). Then, blocks of 25.6 msec are extracted at intervals of 12.8 msec to form frames (Step 103). After being multiplied by a Hamming window (Step 104), the input voice is subjected to a PARCOR (partial autocorrelation) analysis. A block containing the voice sound is then detected, and the pitch and the PARCOR coefficients are extracted (Step 105). From the analysis result, an average, a standard deviation, and a correlation matrix are calculated (Step 106), and a feature quantity specific to a speaker included in the input voice is extracted from these data (Step 107).
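By way of illustration only, Steps 103 to 105 may be sketched as follows. This is a minimal sketch and not the implementation of Reference 1. At a sampling rate of 10 kHz, a 25.6 msec block corresponds to 256 samples and a 12.8 msec interval to 128 samples. The analysis order `PARCOR_ORDER`, all function names, and the use of the Levinson-Durbin recursion are assumptions introduced here; the reflection coefficients it returns correspond to PARCOR coefficients only up to a sign convention. Voiced-block detection and pitch extraction are omitted.

```python
import math
import random

SAMPLE_RATE_HZ = 10_000                        # sampling rate of Step 102
FRAME_LEN = round(0.0256 * SAMPLE_RATE_HZ)     # 25.6 msec -> 256 samples
FRAME_SHIFT = round(0.0128 * SAMPLE_RATE_HZ)   # 12.8 msec -> 128 samples
PARCOR_ORDER = 10                              # assumed; not given in the source


def split_frames(samples, frame_len=FRAME_LEN, shift=FRAME_SHIFT):
    """Step 103: cut overlapping blocks of frame_len samples every shift samples."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, shift)]


def hamming(n):
    """Step 104: Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    return [
        sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        for lag in range(max_lag + 1)
    ]


def reflection_coefficients(r, order):
    """Levinson-Durbin recursion on autocorrelation r; returns order coefficients."""
    a = [1.0] + [0.0] * order   # prediction-error filter coefficients
    err = r[0]                  # prediction error energy
    ks = []
    for i in range(1, order + 1):
        if err <= 0.0:          # silent or degenerate frame: nothing left to predict
            ks.append(0.0)
            continue
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return ks


if __name__ == "__main__":
    rng = random.Random(0)
    signal = [rng.uniform(-1.0, 1.0) for _ in range(1024)]  # stand-in for sampled speech
    window = hamming(FRAME_LEN)
    for frame in split_frames(signal):
        windowed = [s * w for s, w in zip(frame, window)]
        coeffs = reflection_coefficients(
            autocorrelation(windowed, PARCOR_ORDER), PARCOR_ORDER
        )
    print(len(coeffs))
```

The per-frame coefficients obtained in this way would then feed the statistics of Step 106 (average, standard deviation, and correlation matrix).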
Then, distances between the standard patterns of respective registered speakers which have similarly been extracted in advance and an input evaluation pattern are calculated (Step 108).
For speaker identification, a speaker who corresponds to the standard pattern whose distance from the input evaluation pattern is the shortest is judged to be the speaker of the input voice, while for speaker verification, the speaker of the input voice is judged to be an unregistered speaker if the distances from the standard patterns of all the speakers exceed a predetermined threshold (Step 109).
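The judgment of Steps 108 and 109 may be sketched as follows, again as an illustration under stated assumptions rather than the method of Reference 1. The Euclidean distance is assumed here because the distance measure is not specified above; the function names, speaker labels, and numeric values are hypothetical.

```python
import math


def distance(pattern_a, pattern_b):
    """Assumed distance measure (Euclidean) between two feature patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pattern_a, pattern_b)))


def identify(evaluation_pattern, standard_patterns):
    """Speaker identification: return the registered speaker whose standard
    pattern is closest to the input evaluation pattern."""
    return min(
        standard_patterns,
        key=lambda name: distance(evaluation_pattern, standard_patterns[name]),
    )


def verify(evaluation_pattern, standard_patterns, threshold):
    """Speaker verification: the speaker is judged unregistered (False) only if
    the distances from ALL standard patterns exceed the threshold."""
    return any(
        distance(evaluation_pattern, ref) <= threshold
        for ref in standard_patterns.values()
    )


# Hypothetical standard patterns for two registered speakers.
standards = {"speaker_a": [0.0, 0.0], "speaker_b": [3.0, 4.0]}

print(identify([0.1, 0.0], standards))
print(verify([10.0, 10.0], standards, threshold=1.0))
```

In this sketch, identification always returns some registered speaker, whereas verification can reject the input, which is why the threshold appears only in the latter.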
Further, the feature quantity disclosed in Japanese Patent Examined Publication No. 13956/1981 includes a correlation between spectral parameters calculated from an input voice, an average of the respective parameters, and a standard deviation.
However, the conventional speaker recognition systems suffer a decline in recognition rate as time elapses (e.g., hours or days from the creation of the standard patterns) if only a single word is used for their judgment. Reference 1 presents an exemplary case where the speaker identification rate decreases from 100% to 85% and the speaker verification rate decreases from 99% to 91% three months after the creation of the standard patterns.
To ensure acceptable rates, a plurality of words (about 4 words) must be input, which makes feature quantity extraction and distance calculation disadvantageously time-consuming (about 30 seconds) and, in turn, makes real-time processing difficult.