Automatic algorithms for speaker recognition were developed as early as 1963. In one early study by S. Pruzansky, described in "Pattern-Matching Procedure For Automatic Talker Recognition", J. Acoust. Soc. Amer., vol. 35, pp. 354-358, 1963, the information contained in digital spectrograms of words excerpted from sentences was used as the feature set in comparing utterances of different speakers. In a further study on the same data, a methodology was proposed for selecting features formed by averaging the speech energy within rectangular areas of the spectrograms of selected words. The general conclusion was that the effectiveness of such features is reduced by averaging across frequency and increased by averaging across time. Despite the counterintuitive nature of these findings, the merit of long-term averaging was confirmed in a more recent and much larger study by S. Furui, "Comparison Of Speaker Recognition Methods Using Statistical Features And Dynamic Features", IEEE Trans. ASSP, vol. 29, pp. 342-350, June 1981.
In 1974, studies investigated the use of linear predictive parameters as features for speaker recognition. This work was motivated by the observation that the predictor coefficients, together with pitch and voicing, were sufficient to regenerate high-quality synthetic speech. A minimum distance rule was used for recognition and the inference vectors were long-term time averages of the various linear predictive parameters. Cepstral coefficients were found to give the best accuracy. On clean speech, speaker recognition methods based on linear prediction were shown in one study by R. E. Wohlford, E. H. Wrench, and B. P. Landell, "A Comparison Of Four Techniques For Automatic Speaker Recognition", Proc. IEEE ICASSP, no. 3, pp. 908-911, 1980, to outperform methods based on power spectral or cepstral analysis. This work was done on studio-quality speech, and more recent work indicates that this conclusion does not hold for telephone speech. A difficulty in applying linear predictive analysis to telephone speech is that such analysis is highly sensitive to narrow-bandwidth signals (tones), even when they are of very low amplitude.
In 1985, P. Rajasekaran and G. Doddington, "Speech Recognition In The F-16 Cockpit Using Principal Spectral Components", Proc. ICASSP-85, vol. 2, p. 882, 1985, described a method that performs linear prediction analysis by the autocorrelation method. It then performs Fourier spectral analysis on the impulse response of the prediction filter, and computes a filterbank representation of the power spectral envelope. The principal spectral components method was later applied to speaker verification by J. Naik and G. Doddington, "Evaluation Of High Performance Speaker Verification System For Access Control", Proc. ICASSP, pp. 2392-2395, 1987. Improved accuracy was obtained by replacing the principal components transformation with a "clustering" transformation. Similar findings have been reported in other studies. The motivation of this approach is that it selects features based on their discrimination power rather than their variance.
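The analysis chain named at the start of the preceding paragraph (autocorrelation-method linear prediction, Fourier analysis of the prediction filter, filterbank of the power spectral envelope) can be sketched as follows. This is a minimal illustration, not the cited system: the function names, the uniformly spaced bands, the toy decaying frame, and the omission of windowing and of the principal components step are all assumptions made for brevity.

```python
import cmath
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (windowing omitted here)."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err) where a[0] == 1 and the prediction error is
    e[n] = sum_k a[k] * x[n - k]; err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + reflection * prev[i - k]
        a[i] = reflection
        err *= 1.0 - reflection * reflection
    return a, err

def power_envelope(a, err, n_bins):
    """Power spectral envelope err / |A(e^jw)|^2 on n_bins points in [0, pi].

    The coefficients a are the impulse response of the FIR prediction-error
    filter, so A is simply their Fourier transform.
    """
    envelope = []
    for b in range(n_bins):
        w = math.pi * b / (n_bins - 1)
        response = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        envelope.append(err / abs(response) ** 2)
    return envelope

def filterbank(envelope, n_bands):
    """Average the envelope in uniformly spaced bands (an assumed layout)."""
    size = len(envelope) // n_bands
    return [sum(envelope[j * size:(j + 1) * size]) / size
            for j in range(n_bands)]

# Hypothetical toy frame: a decaying exponential, which a first-order
# predictor models almost exactly.
frame = [0.5 ** n for n in range(32)]
a, err = levinson_durbin(autocorrelation(frame, 2), order=2)
envelope = power_envelope(a, err, n_bins=64)
bands = filterbank(envelope, n_bands=8)
```

For this toy frame the second predictor coefficient comes out essentially zero, and the band energies fall with frequency, as expected for a low-pass envelope.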
A number of studies have been done to select a relatively small set of features that efficiently characterizes a speaker. In one study, it was found that the fundamental frequency parameters led all others in terms of discrimination ratio. This conclusion was shown to be invalid in a later study, which found practically no discrimination power in fundamental frequency. The later study showed the importance of using speech data in which appreciable time elapsed between the training and test samples. One study in the area of speech recognition by C. Olano, "An Investigation Of Spectral Match Statistics Using A Phonetically Marked Data Base", Proc. ICASSP-83, 1983, describes an evaluation procedure for developing a filterbank and match statistic to be used in a keyword recognition system.
In general, the research work in speaker recognition was focused at an early stage on developing "compact" and "efficient" measurements and models. A principal problem is how to handle extremely large numbers of statistics without exceeding the capabilities of speech processing machines, despite the enormous increases in computer memory and processing capacity of machines today. In the long run, however, efforts to develop parsimonious representations may have only limited progress. Instead, a primary goal should be to improve performance, as opposed to reducing computation.
The early work in speaker recognition, such as that of Pruzansky discussed above, represented speakers as points in a multi-dimensional measurement space. New speech material to be recognized was processed to form another point in this same space. Recognition was performed by determining the distance to each speaker, and selecting the speaker with the minimum distance. A significant variation of the minimum-distance approach was reported by K. P. Li and E. H. Wrench, "An Approach To Text-Independent Speaker Recognition With Short Utterances", Proc. IEEE ICASSP, pp. 262-265, April 1983. Each speaker was modeled as a set of 40 points in the measurement space. These 40 points were frames from the speaker's training material that were chosen based on histograms of their distances to other training frames of the same speaker and different speakers. The system used a multi-modal, rather than a uni-modal, speaker model, minimized the effect of speech events not observed in the training data, and chose each speaker's model to discriminate that speaker from other speakers.
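The basic minimum-distance scheme described at the start of the preceding paragraph can be sketched in a few lines. The sketch assumes each speaker's point is the long-term average of that speaker's frames and that distance is Euclidean; neither choice is stated for the early systems, and the speaker names and two-dimensional data are hypothetical.

```python
import math

def mean_vector(frames):
    """Average a list of equal-length feature vectors into one point."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize_min_distance(speaker_points, test_frames):
    """Return the speaker whose reference point is nearest the test point."""
    test_point = mean_vector(test_frames)
    return min(speaker_points,
               key=lambda name: euclidean(speaker_points[name], test_point))

# Hypothetical two-dimensional training frames for two speakers
training = {
    "alice": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    "bob": [[4.0, 0.5], [4.2, 0.3], [3.8, 0.7]],
}
speaker_points = {name: mean_vector(frames)
                  for name, frames in training.items()}
result = recognize_min_distance(speaker_points, [[1.1, 2.1], [0.9, 1.9]])
```

Because each speaker is reduced to a single point, this model is uni-modal; the Li and Wrench variant replaces that single point with a set of selected frames.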
A similar method developed by F. Soong, A. Rosenberg, L. Rabiner, and B. Juang, "A Vector Quantization Approach To Speaker Recognition", Proc. ICASSP-85, Vol. 1, pp. 387-390, 1985, was based on vector quantization. A vector quantization codebook was created for each speaker from a clustering analysis of the training data. Test speech was then encoded using all codebooks, accumulating the coding distortion for each frame. The speaker whose codebook gave the lowest average distortion was recognized. A. L. Higgins and R. E. Wohlford, "A New Method Of Text Independent Speaker Recognition", Proc. ICASSP-86, vol. 2, no. 1, April 1986, developed a generalization of this method in which the individual frames in the codebook were replaced by multiple-frame segments. Recognition was performed by a template-matching algorithm, where the multiple-frame segments were used as templates. This method was compared with the vector quantization method and shown to give better text-independent speaker recognition accuracy.
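The vector quantization decision rule (encode the test speech with every speaker's codebook and pick the lowest average distortion) can be illustrated as below. This is a sketch under stated assumptions: plain k-means stands in for the unspecified clustering analysis, distortion is squared Euclidean distance, and the helper names and synthetic data are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebook(frames, size, iterations=20, seed=0):
    """Build a codebook with k-means (an assumed clustering method)."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, size)]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for f in frames:
            nearest = min(range(size), key=lambda j: sq_dist(f, codebook[j]))
            cells[nearest].append(f)
        for j, cell in enumerate(cells):
            if cell:  # keep the old codeword if its cell is empty
                codebook[j] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def average_distortion(codebook, frames):
    """Mean, over frames, of the distance to the nearest codeword."""
    return sum(min(sq_dist(f, c) for c in codebook)
               for f in frames) / len(frames)

def recognize_vq(codebooks, test_frames):
    """Return the speaker whose codebook encodes the test frames best."""
    return min(codebooks,
               key=lambda name: average_distortion(codebooks[name], test_frames))

# Hypothetical synthetic data: two clusters of frames per speaker
data_rng = random.Random(1)

def cloud(centers, n):
    return [[c + data_rng.gauss(0, 0.1) for c in center]
            for center in centers for _ in range(n)]

vq_training = {
    "alice": cloud([[0.0, 0.0], [1.0, 1.0]], 30),
    "bob": cloud([[3.0, 0.0], [4.0, 1.0]], 30),
}
codebooks = {name: train_codebook(frames, size=2)
             for name, frames in vq_training.items()}
vq_result = recognize_vq(codebooks, cloud([[0.0, 0.0], [1.0, 1.0]], 5))
```

Each codeword acts as one mode of the speaker model, which is why the number of training frames behind any given mode depends entirely on how the clustering happens to partition the data.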
The weaknesses of all these variants of minimum-distance multi-modal modeling are: (1) they require estimation of many free parameters, (2) the number of training examples associated with each mode is uncontrolled, and (3) the significance of deviations between the input speech and the modes is unknown. The template-matching method is found to be quite sensitive to noise and other channel distortions.
The first use of statistical models in speaker recognition was reported in a study by A. B. Poritz, "Linear Predictive Hidden Markov Models And The Speech Signal", Proc. ICASSP-82, pp. 1291-1294, 1982. Speech production was modeled as an underlying, or hidden, Markov process in which outputs were generated with various probability density functions (PDFs) depending on the state of the model. Parameters of this hidden Markov model were assumed to be characteristic of the speaker.
In work by R. Schwartz, S. Roucos, and M. Berouti, statistical models were developed in which each speaker was characterized by a single multivariate PDF of individual frame feature vectors. They applied standard statistical pattern recognition procedures to estimation of PDFs and classification of speakers. In one method, each speaker's PDF was assumed to be multivariate Gaussian, and its parameters, the mean and the covariance matrix, were estimated. The probability of each unknown frame was evaluated using this PDF, and the product of probabilities over all unknown frames was computed. The result was an estimate, assuming independent frames, of the likelihood of the observed unknown being produced by the model speaker.
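The Gaussian classifier just described can be sketched as follows. The product of frame probabilities is accumulated as a sum of log densities, which is equivalent for ranking speakers and avoids numerical underflow. The function names, the use of a Cholesky factor, the unbiased covariance estimate, and the toy data are assumptions of this sketch rather than details from the cited work.

```python
import math

def estimate_gaussian(frames):
    """Estimate the mean vector and full covariance matrix of the frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(d)]
    cov = [[sum((f[i] - mean[i]) * (f[j] - mean[j]) for f in frames) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def frame_log_density(x, mean, low):
    """Log of the multivariate Gaussian density at x, given the factor low."""
    d = len(x)
    z = []  # solves low * z = x - mean by forward substitution
    for i in range(d):
        z.append((x[i] - mean[i]
                  - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    mahalanobis_sq = sum(v * v for v in z)
    return -0.5 * (d * math.log(2 * math.pi) + log_det + mahalanobis_sq)

def utterance_log_likelihood(frames, model):
    """Sum of frame log densities, i.e. the log of the product of densities."""
    mean, cov = model
    low = cholesky(cov)
    return sum(frame_log_density(x, mean, low) for x in frames)

# Hypothetical two-dimensional training frames for two speakers
gauss_training = {
    "alice": [[1.0, 2.0], [1.4, 2.1], [0.8, 1.7], [1.2, 2.4], [0.9, 1.8]],
    "bob": [[4.0, 0.5], [4.3, 0.9], [3.7, 0.4], [4.1, 0.2], [3.9, 0.6]],
}
gauss_models = {name: estimate_gaussian(frames)
                for name, frames in gauss_training.items()}
unknown = [[1.1, 2.0], [1.0, 2.1]]
gauss_result = max(gauss_models,
                   key=lambda name: utterance_log_likelihood(unknown, gauss_models[name]))
```

Unlike a minimum-distance score, the value returned by `utterance_log_likelihood` has a probabilistic interpretation, subject to the frame-independence assumption noted above.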
Another method tested was a non-parametric PDF estimation technique. This technique computed the distance between each unknown frame and all the frames of training data for a given speaker. The likelihood of the frame was estimated using Parzen's method multiplied by a weighting factor. The Parzen estimate of the local density is inversely proportional to the volume of a hypersphere centered at the test point and just enclosing the kth-nearest training frame, where k is chosen depending on the total number of training frames. Again, the product of probabilities over all unknown frames was computed to produce an estimate, assuming independent frames, of the likelihood of the observed unknown being produced by the model speaker. The two probabilistic classifiers produced comparable recognition accuracy, which was clearly better than that of a minimum-distance method. However, it is found that the accuracy of Parzen estimates deteriorates rapidly as the dimensionality of the measurement space increases, so that the non-parametric method is not found to be better than the parametric method.
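The local density estimate described above, inversely proportional to the volume of the hypersphere reaching the kth-nearest training frame, can be written as k / (N V). The sketch below assumes Euclidean distance and omits the weighting factor, whose form is not given; the function names and toy data are hypothetical.

```python
import math

def knn_log_density(x, training_frames, k):
    """Log of k / (N * V), where V is the volume of the d-dimensional
    hypersphere centered at x that just encloses the kth-nearest frame."""
    d = len(x)
    distances = sorted(math.dist(x, f) for f in training_frames)
    radius = max(distances[k - 1], 1e-12)  # guard against a zero radius
    log_volume = ((d / 2) * math.log(math.pi)
                  - math.lgamma(d / 2 + 1)
                  + d * math.log(radius))
    return math.log(k) - math.log(len(training_frames)) - log_volume

def knn_utterance_log_likelihood(frames, training_frames, k):
    """Sum of frame log densities, assuming independent frames."""
    return sum(knn_log_density(x, training_frames, k) for x in frames)

# Hypothetical two-dimensional training frames for two speakers
knn_training = {
    "alice": [[1.0, 2.0], [1.4, 2.1], [0.8, 1.7], [1.2, 2.4], [0.9, 1.8]],
    "bob": [[4.0, 0.5], [4.3, 0.9], [3.7, 0.4], [4.1, 0.2], [3.9, 0.6]],
}
knn_unknown = [[1.1, 2.0], [1.0, 2.1]]
knn_result = max(knn_training,
                 key=lambda name: knn_utterance_log_likelihood(
                     knn_unknown, knn_training[name], k=2))
```

The `d * math.log(radius)` term shows why the estimate becomes fragile in high dimensions: a small relative error in the radius is magnified by the dimensionality d.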
Recently, R. Rose and D. Reynolds, "Text Independent Speaker Identification Using Automatic Acoustic Segmentation", Proc. ICASSP, pp. 293-296, 1990, reported development of a speaker identification algorithm based on parametric PDF estimation using a Gaussian mixture model. The density model is a weighted linear combination of Gaussian densities, each with a common diagonal covariance matrix. For a given number of component Gaussian densities, the algorithm estimates the weights, means, and variance terms using an unsupervised iterative maximum likelihood technique. The simulation results were excellent, outperforming the standard unimodal Gaussian model by a large margin.
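A mixture density of this kind can be fitted with the expectation-maximization (EM) procedure, which is one common unsupervised iterative maximum likelihood technique; whether it matches the cited algorithm in detail is an assumption. The sketch reads "common diagonal covariance" as a single diagonal covariance shared by all components. The farthest-point initialization, the variance floor, the function names, and the synthetic data are choices made for this illustration only.

```python
import math
import random

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance var."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def farthest_point_init(frames, n_components):
    """Deterministic start: each new mean is the frame farthest from the rest."""
    means = [list(frames[0])]
    while len(means) < n_components:
        far = max(frames, key=lambda f: min(math.dist(f, m) for m in means))
        means.append(list(far))
    return means

def fit_gmm_shared_diag(frames, n_components, iterations=30, var_floor=1e-6):
    """EM for a Gaussian mixture whose components share one diagonal covariance."""
    n, d = len(frames), len(frames[0])
    means = farthest_point_init(frames, n_components)
    weights = [1.0 / n_components] * n_components
    centre = [sum(f[i] for f in frames) / n for i in range(d)]
    var = [max(var_floor, sum((f[i] - centre[i]) ** 2 for f in frames) / n)
           for i in range(d)]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each frame
        resp = []
        for x in frames:
            logs = [math.log(w) + log_gauss_diag(x, m, var)
                    for w, m in zip(weights, means)]
            norm = log_sum_exp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means, and the shared variances
        counts = [max(1e-12, sum(r[j] for r in resp))
                  for j in range(n_components)]
        weights = [c / n for c in counts]
        means = [[sum(r[j] * x[i] for r, x in zip(resp, frames)) / counts[j]
                  for i in range(d)] for j in range(n_components)]
        var = [max(var_floor,
                   sum(r[j] * (x[i] - means[j][i]) ** 2
                       for r, x in zip(resp, frames)
                       for j in range(n_components)) / n)
               for i in range(d)]
    return weights, means, var

def gmm_log_likelihood(frames, model):
    """Sum of frame log densities under the mixture (independent frames)."""
    weights, means, var = model
    return sum(log_sum_exp([math.log(w) + log_gauss_diag(x, m, var)
                            for w, m in zip(weights, means)])
               for x in frames)

# Hypothetical synthetic frames drawn around two centers
gmm_rng = random.Random(2)
gmm_frames = [[c + gmm_rng.gauss(0, 0.3) for c in center]
              for center in ([0.0, 0.0], [5.0, 5.0]) for _ in range(40)]
gmm_model = fit_gmm_shared_diag(gmm_frames, n_components=2)
```

For identification, one such model would be fitted per speaker and the speaker with the highest `gmm_log_likelihood` on the test frames selected, in the same way as for the unimodal Gaussian model.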
Speaker verification was first tested using telephone channels in 1975 by A. B. Rosenberg, "Evaluation Of An Automatic Speaker-Verification System Over Telephone Lines", Bell System Tech. J., vol. 55, pp. 723-744, 1976. The system's initial rejection rates were about 20-25%, probably because enrollment was done in a single session. As the system was used, template adaptation reduced the error rate to a stable value of about 5%. A telephone line impairment that was observed to be definitely detrimental was pulse-like background noise. Such noises can be induced at the telephone exchange by switching currents in neighboring equipment. An excellent report on the signal degradations resulting from telephone transmission and their effects on speech recognition was written by L. S. Moye, in "Study Of The Effects On Speech Analysis Of The Types Of Degradation Occurring In Telephony", STL Monograph, pp. 1-238, Standard Telecommunication Laboratories, Harlow, England, July 1979.
H. Gish, et al., in "Methods And Experiments Of Text-Independent Speaker Recognition Over Telephone Channels", Proc. ICASSP-86, April 1986, reported on text-independent speaker recognition using the Gaussian modeling technique discussed above. When a speaker's testing data was taken from the same telephone call as the training data, accuracy in one test was 90%. When the same test was repeated using training and test data from different calls, accuracy decreased to 61%. About one-third of this loss was recovered through a combination of channel modeling, in which a correction factor was applied to the model covariance used in the mean term of the likelihood equation, and experimental adjustment of the weights assigned to the mean and covariance terms. Recently, Gish developed a speaker identification method based on a metric for comparison of covariance matrices, described in "Robust Discrimination In Automatic Speaker Identification", Proc. ICASSP, pp. 289-292, 1990. Computationally, the metric involves a summation over a number of terms. Gish reports on the performance of a family of such metrics, which differ in the terms included in this summation. Excellent closed-set identification performance is shown for the best metric. There is, however, no way to determine a priori which specific metric is best to use in a particular case. A further difficulty is that the method is a minimum-distance method, providing no information about the likelihood of the observations for a given speaker.