Conventional spectrograms have, for decades, been employed in the forensic context as a means of inferring the identity of a speaker. However, aspects of the aural-spectrographic method have been challenged by a number of speech scientists; early reports of 99% accuracy in identifying speakers from just four words have not been replicated; and an early report commissioned by the F.B.I. in 1979 warned that the assumption of interspeaker variability exceeding intraspeaker variability was not adequately supported by scientific theory and data. As a result, the use of conventional spectrograms for voice identification has not received general approval in court proceedings.
The reassigned spectrogram, a lesser-known method of imaging the time-frequency spectral information contained in a signal, offers some distinct advantages over the conventional spectrogram. Reassigned spectrograms are able to show the instantaneous frequencies of signal components as well as the occurrence of impulses with increased precision compared to conventional spectrograms (i.e. the magnitude of the short-time Fourier transform (STFT) or other calculated transform). Computed from the partial phase derivatives (with respect to time and frequency) of the transform, such spectrograms are here shown to reveal unique features of an individual's phonatory process by “zooming in” on a few glottal pulsations during a vowel. These images can thus highlight the individuating information in the signal and exclude most distracting information. To date, no attempt has been made to apply this newer technology to the problems of speaker identification or verification.
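By way of illustration only, the relocation of spectrogram energy to instantaneous frequencies and impulse times can be sketched as follows. This sketch uses the auxiliary-window formulation, in which the phase derivatives are obtained from two additional transforms (one with a time-weighted window, one with a differentiated window). That formulation is an assumption chosen for brevity and is not necessarily the computation employed herein. The function name `reassigned_points` and all parameter values (window length, hop, FFT size) are hypothetical.

```python
import numpy as np

def reassigned_points(x, fs, n_win=256, hop=32, n_fft=1024):
    """Return reassigned times (s), reassigned frequencies (Hz) and STFT
    magnitudes, one value per (frame, bin) cell of the conventional STFT."""
    t = np.arange(n_win) - (n_win - 1) / 2.0   # sample offsets from window center
    h = np.hanning(n_win)                      # analysis window
    th = t * h                                 # time-weighted window
    dh = np.gradient(h)                        # window derivative (per sample)

    starts = np.arange(0, len(x) - n_win + 1, hop)
    frames = np.stack([x[s:s + n_win] for s in starts])

    X_h = np.fft.rfft(frames * h, n_fft, axis=1)
    X_th = np.fft.rfft(frames * th, n_fft, axis=1)
    X_dh = np.fft.rfft(frames * dh, n_fft, axis=1)

    power = np.abs(X_h) ** 2
    safe = np.where(power > 0, power, 1.0)     # avoid division by zero
    centers = (starts + (n_win - 1) / 2.0)[:, None]    # frame centers (samples)
    bins = np.fft.rfftfreq(n_fft, d=1.0 / fs)[None, :]  # bin frequencies (Hz)

    # Time correction: real part of X_th / X_h, in samples.
    t_hat = (centers + np.real(X_th * np.conj(X_h)) / safe) / fs
    # Frequency correction: imaginary part of X_dh / X_h, in rad/sample.
    f_hat = bins - np.imag(X_dh * np.conj(X_h)) / safe * fs / (2 * np.pi)
    return t_hat, f_hat, np.abs(X_h)

# Example: a steady tone that does not fall on an FFT bin center.
fs = 8000
tone = np.cos(2 * np.pi * 1003.0 * np.arange(4000) / fs)
t_hat, f_hat, mag = reassigned_points(tone, fs)
```

Plotting each cell's magnitude at its relocated coordinates (`t_hat`, `f_hat`), rather than at the frame center and bin frequency, gives the reassigned image. A conventional spectrogram would plot the same magnitudes on the fixed grid.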
One writer in 1970 envisioned “a device with a display emphasizing those sound features that are most dependent on the speaker. The patterns could then be judged with greater confidence by human experts.” No such device has been developed or described prior to the one discussed herein. Another author discussed movies of vocal fold vibration which “show large variations in the movement of the vocal folds from one individual to another.” It is precisely these individual differences that are at the root of the methods set forth herein.
Speaker identification and verification may be divided into two fundamentally different approaches. The earliest approach was initially christened “voiceprinting” (Kersta 1962), but has been subsumed under the rubric of the “aural-spectrographic method” (e.g. Rose 2002). The voiceprinting technique attempted to use spectrograms of words or longer utterances to identify a person by the overall appearance of the spectrogram. The method has recently suffered withering criticism from the vast majority of credible experts. Major points against the procedure that have been singled out are: (i) no self-described voiceprinting expert was ever able to adequately state or outline the specific criteria that were used to declare a match between a pair of spectrographic voiceprints; (ii) spectrograms of words show too much linguistic information (which is largely the same across different speakers), and not enough known individuating information; (iii) because there is no articulated theory of how to match such voiceprints, the procedure is inextricably tied to the human expert, and could never be automated by any conceivable procedure. In the end, Rose (2002) went so far as to claim that there is no way for a voiceprinting procedure to work even in principle, because no one has demonstrated that a person's voice provides a single individuating biometric image—analogous to a fingerprint—by any known computational procedure.
A more recent approach to speaker identification and verification was developed by leveraging the statistical sound-matching techniques originally developed for automatic speech recognition. In this approach, continuous speech is analyzed frame by frame (each frame being about 25 ms long), but a spectrogram is never utilized. Instead, a cepstrum is computed for each frame, yielding a relatively small number (on the order of 24) of values called cepstral coefficients. A sophisticated statistical model, known as a Gaussian mixture model, is then computed from the frames by techniques now standard in the art (Quatieri 2002). To apply the technique to speaker identification, a Gaussian mixture model is computed for each speaker in the database. Statistical comparison techniques are then applied to new speech in order to determine whether there is a probable matching speaker in the database. Recent improvements to the technology employ a modified cepstrum, known as the mel-frequency cepstrum, which adheres to the human auditory frequency analysis scale.
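The frame-wise analysis and the statistical comparison described above can be sketched as follows, under stated assumptions. The sketch computes a plain real cepstrum (not the mel-frequency variant) and scores features against a diagonal-covariance Gaussian mixture whose parameters are taken as given. Fitting those parameters, typically by expectation-maximization, is omitted. The function names, the 10 ms hop, and the Hamming window are hypothetical choices and do not come from the cited references.

```python
import numpy as np

def cepstral_coefficients(x, fs, frame_ms=25.0, hop_ms=10.0, n_coef=24):
    """Real cepstrum of each frame; returns an array (n_frames, n_coef)."""
    n_win = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    window = np.hamming(n_win)
    starts = np.arange(0, len(x) - n_win + 1, hop)
    frames = np.stack([x[s:s + n_win] for s in starts]) * window
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    cepstrum = np.fft.irfft(log_spec, n=n_win, axis=1)
    return cepstrum[:, 1:n_coef + 1]          # drop the zeroth (energy) term

def gmm_avg_log_likelihood(feats, weights, means, variances):
    """Mean per-frame log-likelihood under a diagonal-covariance mixture.
    feats: (T, D); weights: (K,); means and variances: (K, D)."""
    diff = feats[:, None, :] - means[None, :, :]
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)
    log_mix = np.log(weights) + log_comp       # (T, K)
    peak = log_mix.max(axis=1, keepdims=True)  # log-sum-exp for stability
    per_frame = peak[:, 0] + np.log(np.sum(np.exp(log_mix - peak), axis=1))
    return float(np.mean(per_frame))

def best_matching_speaker(feats, speaker_models):
    """speaker_models maps a name to a (weights, means, variances) tuple."""
    scores = {name: gmm_avg_log_likelihood(feats, *params)
              for name, params in speaker_models.items()}
    return max(scores, key=scores.get), scores
```

In an identification setting, `best_matching_speaker` would be called with the features of the new speech and one fitted mixture per enrolled speaker. A verification setting would instead compare a single speaker's score against a threshold.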
The Gaussian mixture modeling approach to speaker identification using mel-frequency cepstral coefficients is in many ways everything that the voiceprinting process was not. It is completely specified and algorithmically defined. It is a completely automatic procedure requiring no human intervention (and indeed, provides no opportunity for human interpretation of the results). The process does not compute anything analogous to a voiceprint; instead, a complete statistical model of the speaker's speech is developed. Images are not compared. The process is also reasonably successful, with nearly 100% accuracy on small speaker populations recorded under excellent conditions. Its accuracy breaks down considerably under more realistic acoustics, however, and it has generally proven impossible to achieve equal error rates (the operating point at which the false positive and false negative rates are equal) of less than 10% under such “normal” conditions.
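For concreteness, an equal error rate can be estimated from two sets of verification scores, as in the following minimal sketch. It assumes that a higher score indicates a more probable match and that a trial is accepted when its score is at or above the threshold. The function name `equal_error_rate` is hypothetical.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return
    (eer, threshold) where the two error rates are closest to equal."""
    genuine = np.asarray(genuine, dtype=float)    # same-speaker trial scores
    impostor = np.asarray(impostor, dtype=float)  # different-speaker trial scores
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False positive rate: impostor trials accepted at each threshold.
    fpr = np.array([(impostor >= t).mean() for t in thresholds])
    # False negative rate: genuine trials rejected at each threshold.
    fnr = np.array([(genuine < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(fpr - fnr)))
    return (fpr[i] + fnr[i]) / 2.0, thresholds[i]
```

With a finite number of trials the two rates rarely coincide exactly, so the sketch reports the average of the two rates at the threshold where they are closest.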
The Gaussian mixture modeling of speakers using mel-frequency cepstral coefficients appears to represent the current state of the art in speaker identification. Efforts to improve the procedure have yielded some incremental improvements over the baseline, but the whole paradigm appears to have reached a performance ceiling below that which is acceptable for most identification and verification purposes, such as forensic analysis or secured access control. As promising as the Gaussian approach may be, something completely different is called for to augment this procedure at the very least (and perhaps replace it entirely under appropriate circumstances).
One proposal for closing the performance gap involves somehow extracting information about the “fine structure” of the phonation cycle (Plumpe et al. 1999) to augment the broad Gaussian speaker model obtained from the longer utterances. In this approach, some promising speaker identification results were obtained by simply computing the glottal flow derivative and submitting it to automatic classification. However, no spectral analysis was performed at this short time resolution.
It is therefore desirable to provide voice identification and verification methods that achieve a high degree of accuracy (minimal false positive and minimal false negative matches) under normal acoustical conditions.