1. Field of the Invention
This invention relates generally to acoustical signal analysis and, more particularly, to a system and method for verifying the identity of a person, recognizing speech or sounds, and/or normalizing communication channels through dynamic time/frequency warping.
2. Description of the Related Art
Speech or sound patterns may be represented by a set of measurements of acoustic features or spectral models. Such measurements may be used for various purposes, such as verifying the identity of a speaker, recognizing speech or speech patterns (e.g., "keyword spotting"), and normalizing channels through which acoustic signals are transmitted.
In general, different speakers will produce sounds having different spectral components, because the size and shape of the vocal tract dictate the location and bandwidth of speech formants. A phoneme spoken twice by the same person should have about the same spectral information. However, if the same phoneme is spoken by different people, the spectral information will be distorted due to differences in the size and shape of the two vocal tracts.
In the past, comparisons have been made between a reference utterance of a speaker and a test utterance to determine whether or not the two utterances were spoken by the same individual. This is often referred to as speaker verification. Similarly, such comparisons have been used to recognize particular words (including "keyword spotting") spoken by any individual. This is often referred to as speech recognition. However, such comparisons have been insufficient to achieve meaningful speaker verification and speech recognition.
Such comparisons have been insufficient because they failed to address the problems associated with analyzing such distorted spectral information. An utterance which is to be recognized or verified is generally more complex than a steady sound. Therefore, speech patterns often involve sequences of short-time acoustic representations, and pattern comparison techniques for speech recognition and speaker verification must compare sequences of acoustic features. Problems associated with temporal sequence comparison for speech come from the fact that different acoustic renditions of the same speech utterance, such as words, phrases or sentences, are seldom realized at the same temporal rate (speaking rate) across the entire utterance. Therefore, when comparing different acoustic representations of the same utterance, speaking rate variations as well as duration variations should not contribute to the comparison of the test utterance to the reference utterance. Accordingly, there is a need to normalize speaking rate fluctuations in order for utterance comparison to yield proper speech recognition, speaker verification and/or channel normalization.
One well known and widely used statistical method of characterizing the spectral properties of the frames of a pattern is known as the hidden Markov model (HMM) approach. Another approach which has been widely used to determine pattern dissimilarity measurements is known as Dynamic Time Warping (DTW).
Essentially, DTW seeks to normalize time differences between the test utterance and the reference utterance so as to better align the two utterances in time for comparison. DTW uses dynamic programming optimization techniques to achieve some degree of enhanced temporal alignment.
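By way of illustration only, the dynamic programming recursion underlying a conventional DTW comparison may be sketched as follows. This is a minimal sketch under stated assumptions and not the method of any particular reference: the function names `frame_distance` and `dtw_distance`, the use of a Euclidean distance between frames, and the three permitted steps (diagonal, vertical, horizontal) are all chosen here for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed local measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(reference, test):
    """Accumulated cost of the best time alignment between two utterances.

    Each utterance is a list of frames; each frame is a list of floats.
    """
    n, m = len(reference), len(test)
    inf = float("inf")
    # cost[i][j] holds the best accumulated cost of aligning the first i
    # reference frames with the first j test frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], test[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # both utterances advance one frame
                cost[i - 1][j],      # only the reference advances
                cost[i][j - 1],      # only the test advances
            )
    return cost[n][m]
```

In this sketch, a test utterance that differs from the reference only in speaking rate, for example frames `[[1.0], [1.0], [2.0], [3.0], [3.0]]` against `[[1.0], [2.0], [3.0]]`, yields an accumulated cost of zero, which is the sense in which time differences are normalized.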
Dynamic programming is an extensively studied and widely used tool in operations research for solving various decisional problems. In general, dynamic programming is used to arrive at an optimal path and score (value) associated with sequential decisional problems. A frequently cited example of such an optimization problem is "the traveling salesman problem."
The DTW and HMM techniques, as well as the dynamic programming techniques, are described in greater detail in Rabiner, Lawrence et al., Fundamentals of Speech Recognition, Prentice Hall: New Jersey, 1993.
Time alignment using standard DTW methods, however, still fails to properly align the test and reference utterances so as to achieve accurate comparison of these utterances. In particular, when two utterances are spoken by different speakers, DTW methods have great difficulty in aligning speech due to differences between the test and reference data in the frequency domain. Mismatches in time alignment create poor warping functions and therefore skew the data so as to impede meaningful comparison between the utterances. The resulting time-warped features are skewed in such a way as to lead to the conclusion that two vocal tracts are more similar than they really are (due to minimization errors in the DTW algorithm). Accordingly, a speaker verification system could potentially grant access to impostors in view of these errors associated with DTW techniques. Moreover, such a speaker verification system could falsely reject users that should be granted access, particularly due to changing channel (e.g. telephone and telephone line) characteristics.
In the past, there have been various attempts to use both HMM and DTW for speaker verification. For example, in U.S. Pat. No. 3,700,815, an automatic speaker verification system using non-linear time alignment of acoustic parameters is disclosed. Similarly, in U.S. Pat. No. 4,344,031, a method and device for verifying signals, especially speech signals, is disclosed. However, these references fail to alleviate the problems associated with improper alignment resulting from DTW analysis.
A graphical representation of DTW analysis is presented in FIGS. 1A, 1B and 2. FIG. 1A is a graphical depiction of the features of a single reference utterance. As can be seen in FIG. 1A, such features are represented in a frequency versus time graph. Similarly, test features can be depicted graphically as shown in FIG. 1B. However, in view of the fact that there are time distortions between the test and reference utterances, DTW achieves some degree of alignment between the reference and test features to yield the warped features of FIG. 2. The warped features of FIG. 2 are achieved through the "stretching" and "shrinking" in time of features by DTW to make the test utterance best match the reference utterance. Once such DTW analysis is completed, the warped features can be compared to the reference features to achieve speaker verification, speech recognition, channel normalization, etc. However, as can be seen in FIG. 2, there are numerous mis-alignments in the frequency domain which are not properly addressed by DTW techniques.
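The "stretching" and "shrinking" in time described above may be sketched as follows, for illustration only. The sketch assumes a Euclidean frame distance, recovers the alignment path by backtracking through the accumulated cost table, and averages the test frames aligned to each reference frame. The function names `dtw_path` and `warp_test_to_reference`, and the averaging rule, are assumptions made here and are not taken from the figures. Note that the sketch warps only the time axis; it does nothing to correct misalignments in the frequency domain.

```python
def dtw_path(reference, test):
    """Return the alignment path as (reference_index, test_index) pairs."""
    n, m = len(reference), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between frames (assumed local measure).
            d = sum((x - y) ** 2
                    for x, y in zip(reference[i - 1], test[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the end of both utterances to the start, always moving
    # to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def warp_test_to_reference(reference, test):
    """Warp the test frames onto the reference time axis.

    Both utterances are assumed to be non-empty lists of equal-width frames.
    """
    aligned = [[] for _ in reference]
    for ref_idx, test_idx in dtw_path(reference, test):
        aligned[ref_idx].append(test[test_idx])
    warped = []
    for frames in aligned:
        width = len(frames[0])
        # Several test frames on one reference frame: "shrink" by averaging.
        # One test frame reused on several reference frames: "stretch".
        warped.append([sum(f[k] for f in frames) / len(frames)
                       for k in range(width)])
    return warped
```

The warped output has one frame per reference frame, so it can be compared with the reference features frame by frame.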
Accordingly, when the frequency characteristics of the features are not taken into account, DTW fails to align the reference features with the test features well enough to achieve an accurate comparison between the reference utterance and the test utterance. Significant errors may therefore result from using DTW for speaker verification, speech recognition, etc. In view of these difficulties, many have abandoned the use of DTW techniques.