This description relates to dynamic time warping of speech.
Speech is a time-dependent process of high variability. One variability is the duration of a spoken word. Multiple utterances of a particular word by a single speaker may have different durations. Even when utterances of the word happen to have the same duration, a particular part of the word will often have different durations among the utterances. Durations also vary between speakers for utterances of a given word or part of a word.
Speech processing, for example, speech recognition, often involves comparing two instances of a word, such as comparing an uttered word to a model of a word. The durational variations of uttered words and parts of words can be accommodated by a non-linear time warping designed to align speech features of two speech instances that correspond to the same acoustic events before comparing the two speech instances. Dynamic time warping (DTW) is a dynamic programming technique suitable for matching patterns that are time dependent. (See, for example, chapter 3 of “Isolated Word Recognition Using Reduced Connectivity Neural Networks With Non-Linear Time Alignment Methods”, PhD dissertation of Mary Jo Creaney-Stockton, BEng., MSc. Department of Electrical and Electronic Engineering, University of Newcastle-Upon-Tyne, August 1996, http://www.moonstar.com/~morticia/thesis/chapter3.html.)
The result of applying DTW is a measure of similarity of a test pattern (for example, an uttered word) and a reference pattern (e.g., a template or model of a word). Each test pattern and each reference pattern may be represented as a sequence of vectors. The two speech patterns are aligned in time and DTW measures a global distance between the two sequences of vectors.
As shown in FIG. 1, a time-time matrix 10 illustrates the alignment process. The uttered word is represented by a sequence of feature vectors 12 (also called frames) arrayed along the horizontal axis. The template or model of the word is represented by a sequence of feature vectors 14 (also called frames) arrayed along the vertical axis. The feature vectors are generated at intervals of, for example, 0.01 sec (i.e., 100 feature vectors per second). Each feature vector captures properties of the speech within an analysis window that is typically 20-30 msec long. Properties of the speech signal generally do not change significantly within the duration of the analysis window (i.e., 20-30 msec). The analysis window is shifted by 0.01 sec to capture the properties of the speech signal at each successive time instance. Details of how a raw signal may be represented as a set of features are provided in: L. R. Rabiner and R. W. Schafer, “Digital Processing of Speech Signals”, Prentice-Hall, 1978, Chapter 4: Time-Domain Methods for Speech Processing, pp. 117-171, and Lawrence Rabiner and Biing-Hwang Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993, Chapter 3: Signal Processing and Analysis Methods for Speech Recognition, pp. 69-139.
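The windowing described above can be sketched as follows. This is a minimal illustration only: the function name frame_signal, the 25 msec window (one value within the stated 20-30 msec range), and the 16 kHz sampling rate are assumptions, and the step that turns each window into a feature vector is omitted.

```python
def frame_signal(samples, sample_rate, window_sec=0.025, shift_sec=0.010):
    """Split a 1-D sequence of samples into overlapping analysis windows.

    window_sec=0.025 is an assumed value inside the 20-30 msec range;
    shift_sec=0.010 gives about 100 windows per second.
    """
    window = int(round(window_sec * sample_rate))  # samples per window
    shift = int(round(shift_sec * sample_rate))    # samples between window starts
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the signal.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
frames = frame_signal([0.0] * 16000, 16000)
print(len(frames), len(frames[0]))
```

Each returned window would then be converted to one feature vector (one frame of the time-time matrix) by a feature extraction step such as those described in the cited references.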
In the example of FIG. 1, the utterance is SsPEEhH, a noisy version of the template SPEECH. The utterance SsPEEhH will typically be compared to all other templates (i.e., reference patterns or models that correspond to other words) in a repository to find the template that is the best match. The best matching template is deemed to be the one that has the lowest global distance from the utterance, computed along a path 16 that best aligns the utterance with a given template, i.e., produces the lowest global distance of any alignment path between the utterance and the given template. By a path, we mean a series of associations between frames of the utterance and corresponding frames of the template. The complete universe of possible paths includes every possible set of associations between the frames. A global distance of a path is the sum of local distances for each of the associations of the path.
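To make the definition of a path's global distance concrete, the following sketch sums local distances over a given list of frame associations. The Euclidean local distance and the names local_distance and path_global_distance are assumptions made for illustration; the description does not prescribe a particular local distance measure.

```python
import math


def local_distance(a, b):
    # Assumed local distance: Euclidean distance between two feature vectors.
    return math.dist(a, b)


def path_global_distance(utterance, template, path):
    """Sum the local distances over every (utterance frame, template frame) pair."""
    return sum(local_distance(utterance[i], template[j]) for i, j in path)


# Toy one-dimensional "feature vectors" (hypothetical values).
utterance = [(0.0,), (1.0,), (1.0,)]
template = [(0.0,), (1.0,)]

good_path = [(0, 0), (1, 1), (2, 1)]  # aligns matching frames
poor_path = [(0, 0), (1, 0), (2, 1)]  # associates utterance frame 1 with the wrong template frame

print(path_global_distance(utterance, template, good_path))
print(path_global_distance(utterance, template, poor_path))
```

The path with the lower sum is the better alignment; the best-matching template is the one whose best path has the lowest sum.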
One way to find the path that yields the best match (i.e., lowest global distance) between the utterance and a given template is by evaluating all possible paths in the universe. That approach is time consuming because the number of possible paths grows exponentially with the length of the utterance. The matching process can be shortened by requiring that (a) a path cannot go backwards in time (i.e., to the left or down in FIG. 1), (b) a path must include an association for every frame in the utterance, and (c) local distance scores are combined by adding to give a global distance.
For the moment, assume that a path must include an association for every frame in the template and every frame in the utterance. Then, for a point (i, j) in the time-time matrix (where i indexes the utterance frame, j the template frame), the previous point must have been (i−1, j−1), (i−1, j) or (i, j−1) (because of the prohibition of going backward in time).
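Even with only these three permitted predecessors, the number of distinct paths grows quickly, which is why enumerating them is impractical. The sketch below counts the monotonic paths from (1, 1) to (M, N); the function name count_paths is hypothetical and the code is illustrative only.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_paths(m, n):
    """Count paths from (1, 1) to (m, n) using steps from (i-1, j-1), (i-1, j) or (i, j-1)."""
    if m < 1 or n < 1:
        return 0  # outside the time-time matrix
    if m == 1 and n == 1:
        return 1  # the start cell itself
    return count_paths(m - 1, n - 1) + count_paths(m - 1, n) + count_paths(m, n - 1)


for size in (3, 4, 10, 20):
    print(size, count_paths(size, size))
```

For a 3-by-3 matrix there are 13 such paths and for a 4-by-4 matrix there are 63; the count keeps multiplying as the matrix grows, whereas dynamic programming visits each cell only once.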
The principle of dynamic programming (DP) is that the best path to a point (i, j) extends the best path to whichever of its permitted predecessors, (i−1, j−1), (i−1, j) or (i, j−1), has the lowest global distance. DTW refers to this application of dynamic programming to speech recognition. DP finds the lowest distance path through the matrix, while minimizing the amount of computation. The DP algorithm operates in a time-synchronous manner by considering each column of the time-time matrix in succession (which is equivalent to processing the utterance frame-by-frame). For a template of length N (corresponding to an N-row matrix), the maximum number of paths being considered at any time is N. Utterance feature vector i is compared to all reference template feature vectors, 1 . . . N, thus generating a vector of corresponding local distances d(i, 1), d(i, 2), . . . d(i, N).
If D(i, j) is the global distance up to and including point (i, j), and the local distance at (i, j) is given by d(i, j), then

D(i, j) = min[D(i−1, j−1), D(i−1, j), D(i, j−1)] + d(i, j)  (1)
Given that D(1, 1)=d(1, 1) (this is the initial condition), we have the basis for an efficient recursive algorithm for computing D(i, j). The final global distance D(M, N) at the end of the path gives the overall lowest matching score of the template with the utterance, where M is the number of vectors of the utterance. The utterance is then recognized as the word corresponding to the template with the lowest matching score. (Note that N may be different for different templates.)
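The recurrence and its initial condition can be sketched directly by filling the whole time-time matrix. This is a minimal illustration, not a definitive implementation: the function name dtw_distance and the Euclidean local distance are assumptions, and the indices are 0-based, so D[0][0] plays the role of D(1, 1) and D[M−1][N−1] the role of D(M, N).

```python
import math


def dtw_distance(utterance, template):
    """Global DTW distance using the three-predecessor recurrence."""
    M, N = len(utterance), len(template)
    INF = math.inf
    D = [[INF] * N for _ in range(M)]
    for i in range(M):          # utterance frames
        for j in range(N):      # template frames
            d = math.dist(utterance[i], template[j])  # assumed local distance
            if i == 0 and j == 0:
                D[i][j] = d     # initial condition
                continue
            best_prev = min(
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,
                D[i - 1][j] if i > 0 else INF,
                D[i][j - 1] if j > 0 else INF,
            )
            D[i][j] = best_prev + d
    return D[M - 1][N - 1]


# Toy one-dimensional feature vectors (hypothetical values).
template = [(0,), (1,), (2,)]
stretched = [(0,), (0,), (1,), (2,), (2,)]   # same pattern, different durations
other = [(0,), (3,)]

print(dtw_distance(stretched, template))
print(dtw_distance(other, template))
```

The stretched utterance aligns with the template at zero global distance even though the two sequences have different lengths, which is the durational variation that time warping is meant to absorb.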
For basic speech recognition, DP has a small memory requirement. The only storage required by the search (as distinct from storage required for the templates) is an array that holds a single column of the time-time matrix.
Equation 1 enforces the rule that the only directions in which a path can move when at (i, j) in the time-time matrix are up, right, or diagonally up and right. Computationally, equation 1 is in a form that could be recursively programmed. However, unless the language is optimized for recursion, this method can be slow even for relatively small pattern sizes. Another method that is both quicker and requires less memory storage uses two nested “for” loops. This method only needs two arrays that hold adjacent columns of the time-time matrix.
Referring to FIG. 2, the algorithm to find the least global distance path is as follows (Note that, in FIG. 2, which shows a representative set of rows and columns, the cells at (i, j) 22 and (i, 0) have different possible originator cells. The path to (i, 0) 24 can originate only from (i−1, 0). But the path to any other (i, j) can originate from the three standard locations 26, 28, 29):
1. Calculate the global distance for the bottom most cell of the left-most column, column 0. The global distance up to this cell is just its local distance. The calculation then proceeds upward in column 0. The global distance at each successive cell is the local distance for that cell plus the global distance to the cell below it. Column 0 is then designated the predCol (predecessor column).
2. Calculate the global distance to the bottom most cell of the next column, column 1 (which is designated the curCol, for current column). The global distance to that bottom most cell is the local distance for that cell plus the global distance to the bottom most cell of the predecessor column.
3. Calculate the global distance of the rest of the cells of curCol. For example, at cell (i, j) this is the local distance at (i, j) plus the minimum global distance at either (i−1, j), (i−1, j−1) or (i, j−1).
4. curCol becomes predCol and step 2 is repeated until all columns have been calculated.
5. Minimum global distance is the value stored in the top most cell of the last column.
The pseudocode for this process is:
calculate first column (predCol)
for i=1 to (number of input feature vectors − 1)
    curCol[0] = local cost at (i,0) + global cost at (i−1,0)
    for j=1 to (number of template feature vectors − 1)
        curCol[j] = local cost at (i,j) + minimum of global costs at (i−1,j), (i−1,j−1) or (i,j−1)
    end
    predCol = curCol
end
minimum global cost is value in curCol[number of template feature vectors − 1]
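A runnable rendering of this two-column procedure might look like the following. It is a sketch under stated assumptions: the names dtw_two_columns, pred_col and cur_col mirror predCol and curCol but are otherwise hypothetical, the local cost is taken to be Euclidean distance, and columns and rows are numbered from 0.

```python
import math


def dtw_two_columns(utterance, template, dist=math.dist):
    """Minimum global cost using only two columns of the time-time matrix."""
    n = len(template)

    # Column 0: each cell is its local cost plus the global cost of the cell below it.
    pred_col = [0.0] * n
    pred_col[0] = dist(utterance[0], template[0])
    for j in range(1, n):
        pred_col[j] = pred_col[j - 1] + dist(utterance[0], template[j])

    for i in range(1, len(utterance)):
        cur_col = [0.0] * n
        # The bottom cell can only be reached from the bottom cell of the previous column.
        cur_col[0] = pred_col[0] + dist(utterance[i], template[0])
        for j in range(1, n):
            cur_col[j] = dist(utterance[i], template[j]) + min(
                pred_col[j],      # (i-1, j)
                pred_col[j - 1],  # (i-1, j-1)
                cur_col[j - 1],   # (i, j-1)
            )
        pred_col = cur_col  # the current column becomes the predecessor

    return pred_col[n - 1]  # top most cell of the last column


template = [(0,), (1,), (2,)]
print(dtw_two_columns([(0,), (0,), (1,), (2,), (2,)], template))
print(dtw_two_columns([(0,), (3,)], template))
```

Because a fresh list is allocated for each column, assigning pred_col = cur_col is safe; an implementation that reuses two fixed buffers would swap them instead.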
To perform recognition on an utterance, the algorithm is repeated for each template. The template file that gives the lowest global matching score is picked as the most likely word.
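The per-template repetition can be sketched as a loop that scores the utterance against each template and keeps the lowest score. The dictionary-of-templates layout, the word labels, the function names dtw and recognize, and the Euclidean local distance are all assumptions made for illustration.

```python
import math


def dtw(utterance, template):
    """Two-column DTW global distance (Euclidean local distance assumed)."""
    pred = []
    for j, t in enumerate(template):
        d = math.dist(utterance[0], t)
        pred.append(d if j == 0 else pred[j - 1] + d)
    for u in utterance[1:]:
        cur = [pred[0] + math.dist(u, template[0])]
        for j in range(1, len(template)):
            cur.append(math.dist(u, template[j]) + min(pred[j], pred[j - 1], cur[j - 1]))
        pred = cur
    return pred[-1]


def recognize(utterance, templates):
    """Return the word whose template has the lowest global score, plus all scores."""
    scores = {word: dtw(utterance, frames) for word, frames in templates.items()}
    best_word = min(scores, key=scores.get)
    return best_word, scores


# Hypothetical two-word repository with toy one-dimensional frames.
templates = {
    "up": [(0,), (1,), (2,)],
    "down": [(2,), (1,), (0,)],
}
word, scores = recognize([(0,), (0,), (1,), (2,)], templates)
print(word, scores)
```

Only the ordering of the scores matters here: the winner is whichever template scores lowest, regardless of the absolute values.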
Note that the minimum global matching score for a template need be compared only with a relatively limited number of alternative score values representing other minimum global matching scores (that is, even though there may be 1000 templates, many templates will share the same value for minimum global score). All that is needed for a correct recognition is the best matching score to be produced by a corresponding template. The best matching score is simply the one that is relatively the lowest compared to other scores. Thus, we may call this a “relative scoring” approach to word matching. The situation in which two matches share a common score can be resolved in various ways, for example, by a tie breaking rule, by asking the user to confirm, or by picking the one that has lowest maximal local distance. However, the case of ties is irrelevant for recognizing a single “OnWord” and practically never occurs.
This algorithm works well for tasks having a relatively small number of possible choices, for example, recognizing one word from among 10-100 possible ones. The average number of alternatives for a given recognition cycle of an utterance is called the perplexity of the recognition.
However, the algorithm is not practical for real-time tasks that have nearly infinite perplexity, for example, correctly detecting and recognizing a specific word/command phrase (for example, a so-called wake-up word, hot word or OnWord) from all other possible words/phrases/sounds. It is impractical to have a corresponding model for every possible word/phrase/sound that is not the word to be recognized. And absolute values of matching scores are not suited to select correct word recognition because of wide variability in the scores.
More generally, templates against which an utterance is to be matched may be divided between those that are within a vocabulary of interest (called in-vocabulary or INV) and those that are outside a vocabulary of interest (called out-of-vocabulary or OOV). Then a threshold can be set so that an utterance that yields a test score below the threshold is deemed to be INV, and an utterance that has a score greater than the threshold is considered not to be a correct word (OOV). Typically, this approach can correctly recognize less than 50% of INV words (correct acceptance) and treats 50% of uttered words as OOV (false rejection). On the other hand, using the same threshold, about 5%-10% of utterances of OOV words would be recognized as INV (false acceptance).
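The threshold decision and the three rates named above can be sketched as follows. The score values and the threshold are made-up toy numbers chosen only to exercise the code, not measurements; the function names are hypothetical, and treating a score exactly equal to the threshold as OOV is an assumption, since the description leaves that case open.

```python
def classify(score, threshold):
    """Deem an utterance INV if its matching score is below the threshold, else OOV."""
    return "INV" if score < threshold else "OOV"


def acceptance_rates(inv_scores, oov_scores, threshold):
    """Return (correct acceptance, false rejection, false acceptance) as fractions."""
    correct_accept = sum(s < threshold for s in inv_scores) / len(inv_scores)
    false_reject = 1 - correct_accept
    false_accept = sum(s < threshold for s in oov_scores) / len(oov_scores)
    return correct_accept, false_reject, false_accept


# Toy scores (hypothetical): utterances of INV words and of OOV words.
inv_scores = [1.0, 2.0, 5.0, 6.0]
oov_scores = [4.0, 7.0, 8.0, 9.0, 10.0]

rates = acceptance_rates(inv_scores, oov_scores, threshold=4.5)
print(rates)
```

Lowering the threshold reduces false acceptances but rejects more INV utterances, and raising it does the reverse, which is the trade-off that a single absolute threshold on widely varying scores cannot resolve well.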