The present invention relates to speech recognition and more particularly to an improved method of aligning an unknown speech segment with a reference or model speech segment.
As is understood by those skilled in the speech recognition art, differences in manner and speed of speaking, not only from one speaker to another but also between different instances of speech by the same speaker, require that some procedure be utilized for aligning an unknown speech segment with each of the vocabulary models with which it is to be compared before determining a score or measure representing the likelihood of match between the unknown and each model. These procedures are sometimes referred to as time warping. One common method of performing this alignment or matching is so-called Viterbi decoding.
Typically, the unknown speech segment is represented by a sequence of multi-dimensional frames and each of the vocabulary model segments is represented by a sequence of states which may themselves be frames or probability distributions of frames. The frames may, for example, comprise essentially instantaneous spectra of the speech sound but it should be understood that other characteristics, such as LPC coefficients, might also be used. In order to obtain a meaningful match calculation, the sequences of frames and states must be relatively finely graduated. For example, a single vocabulary word or model may be represented by a sequence comprising on the order of 50 states. The number of states will of course vary from word to word. The unknown speech sequences will also typically comprise a similar number of frames.
A comparison between an unknown speech segment and a model segment can thus be thought of as a matrix, and the alignment process can be considered as determining a best path through the matrix, i.e. a path which results in the best possible score for the matching of the unknown with that particular model. A path is essentially a sequence of frame/state pairs which satisfies certain constraints: successive frame/state pairs are within a necessary grid distance of each other so that there is continuity along the path; time must not be reversed, i.e. the path cannot go back on itself; and all of the input and all of the model should be accounted for, i.e. the search is typically to determine a best path from the origin to the diagonally opposite corner of the matrix.
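By way of illustration only, the matrix view of alignment may be sketched as a conventional dynamic-programming recurrence. The sketch below is not taken from any particular recognizer: the function name `align_cost`, the squared Euclidean local cost, and the three permitted moves (advance the frame, advance the state, or advance both) are all assumptions chosen to satisfy the continuity, no-reversal and corner-to-corner constraints just described.

```python
def align_cost(frames, states):
    """Cumulative cost of the best path from the origin (0, 0) to the
    diagonally opposite corner of the frame/state matrix.

    frames, states: non-empty lists of equal-length numeric vectors.
    Assumption: states are plain frames and the local cost is the squared
    Euclidean distance; a real model might hold probability distributions.
    """
    def local_cost(frame, state):
        return sum((a - b) ** 2 for a, b in zip(frame, state))

    n, m = len(frames), len(states)
    inf = float("inf")
    best = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = local_cost(frames[i], states[j])
            if i == 0 and j == 0:
                best[i][j] = d  # every path starts at the origin
                continue
            # Continuity and no time reversal: a cell is reached only from
            # its left, lower, or lower-left neighbour.
            prev = min(
                best[i - 1][j] if i > 0 else inf,
                best[i][j - 1] if j > 0 else inf,
                best[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            best[i][j] = d + prev
    return best[n - 1][m - 1]


# An unknown that dwells on its first sound still aligns perfectly,
# because the path may stay in the first state for two frames.
print(align_cost([[0], [0], [1], [2]], [[0], [1], [2]]))  # prints 0
```

Every one of the n-by-m cells is visited in this sketch, so it corresponds to the exhaustive search rather than to any of the reduced searches.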
To rigorously determine a best path through the matrix, it is essentially necessary to calculate the cost of each possible transition from one matrix location to its neighbors and to then calculate the cumulative costs of various paths through the matrix. In actual practice, the computational cost of exhaustive or rigorous determination of a best path can be practically prohibitive and, accordingly, various schemes have been proposed for limiting the search space. It has, for example, been proposed to limit the search area to a corridor which is of fixed width about a simple diagonal from the origin to the far corner. Other predetermined corridor shapes have also been proposed. With each of these schemes, however, there is substantial risk that, if the corridor is made narrow enough to appreciably reduce the level of computation required, the accuracy of the resultant score may be impaired, since there is an appreciable likelihood that the best path will lie outside the corridor. In other words, speed is considerably improved by the use of a narrow corridor but likelihood of error is also substantially increased. Conversely, if a very broad corridor is implemented, the decrease in computation required may not be appreciable.
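The trade-off inherent in a fixed-width corridor may be illustrated by the following sketch. It is a hypothetical example rather than any published scheme: the name `corridor_align`, the squared Euclidean local cost, the three permitted moves, and the measurement of corridor width in states on either side of the straight diagonal are all assumptions. The function returns both the resulting score and the number of matrix cells actually evaluated.

```python
import math


def corridor_align(frames, states, width):
    """Best-path cost restricted to a fixed-width corridor about the diagonal.

    Returns (cost, cells_evaluated). Cells farther than `width` states from
    the straight diagonal are never evaluated and cannot lie on the path.
    """
    n, m = len(frames), len(states)
    inf = float("inf")
    best = [[inf] * m for _ in range(n)]
    evaluated = 0
    for i in range(n):
        # State index at which the straight diagonal crosses frame i.
        centre = i * (m - 1) / (n - 1) if n > 1 else 0.0
        lo = max(0, math.ceil(centre - width))
        hi = min(m - 1, math.floor(centre + width))
        for j in range(lo, hi + 1):
            d = sum((a - b) ** 2 for a, b in zip(frames[i], states[j]))
            evaluated += 1
            if i == 0 and j == 0:
                best[i][j] = d
                continue
            prev = min(
                best[i - 1][j] if i > 0 else inf,
                best[i][j - 1] if j > 0 else inf,
                best[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            best[i][j] = d + prev
    return best[n - 1][m - 1], evaluated


# The unknown lingers on its first sound while the model leaves it at once,
# so the true best path hugs one edge of the matrix.
unknown = [[0], [0], [0], [0], [1]]
model = [[0], [1], [1], [1], [1]]
print(corridor_align(unknown, model, 1))  # prints (2, 13)
print(corridor_align(unknown, model, 4))  # prints (0, 25)
```

In this contrived case the narrow corridor evaluates 13 of the 25 cells but reports a cost of 2, whereas the corridor wide enough to contain the true best path reports a cost of 0 only by evaluating every cell.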
Besides limitation of the search space, other search techniques have been developed in attempts to reduce the computation required. Among these is the so-called "beam search". In this technique, at each input frame position in the grid, all the scores of the paths from the grid origin (typically the bottom left corner) are compared. Those whose scores are worse than the best score by some threshold are eliminated and not pursued further. This is a "local" decision in that it is based only on the patterns between this point and the origin. It is entirely possible that a path which seems poor may become the overall winner once all the data is accounted for. Path deletions based on local criteria are thus dangerous and the computational saving may come at the cost of lower accuracy.
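The danger of such local pruning may be seen in the following minimal sketch. It assumes, purely for illustration, that the local costs have already been computed into a matrix `cost[i][j]` indexed by input frame and model state, that lower scores are better, and that the same three moves (advance the frame, the state, or both) are permitted; the name `beam_align` is hypothetical. For brevity the sketch does not skip the evaluation of cells reachable only from pruned paths, which is where the computational saving of a practical beam search would arise.

```python
def beam_align(cost, threshold):
    """Frame-by-frame alignment with beam pruning.

    cost[i][j] is the local cost of matching input frame i to model state j.
    After each frame, any partial path whose cumulative score is worse than
    the best score at that frame by more than `threshold` is discarded.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    prev_row = [inf] * m
    for i in range(n):
        row = [inf] * m
        for j in range(m):
            if i == 0 and j == 0:
                row[j] = cost[0][0]
                continue
            pred = min(
                prev_row[j],                      # same state, previous frame
                row[j - 1] if j > 0 else inf,     # previous state, same frame
                prev_row[j - 1] if j > 0 else inf,  # diagonal move
            )
            if pred < inf:  # only live partial paths are extended
                row[j] = cost[i][j] + pred
        if i < n - 1:
            # The local decision: compare all scores at this frame position.
            limit = min(row) + threshold
            row = [s if s <= limit else inf for s in row]
        prev_row = row
    return prev_row[m - 1]


# The path through state 1 at frame 1 looks poor (score 3 against 0)
# but is the only cheap route to the far corner.
cost = [
    [0, 3, 9],
    [0, 3, 9],
    [9, 9, 0],
]
print(beam_align(cost, 10))  # prints 3, the true best score
print(beam_align(cost, 2))   # prints 9, the eventual winner was pruned early
```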
Other schemes, such as the "best-first" (also known as the stack or A*) algorithm, always first pursue the most promising path, frequently reevaluating which is the best path. The hope is that the correct path will be extended all the way across the grid before too much work is expended on the less successful paths. Like the beam search, the best-first algorithm can suffer from the limitations of decisions which are only locally optimal. To overcome this drawback, these techniques have been enhanced by computing an estimate of the score of completing partial paths to the end. In order to reduce computation, these estimates must be inexpensive compared to the cost of computing the actual path score. The considerable overhead required to maintain complex structures, continually compare paths and make complex decisions based on these comparisons makes such search techniques undesirable for the task of limiting the computational cost of aligning an input pattern with a reference model.
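A best-first search of this general kind may be sketched with a priority queue, as follows. The sketch is an assumption-laden illustration and not a description of any particular stack decoder: the name `best_first_align`, the precomputed nonnegative local-cost matrix `cost[i][j]`, the three permitted moves, and the optional `estimate` callback (standing in for the inexpensive estimate of the cost of completing a partial path) are all hypothetical.

```python
import heapq


def best_first_align(cost, estimate=lambda i, j: 0):
    """Best-first search over the frame/state grid.

    cost[i][j]: nonnegative local cost of pairing frame i with state j.
    estimate(i, j): cheap guess at the cost still to come from cell (i, j);
    the default of zero reduces the search to uniform-cost search.
    Returns (path_cost, cells_expanded).
    """
    n, m = len(cost), len(cost[0])
    goal = (n - 1, m - 1)
    # Heap entries: (cost so far + estimate, cost so far, cell).
    heap = [(cost[0][0] + estimate(0, 0), cost[0][0], (0, 0))]
    settled = set()
    expanded = 0
    while heap:
        _, so_far, cell = heapq.heappop(heap)  # most promising partial path
        if cell in settled:
            continue
        settled.add(cell)
        expanded += 1
        if cell == goal:
            return so_far, expanded
        i, j = cell
        for ni, nj in ((i + 1, j), (i, j + 1), (i + 1, j + 1)):
            if ni < n and nj < m and (ni, nj) not in settled:
                total = so_far + cost[ni][nj]
                heapq.heappush(heap, (total + estimate(ni, nj), total, (ni, nj)))
    return float("inf"), expanded


cost = [
    [0, 3, 9],
    [0, 3, 9],
    [9, 9, 0],
]
best, expanded = best_first_align(cost)
print(best, expanded)
```

Even in so small a sketch, the heap, the set of settled cells and the repeated reordering of partial paths indicate the kind of bookkeeping overhead referred to above.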
Among the several objects of the present invention may be noted the provision of an improved speech recognizer; the provision of such a speech recognizer which utilizes a novel method of determining alignment of an unknown speech segment with a model segment; the provision of such a recognizer which requires less computational effort to achieve a very good alignment; the provision of such a speech recognizer which involves very little risk of excluding a good or best alignment of an unknown speech segment with a model; the provision of such a speech recognizer which is highly reliable and which is of relatively simple and inexpensive implementation. Other objects and features will be in part apparent and in part pointed out hereinafter.