This invention relates to speech recognition and speaker verification systems in general and more particularly to a dynamic time warping (DTW) apparatus useful in speech recognition and speaker verification systems.
Speech recognition and speaker verification systems proposed in the prior art operate to recognize isolated or connected utterances by comparing suitably processed unknown audio signals with one or more previously prepared representations of known signals. In keyword spotting, the known or stored signals are sometimes referred to as key words; they are provided by means of stored templates, which are compared with the incoming speech in order to determine a match.
Numerous references in the prior art relate to such systems. Different systems operate on different principles, but essentially each attempts to recognize an unknown audio signal by comparing the signal with stored representations such as templates, as is well known in the art.
One type of system is referred to as word spotting, in which the system responds to incoming speech to detect words of interest. The words of interest are called key words, and their number is usually small. The goal is to determine the instant in time when any key word is spoken and which key word it is. Hence there are many systems directed to the recognition of key words.
Known methods of speech recognition, word spotting and speaker verification use a technique called dynamic time warping (DTW). DTW allows the computer representations of two different utterances of a word to be brought into time alignment with one another. This is done by compressing the time axis of one representation, expanding it, or both compressing and expanding it in different places. The purpose of DTW is to compensate for differences between two utterances in pronunciation or speaking rate. In practice one of the two representations is an example of the word, called a template. DTW is used to measure the similarity between the template and segments of input speech which might be utterances of the same word.
The computer representation of utterances referred to above is as follows. The signal from a microphone is analyzed within contiguous time intervals called frames. The result of the analysis is a vector for each frame that specifies the power spectrum of that frame as a function of frequency. A sequence of such vectors over the period of an utterance is the computer representation of the utterance. A sequence of such vectors over the period of a key word could be used as the template for the key word.
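The frame-by-frame representation described above can be illustrated with a minimal sketch. The function name `frame_power_spectra`, the frame length, the non-overlapping framing without a window function, and the naive discrete Fourier transform are all assumptions made for illustration; the text does not specify how the power spectrum is actually obtained.

```python
import cmath
import math


def frame_power_spectra(samples, frame_len):
    """Split samples into contiguous, non-overlapping frames and return
    one power-spectrum vector per frame (frequency bins 0..frame_len//2).

    Hypothetical helper: the framing and the naive DFT are illustrative
    assumptions, not the analysis used by any particular system.
    """
    vectors = []
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT coefficient for frequency bin k.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            # Power in bin k, normalized by the frame length.
            spectrum.append(abs(coeff) ** 2 / frame_len)
        vectors.append(spectrum)
    return vectors


# Example: a sinusoid centered on bin 2 of an 8-sample frame, three frames long.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(24)]
utterance = frame_power_spectra(signal, frame_len=8)
```

Here `utterance` is a sequence of three vectors, one per frame, and could serve either as the representation of input speech or as a template for a key word.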
Essentially, in such a system, the distance between, for example, a frame of the unknown speech and a frame of a template is measured as the Euclidean distance between their vectors, and this distance is calculated by such systems. A DTW system operates to find the path through the template frames that minimizes the sum of these distances as the speech signal is processed. For each input frame a DTW computation can proceed from the first template frame to the last.
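The minimum-distance path computation described above can be sketched as follows. This is a textbook form of DTW with fixed end points, written under several assumptions that the text does not state: the names `euclidean` and `dtw_distance` are hypothetical, the path may advance by one input frame, one template frame, or both at each step, and no slope weighting or path-length normalization is applied.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length frame vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, speech):
    """Return the minimum summed frame distance over all alignment paths
    from (first template frame, first input frame) to (last, last).

    Illustrative sketch only; the allowed path steps are an assumption.
    """
    if not template or not speech:
        raise ValueError("template and speech must each contain a frame")
    inf = float("inf")
    prev = None  # accumulated distances for the previous input frame
    for s in speech:
        cur = [inf] * len(template)
        # For each input frame, proceed from the first template frame to the last.
        for t, tvec in enumerate(template):
            d = euclidean(tvec, s)
            if prev is None:
                # First input frame: the path must start at template frame 0.
                best = 0.0 if t == 0 else cur[t - 1]
            else:
                best = prev[t]  # repeat the same template frame
                if t > 0:
                    # Advance in both sequences, or in the template only.
                    best = min(best, prev[t - 1], cur[t - 1])
            cur[t] = d + best
        prev = cur
    return prev[-1]


template = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [0.0], [1.0], [2.0], [2.0]]  # same word, spoken more slowly
different = [[0.0], [1.0], [3.0]]
```

With these inputs, `dtw_distance(template, stretched)` is zero because the slower utterance aligns exactly once its time axis is compressed, while `dtw_distance(template, different)` is greater than zero. Only one column of accumulated distances is kept per input frame, which mirrors the frame-by-frame processing described in the text.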
DTW was originally used for the recognition of isolated words with known end points, and this has been discussed in many references. See, for example, an article entitled "An Efficient Elastic Template Method for Detecting Given Key Words in Running Speech" by J. S. Bridle, British Acoustical Society Meeting, pp. 1-4, Apr. 1973. See, also, an article entitled "An Algorithm for Connected Word Recognition" by J. S. Bridle, M. D. Brown and R. M. Chamberlain, published in the Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Paris, France, 1982. Various other prior art references discuss such systems employing dynamic time warping.
An extremely important part of any such system is the circuitry which operates to provide the DTW functions. Such circuitry must, of course, be relatively economical to produce, simple to fabricate, and efficient and reliable in operation.