Substantial effort has heretofore been exerted in developing speech recognition systems which are speaker-dependent or that require each user of the system to enroll his own voice in order to achieve acceptably high recognition performance. A more difficult task has been to develop a speaker-independent speech recognition system which will recognize a person's speech without that particular person being required to supply the system with samples of his speech prior to usage.
Previously developed speech recognition systems have attempted to achieve speaker independence by sampling speech data from a large number of speakers. These data are then either averaged together to form a single representative reference pattern for each word, or clustered into multiple reference patterns in which each pattern is supposed to represent a particular dialectal or acoustical manifestation of the word. Unfortunately, such prior approaches have been relatively ineffectual and have provided generally unacceptable speaker-independent speech recognition performance.
M. R. Sambur and L. R. Rabiner, in the article "A Speaker-Independent Digit-Recognition System," published in the Bell System Technical Journal, Volume 54, No. 1, January, 1975, disclosed an algorithm for speaker-independent recognition of isolated digit words by segmenting the unknown word into three regions and then making categorical judgments as to which of six broad acoustic classes each segment falls into. Digit verification is then provided, since each digit has unique categorical patterns. Preliminary experiments were conducted based on the same technique and a complex connected speech segmentation algorithm, as reported in the article "Some Preliminary Experiments in the Recognition of Connected Digits," IEEE Transactions, ASSP, April, 1976. However, such prior algorithmic approaches have provided only moderate recognition rates.
More recently, improved techniques have been disclosed utilizing algorithms which provide higher word recognition accuracy, such as those disclosed in the articles "Application of Dynamic Time Warping to Connected Digit Recognition," by Rabiner and Schmidt, IEEE Trans, ASSP, August, 1980 and "Connected Digit Recognition Using a Level-Building DTW Algorithm," by Myers and Rabiner, IEEE Trans, ASSP, June, 1981. However, such techniques require relatively large numbers of reference templates for each word and require multiple passes over the input data due to the relatively inefficient level-building time registration techniques.
Previously developed speaker-independent recognition systems have, in general, neglected a fundamental problem which has resulted in ineffective performance. This problem is the inadequacy of the measures of the speech data used to discriminate the basic sounds of speech. The measures normally used are derived from a frame-by-frame analysis of speech. For example, the speech may be modeled as a sequence of steady-state frames, with each frame covering about 20 milliseconds (that is, about 50 frames per second), and the speech signal is represented by a sequence of speech features, with one set of speech features being computed for each frame of speech data.
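The frame-by-frame analysis described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the 8000 Hz sampling rate, the function names, and the two per-frame features (mean energy and zero-crossing count) are assumptions chosen only to show one set of features being computed for each 20 millisecond frame.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Cut a sampled signal into consecutive, non-overlapping frames.

    sample_rate_hz is an assumed value; 20 ms frames at 8000 Hz give
    160 samples per frame, i.e. 50 frames per second.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_features(frame):
    """Compute one illustrative feature set for a single frame.

    The features are hypothetical stand-ins: mean energy and the number
    of sign changes between adjacent samples.
    """
    energy = sum(s * s for s in frame) / len(frame)
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return [energy, float(zero_crossings)]


def feature_sequence(samples, sample_rate_hz=8000, frame_ms=20):
    """Represent the signal as a sequence of per-frame feature vectors."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [frame_features(f) for f in frames]
```

Under these assumptions, one second of signal yields 50 feature vectors, each computed from its own frame without reference to neighboring frames.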
With such prior speech recognition techniques, speech discrimination is thus typically achieved by computing the Euclidean distance between corresponding frames of input and reference feature vectors. This has appeared to be an optimum statistical solution to the problem, assuming that (1) adjacent frames of the speech signal are uncorrelated and (2) the variability of the speech signal is independent of the word or sound which produces it. Unfortunately, both of these assumptions are incorrect and have thus created inaccuracies and unreliability in previous speech recognition techniques. A need has thus arisen for a speech recognition technique which is not based upon the above-noted assumptions and which provides improved speaker-independent speech recognition.
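The prior-art discrimination measure described above can be sketched as follows. This is an illustrative reconstruction rather than code from any cited system: the function names are hypothetical, and the sketch assumes the input and reference sequences have already been time-aligned to the same number of frames. Summing the per-frame distances treats every frame independently and applies the same metric to every word, which corresponds to the two assumptions noted above.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def utterance_distance(input_frames, reference_frames):
    """Sum of per-frame distances between corresponding frames.

    Assumes the two sequences are already time-aligned, so frame i of
    the input corresponds to frame i of the reference.
    """
    if len(input_frames) != len(reference_frames):
        raise ValueError("sequences must have the same number of frames")
    return sum(
        frame_distance(x, y) for x, y in zip(input_frames, reference_frames)
    )


def nearest_reference(input_frames, references):
    """Return the label of the reference pattern with the smallest distance.

    references is a hypothetical mapping from word label to a
    reference feature sequence.
    """
    return min(
        references,
        key=lambda word: utterance_distance(input_frames, references[word]),
    )
```

For example, with references for two words, `nearest_reference` simply returns whichever label has the smaller summed distance; nothing in the measure accounts for correlation between adjacent frames or for word-dependent variability.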