The invention relates generally to speech recognition systems and in particular to a method and apparatus for automatically updating an error compensation signal relating to the characteristics of the speaker and the transfer function between the speaker and the speech recognition system.
It is well known that speech recognition systems contend with many variable factors, such as background noise, the location of the microphone relative to the speaker, the direction in which the speaker is speaking, the context of the speech including the level of emotion in the speaker's voice, the rate of speech, changes due to speaker fatigue, etc. Each of these factors can vary over time and has an adverse effect on the ability of a recognition system to determine or recognize the words or utterances of a (known or unknown) speaker. Accordingly, many different speech recognition approaches have been proposed to correct for or take into account the potential variations which tend to mask the lexical content of the speech signal. Indeed, this variability is one of the reasons why speech recognition is a difficult and challenging problem. These factors are different from the normal variability in the pronunciation of words, for which speech recognition systems apply different recognition techniques.
One particular speech recognition system, described in Feldman et al., U.S. Pat. No. 4,799,262, filed June 22, 1985, and granted Jan. 24, 1989 (the specification of which is incorporated, by reference, in this application), uses a codebook containing a plurality of quantized vectors which reflect the range of sounds a user can produce, and to which unknown incoming speech frames are compared. Sequences of the vectors thus represent the words in the recognition vocabulary. Each input word is assigned a sequence of vectors and the system recognizes words by examining these sequences. Both the generation of the codebook during an enrollment phase and the assignment of the codebook vectors to input speech during training and recognition are influenced by the so-called speaker or system "transfer function." This function can be viewed as, in effect, distorting the ideal, intended output of the speaker by the time the audio output is converted into electrical signals at the receiving microphone. Incorporated in this transfer function are characteristics associated with the speaker's voice type, the characteristics of the microphone, the direction and distance of the speaker from the microphone, and environmental effects. This transfer function changes over time, due to, for example, changes in the position of the microphone or changes in the user's voice. When such changes occur, the input signal no longer closely matches the codebook, incorrect vectors are selected, and recognition performance suffers. Consequently, it is important to track changes in the transfer function and compensate for them.
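The effect of a changed transfer function on codebook lookup can be illustrated with a minimal sketch. It is not the method of U.S. Pat. No. 4,799,262. It assumes, purely for illustration, that frames are log-spectral vectors (so that a change in the transfer function appears as an additive offset) and that the nearest codebook vector is chosen by Euclidean distance. The function name `quantize`, the two-dimensional codebook, and the offset values are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def quantize(frame, codebook):
    """Return the index of the codebook vector nearest to the frame.

    Euclidean distance is an assumption; the source does not specify a metric.
    """
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))

# Hypothetical two-dimensional codebook and input frame (illustrative values).
codebook = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
frame = (0.5, 0.5)

# Hypothetical additive offset standing in for a change in the transfer
# function, e.g. after the microphone has moved.
offset = (3.0, 0.0)
shifted = tuple(f + o for f, o in zip(frame, offset))

# Subtracting an accurate estimate of the offset restores the original frame.
compensated = tuple(s - o for s, o in zip(shifted, offset))

print(quantize(frame, codebook))        # index chosen for the undistorted frame
print(quantize(shifted, codebook))      # a different index after the shift
print(quantize(compensated, codebook))  # original index after compensation
```

In this toy case the shifted frame is assigned to a different codebook vector than the undistorted frame, which is the kind of mismatch the paragraph above describes; removing the offset recovers the original assignment.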
One typical way to track and compensate for such changes, described in U.S. Pat. No. 4,799,262, is to require the user to periodically perform a "mike check." The mike check consists of the user speaking a small set of known words which are used by the system to compute an estimate of the transfer function and compensate for it. The set of words spoken during a mike check is fixed so as not to compound the problem of tracking changes in the transfer function by introducing changes due to the different spectral content of different words. Alternatively, some systems do not require a mike check but average successive input words to deemphasize the spectral contribution of individual lexical items; but to do so properly (especially in specialized applications) requires a very long adaptation time, which makes it difficult to adequately track the changing transfer function.
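The two approaches just described can be contrasted with a rough sketch. It is an illustration under stated assumptions, not the procedure of U.S. Pat. No. 4,799,262: frames are assumed to be log-spectral vectors, the mike check is assumed to have stored reference frames for the known words, and the long-term average is assumed to be an exponentially weighted running mean. The names `mike_check_offset` and `running_average`, the smoothing rate, and all numeric values are hypothetical.

```python
def mike_check_offset(observed, reference):
    """Estimate the offset as the mean per-dimension difference between
    frames observed for known words and their stored reference frames."""
    n = len(observed)
    dims = len(observed[0])
    return [
        sum(o[d] - r[d] for o, r in zip(observed, reference)) / n
        for d in range(dims)
    ]

def running_average(avg, frame, rate):
    """One step of an exponentially weighted running mean (assumed form
    of the averaging over successive input words)."""
    return [a + rate * (f - a) for a, f in zip(avg, frame)]

# Mike check: known words, so observed frames can be paired with references.
reference = [[1.0, 2.0], [3.0, 0.0]]
observed = [[1.5, 1.0], [3.5, -1.0]]
offset = mike_check_offset(observed, reference)
print(offset)  # [0.5, -1.0]

# Averaging unknown input: a small rate suppresses word-specific content
# but responds slowly when the long-term mean changes.
avg = [0.0, 0.0]
for _ in range(100):
    avg = running_average(avg, [1.0, -2.0], rate=0.01)
print(avg)  # still well short of [1.0, -2.0] after 100 frames
```

The small smoothing rate is what averages out the spectral content of individual words, and it is also what makes the estimate slow to follow a change, which is the trade-off noted above.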
While doing mike checks during enrollment may constitute an adequate solution to the problem of tracking the varying transfer function, users are less willing to interrupt useful work to perform mike checks during recognition. Accordingly, changes in the transfer function can dramatically and adversely affect the accuracy of the speech recognition process, and substantial errors of recognition can occur. It is thus highly desirable to track the changing transfer function automatically, without additional work by the user. Any such automatic method must therefore be able to operate with unknown input speech and must be inherently stable, since it must work in an unsupervised fashion.
Systems have employed, for example, methods which determine non-speech boundaries of the incoming speech signal, and then set noise thresholds which depend upon the level of the actual noise measured during non-speech. Still other systems are directed toward providing better representations of the noise during non-speech times so that a better recognition of speech can be obtained by subtracting the "true" noise from the speech signal during actual speech. These systems, however, typically do not take into account the effective "noise" which results from movement of the microphone or the relation between the speaker and the microphone, and changes in the speaker's voice which can vary in a random fashion.
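The noise-subtraction idea mentioned above can be sketched as follows. This is a generic illustration, not a description of any particular prior system: it assumes frames are power spectra in the linear domain, that non-speech frames have already been identified, and that negative results are clipped to a floor. The names `noise_estimate` and `subtract_noise` and all values are hypothetical.

```python
def noise_estimate(non_speech_frames):
    """Average the frames measured during non-speech, per frequency bin."""
    n = len(non_speech_frames)
    return [sum(column) / n for column in zip(*non_speech_frames)]

def subtract_noise(frame, noise, floor=0.0):
    """Subtract the noise estimate from a speech frame, clipping at a floor
    so that no bin goes negative (the floor is an assumption)."""
    return [max(f - n, floor) for f, n in zip(frame, noise)]

# Hypothetical power spectra with two frequency bins.
non_speech = [[1.0, 2.0], [3.0, 2.0]]
noise = noise_estimate(non_speech)
print(noise)                              # [2.0, 2.0]
print(subtract_noise([5.0, 1.0], noise))  # [3.0, 0.0]
```

A scheme of this kind removes additive background noise only; it does not by itself address a change in the speaker-to-microphone transfer function, which is the limitation noted above.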
In addition, systems for reducing stationary noise or for filtering near-stationary noise have been provided using, for example, Wiener or Kalman filter theory for minimization, where prior knowledge of the noise is acquired rather than assumed. These systems also do not take into account the transfer function from speaker to microphone, and do not allow for the automatic adaptation of the actual speech using statistics available during and from the speech process itself.
Other systems try to account for the transfer function but either (a) require knowledge of the input text, that is, a kind of retraining; (b) assume that the speech recognition outputs are correct and use word identities (an assumption which cannot be properly made); or (c) require long and impractical adaptation times to average out word information.
Accordingly, an object of the invention is a method and apparatus for continuously updating a data correction signal which represents the speaker-to-microphone system transfer function and which can be employed during actual speech recognition or training of the system. Other objects of the invention are a speech recognition method and apparatus which provide higher accuracy of speech recognition, which adapt to changing speaker/microphone conditions in an automatic manner, which provide accurate high level updating of the speaker transfer function without interrupting the speech recognition process, and which provide for efficient and more precise speech recognition using a vector quantization codebook analysis.