The present invention relates to speech recognition systems and, more particularly, to apparatus and methods for performing shift-invariant speech recognition.
Speech recognition is an emerging technology. More and more often it is replacing classical data entry or order taking, which typically require filling out forms, typing, or interacting with human operators. An initial step in a computerized speech recognition system generally involves the computation of a set of acoustic features from sampled speech. The sampled speech may be provided by a user of the system via an audio-to-electrical transducer, such as a microphone, and converted from an analog representation to a digital representation before sampling. A classical acoustic front-end (processor) is typically employed to compute the acoustic features from the sampled speech. The acoustic features are then submitted to a speech recognition engine, where the utterances are recognized.
However, it is known that one of the major problems inherent in most speech recognition systems is that they are not translation invariant. In other words, when the sampled speech signal is translated (i.e., shifted) by merely a few milliseconds, the speech recognition system may experience a large variation in performance. That is, an unacceptable recognition error rate may be experienced. It is known that such spurious shifting of the input utterance may be caused by one or more factors; for example, the speaker may pause when speaking into the microphone, thereby causing a shift of several milliseconds. While this problem is a result of the discrete character of the signal processing techniques used to extract the acoustic features, no existing system has addressed, much less solved, this issue.
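By way of illustration only, the sensitivity to small shifts described above may be reproduced with a simplified sketch. The sketch is not the front-end of any particular system. The 8 kHz sampling rate, the 200-sample frame, the 80-sample hop, the synthetic tone burst, and the use of per-frame log energy as a stand-in for real acoustic features are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math

def frame_log_energy(samples, frame_len=200, hop=80):
    """Log energy of each fixed-length frame (a stand-in for real acoustic features)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0) in silence
    return feats

def synthetic_utterance(n=4000, rate=8000):
    """Hypothetical test signal: silence, then a 440 Hz tone burst, then silence."""
    out = []
    for i in range(n):
        envelope = 1.0 if 1000 <= i < 3000 else 0.0
        out.append(envelope * math.sin(2 * math.pi * 440 * i / rate))
    return out

def shift(samples, k):
    """Delay the signal by k samples, padding with silence and keeping the length."""
    return [0.0] * k + samples[:len(samples) - k]

def max_abs_diff(a, b):
    """Largest per-frame difference between two feature sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

signal = synthetic_utterance()
base = frame_log_energy(signal)

# A 24-sample delay is 3 ms at the assumed 8 kHz rate and is not a multiple of the hop.
shifted_small = frame_log_energy(shift(signal, 24))

# An 80-sample delay equals exactly one hop of the frame grid.
shifted_hop = frame_log_energy(shift(signal, 80))

print("max feature change, 3 ms shift:", max_abs_diff(base, shifted_small))
print("max feature change, one-hop shift (realigned):",
      max_abs_diff(base[:-1], shifted_hop[1:]))
```

In this sketch, a delay equal to one hop reproduces the same feature values one frame later, whereas the 3 ms delay changes the feature values themselves, because the fixed frame grid cuts the onset and offset of the burst at different positions. This is one simple way to see how the discrete framing of the signal can make the extracted features depend on the shift.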