1. Technical Field
This invention relates to automatic speech recognition, and more particularly, to a system that isolates spoken utterances from background noise and non-speech transients.
2. Related Art
Within a vehicle environment, Automatic Speech Recognition (ASR) systems may be used to provide passengers with navigational directions based on voice input. This functionality increases safety concerns in that a driver's attention is not distracted away from the road while attempting to manually key in or read information from a screen. Additionally, ASR systems may be used to control audio systems, climate controls, or other vehicle functions. ASR systems enable a user to speak into a microphone and have signals translated into a command that is recognized by a computer. Upon recognition of the command, the computer may implement an application. One factor in implementing an ASR system is correctly recognizing spoken utterances. This requires locating the beginning and/or the end of the utterances (“end-pointing”).
Some systems search for energy within an audio frame. Upon detecting the energy, the systems predict the end-points of the utterance by subtracting a predetermined time period from the point at which the energy is detected (to determine the beginning time of the utterance) and adding a predetermined time from the point at which the energy is detected (to determine the end time of the utterance). This selected portion of the audio stream is then passed on to an ASR in an attempt to determine a spoken utterance.
Energy within an acoustic signal may come from many sources. Within a vehicle environment, for example, acoustic signal energy may derive from transient noises such as road bumps, door slams, thumps, cracks, engine noise, movement of air, etc. The system described above, which focuses on the existence of energy, may misinterpret these transient noises to be a spoken utterance and send a surrounding portion of the signal to an ASR system for processing. The ASR system may thus unnecessarily attempt to recognize the transient noise as a speech command, thereby generating false positives and delaying the response to an actual command.
Therefore, a need exists for an intelligent end-pointer system that can identify spoken utterances in transient noise conditions.