This invention relates to keyword spotting in audio signals, and more particularly to multi-task configuration of a system for keyword spotting.
Automated speech recognition (ASR) involves acquisition of data representing an acoustic input, which generally includes human-produced speech. An ASR system processes that data in order ultimately to act according to that speech. For example, the user may say “Play music by the Beatles,” and the processing of the acquired data representing the acoustic input that includes that utterance causes Beatles music to be played to the user.
Different applications of ASR generally make use of somewhat different types of processing. One type of processing aims to transcribe the words spoken in an utterance prior to acting on a command represented by the transcribed words. In some such applications, the user indicates that he wishes to speak by pressing a button. Data acquisition may be initiated by the button press and terminated when speech is no longer found in the data. Such processing is sometimes referred to as a “push-to-talk” approach.
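The push-to-talk capture cycle described above can be sketched as follows. This is a minimal illustration, not a description of any particular implementation: the frame source, the energy-threshold speech test, and the hangover length are all assumptions introduced for illustration.

```python
def capture_push_to_talk(frames, energy_threshold=500.0, hangover_frames=20):
    """Collect audio frames after a button press until speech ends.

    `frames` is an iterator of PCM sample lists (one list per frame),
    assumed to start flowing when the button is pressed. A frame is
    treated as speech if its RMS energy exceeds `energy_threshold`;
    acquisition terminates after `hangover_frames` consecutive
    non-speech frames. All parameter values are illustrative.
    """
    captured = []
    silent = 0
    for frame in frames:
        captured.append(frame)
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        if rms > energy_threshold:
            silent = 0  # speech present; reset the silence counter
        else:
            silent += 1
            if silent >= hangover_frames:
                break  # speech is no longer found; terminate acquisition
    return captured
```

In this sketch the endpoint decision is a simple energy test with a hangover; practical systems would use a more robust speech/non-speech classifier.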
Some applications of ASR do not require the user to press a button. In one approach, the user can signal that he wishes to speak a command by first speaking a “trigger” word, also referred to as a “wake” word. In some systems, the user may immediately follow the trigger word with the command, for example, “Alexa, play music by the Beatles,” where “Alexa” is the trigger word. Processing the data to detect the presence of the trigger word is often referred to as word spotting (or keyword spotting). An ASR system that monitors an acoustic signal waiting for the user to speak an appropriately structured utterance (such as an utterance beginning with a trigger word) may be referred to as implementing an “open microphone” approach.
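The open-microphone behavior described above can be sketched at the level of decoded word hypotheses: the system monitors continuously, and once the trigger word is spotted, the words that follow are treated as the command. The function name, the stream representation, and the trigger value are illustrative assumptions, not details from the original text.

```python
def open_microphone(word_stream, trigger="alexa"):
    """Monitor a stream of decoded words for the trigger word.

    `word_stream` is an iterable of word hypotheses produced as the
    system continuously processes the acoustic signal. When the trigger
    word is spotted, the remaining words are returned as the command;
    if the trigger never occurs, None is returned. This is a sketch:
    real spotters operate on acoustic features with confidence scores,
    not on a clean word sequence.
    """
    it = iter(word_stream)
    for word in it:
        if word.lower() == trigger:
            return list(it)  # the command immediately follows the trigger
    return None
```

For example, the utterance “Alexa, play music by the Beatles” would yield the command words following the spotted trigger “Alexa.”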