It is known to provide automatic speech recognition (ASR) for mobile devices using remotely-located speech recognition algorithms accessed via the internet. This speech recognition can be used to recognise spoken commands, for example, for browsing the internet and for controlling specific functions on, or via, the mobile device. In order to preserve battery life, these mobile devices spend most of their time in a power saving stand-by mode. A trigger phrase may be used to wake the main processor of the device such that speaker verification (i.e. verification of the identity of the person speaking), and/or any other speech analysis service, can be carried out within the main processor and/or by a remote analysis service.
In order to improve the recognition rates in the ASR service, it is known to use various signal processing techniques which enhance the audio, i.e. speech, before transmission, for example acoustic echo cancellation, noise reduction and multi-microphone beamforming. Many of these enhancement techniques are adaptive, that is, they modify their parameters dynamically in order to adapt to the acoustic environment in which the microphone signal is being provided. Upon a change of acoustic environment, it takes a finite period of time for these parameters to be iteratively adapted to a point where any undesired features produced by the acoustic environment are reduced to an insignificant level. This period is known as the adaptation time, and for many adaptive audio signal processing algorithms it is typically of the order of one second.
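As a rough illustration of adaptation time, the following sketch assumes a single parameter that is updated once per frame by a first-order rule, moving a fixed fraction of the remaining error towards its new target. The function name frames_to_adapt, the step size, the tolerance and the 10 ms frame period are all illustrative assumptions and are not taken from any particular enhancement algorithm.

```python
def frames_to_adapt(mu, tolerance, start=0.0, target=1.0):
    """Count the updates a first-order adaptive parameter needs to come
    within `tolerance` of its new target after an environment change."""
    value = start
    frames = 0
    while abs(target - value) > tolerance:
        value += mu * (target - value)  # move a fraction mu of the remaining error
        frames += 1
    return frames


FRAME_PERIOD_S = 0.010  # assumed 10 ms frame period (hypothetical)

frames = frames_to_adapt(mu=0.05, tolerance=0.01)
adaptation_time_s = frames * FRAME_PERIOD_S
print(frames, adaptation_time_s)
```

With these assumed values the parameter needs 90 frames, i.e. about 0.9 s, to come within 1% of its target, which is consistent with the order of magnitude of one second mentioned above.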
Acoustic echo cancellation (AEC) uses an adaptive process as described above to cancel the local loudspeaker contribution that may be picked up by a speech microphone. It does so by using a reference signal derived from the output to the loudspeaker, together with an adaptive estimate of the acoustic transfer function from the loudspeaker to the microphone. This adaptation can take place on any signal output from the loudspeaker. It is therefore not dependent on a signal being input by a user through the microphone. Some typical uses for ASR during loudspeaker operation are voice control of music playback, and voice control during speakerphone telephony. For these cases, the AEC can converge to the environment within one second of the loudspeaker output commencing, and therefore, in most cases, the adaptation has reached the required level before a user starts to issue spoken commands.
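The echo cancellation principle described above can be sketched with a normalised least-mean-squares (NLMS) filter, which is one common way of adapting an echo-path estimate and is used here purely as an assumed example. The function names, the number of taps, the step size and the four-tap echo path are hypothetical. The demonstration uses only loudspeaker output, with no user speech, to show that adaptation can proceed on the loudspeaker signal alone.

```python
import random


def nlms_echo_canceller(reference, microphone, num_taps=8, step=0.5, eps=1e-8):
    """Adapt an FIR estimate of the loudspeaker-to-microphone path and
    return (residual, weights). NLMS is an assumed choice of algorithm."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # recent reference samples, newest first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate  # microphone signal with estimated echo removed
        norm = sum(h * h for h in history) + eps
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


def demo(seed=0, n=4000):
    """Simulate loudspeaker output picked up by the microphone (no speech)."""
    rng = random.Random(seed)
    echo_path = [0.6, -0.3, 0.1, 0.05]  # hypothetical acoustic response
    reference = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    microphone = []
    for i in range(n):
        taps = range(min(len(echo_path), i + 1))
        microphone.append(sum(echo_path[k] * reference[i - k] for k in taps))
    residual, weights = nlms_echo_canceller(reference, microphone)
    return echo_path, residual, weights


echo_path, residual, weights = demo()
print([round(w, 3) for w in weights])
```

In this idealised, noise-free simulation the first four weights approach the assumed echo path and the residual decays as the filter converges. A real canceller would also have to cope with background noise and with the user speaking at the same time as the loudspeaker output.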
In contrast, adaptive noise reduction and multi-microphone beamforming are adaptive processes that do depend on a signal being produced containing the user's speech. These adaptive processes cannot start to adapt their parameters until the user's speech is present in the signal from a microphone, and, once the user's speech is present, they take a period of time to adapt to the required level. These adaptive processes may be required to enhance speech for use in ASR immediately following a voice-triggered wake-up from standby. It may also not be feasible to run these speech recognition algorithms in the low-power standby state, as their computational complexity causes the resultant device power consumption to be relatively significant. The net result is that the start of the spoken command may not be effectively enhanced, which may cause a poor result in the ASR service.