Contemporary handheld portable electronic devices, such as mobile phones and portable media players, typically include user interfaces that incorporate speech or natural language recognition to initiate processes or perform tasks. However, for core functions, such as turning the device on or off, manually placing the device into a sleep mode, and waking the device from the sleep mode, handheld portable electronic devices generally rely on tactile inputs from a user. This reliance on tactile user input may in part be due to the computational expense of frequently (or continuously) performing speech recognition using a processor of the device. Further, a user of a portable electronic device typically must direct his or her speech to a specific microphone whose output feeds a speech recognition engine, in order to avoid problems with ambient noise pickup.
Mobile phones now have multiple distant microphones built into their housings to improve noise suppression and audio pickup. Speech picked up by multiple microphones may be processed through beamforming. In beamforming, signals from the multiple microphones may be aligned and aggregated through digital signal processing, to improve the speech signal while simultaneously reducing noise. The summed signal may then be fed to an automatic speech recognition (ASR) engine, which recognizes a specific word or phrase that in turn triggers an action in the portable electronic device. To accurately detect a specific word or phrase using beamforming, a microphone occlusion process may be required to run prior to the beamforming. This technique, however, may result in excessive power consumption and time delay, as it requires significant digital signal processing to select the “best” microphones to use and then to generate a beamformed signal therefrom.
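By way of illustration only, the align-and-aggregate step of beamforming described above may be sketched as a simple delay-and-sum operation. The sketch below is not taken from any particular device. The function name `delay_and_sum`, the use of whole-sample delays, and the assumption that the per-microphone delays are already known are all hypothetical simplifications; a practical beamformer would typically estimate the delays and may apply fractional delays and per-channel weights.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel and average the aligned samples.

    channels: list of equal-rate sample lists, one per microphone (assumed).
    delays:   non-negative whole-sample delays (assumed known), giving how
              late the speech arrives at each microphone relative to the
              earliest one.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")

    # Number of output samples for which every advanced channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        raise ValueError("channels are too short for the given delays")

    output = []
    for n in range(length):
        # Advance each channel by its delay so the speech lines up,
        # then aggregate across microphones.
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output


# Usage: the same speech reaches the second microphone two samples later.
speech = [0, 1, 0, -1, 0, 1, 0, -1]
mic_a = speech + [0, 0]
mic_b = [0, 0] + speech
beamformed = delay_and_sum([mic_a, mic_b], [0, 2])
```

Because the aligned speech components add coherently while uncorrelated noise at the different microphones does not, the averaged output tends to favor the speech. Even this minimal form touches every sample of every channel, which suggests why running such processing continuously may be costly in power.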