A variety of audio-signal-processing techniques exist for different purposes. One such purpose is to remove “echo” and ambient interference signals, or “noise”, from one or more input audio channels in order to isolate the sound that would be present in the absence of such signals. For example, as smart-speaker devices, such as the Amazon Echo™ device, have become popular, far-field voice signal isolation and processing have become more important. Such devices typically include one or more microphones for receiving spoken input from a user. They also include one or more speakers (1) for responding to, and/or providing information requested by, the user, using text-to-speech (TTS) processing, and/or (2) for playing other audio content, such as music.
Within such a context, it often is desirable to identify what a user is saying at the same time that such other content (e.g., music or TTS) is playing through the device's speaker(s) and/or when other ambient sound sources are creating interference. However, the audio signal received at each of the device's microphones (multiple microphones commonly being used) typically contains some version of such other played audio content, in addition to the user's voice.
Conventionally, two major signal-processing components are used in such a system to address this problem: echo cancellation and beamforming. Echo cancellation (i.e., removal, or at least reduction, of the portion of the received audio signal resulting from the played content) often is critical to the performance of “keyword activation” (KA) and/or automatic speech recognition (ASR) when the smart-speaker device is playing other audio content (e.g., music, TTS responses, etc.). Using sub-band (e.g., frequency-domain) processing, the performance of echo cancellation (EC), including convergence rate and steady-state echo reduction, has improved to the point that it often is now able to handle a smart-speaker device's most difficult cases, in which the device's speaker is playing loudly and the user is standing far away. Beamforming (which relies on the use of multiple microphones to achieve programmably selective directionality) also can significantly improve KA and ASR performance, particularly in the presence of room reverberation and environmental noise.
An exemplary conventional system 10 is illustrated in FIG. 1. As shown, multiple microphones 12 (e.g., microphones 12A-C) input corresponding audio signals. Each such audio signal (typically after analog-to-digital conversion, not shown) is then decomposed into separate frequency bands using a corresponding analysis/decomposition module 14 (e.g., one of modules 14A-C). A reference signal 15, typically a digital signal corresponding to what is being played through the device's speaker(s), similarly is decomposed into separate frequency bands using an analysis/decomposition module 14 (module 14D in FIG. 1). Each such decomposed input audio signal (from a given microphone) is then processed together with the decomposed reference signal in a separate corresponding echo-cancellation module 18 (e.g., one of modules 18A-C). Next, for each of the sub-bands, a separate beamformer module 20 (e.g., one of modules 20A-C) processes the output for that sub-band from all of the echo-cancellation modules 18. The individual frequency bands output by the corresponding individual beamformer modules 20 are then resynthesized by sub-band resynthesis module 24 to provide a final output signal 25.
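The data flow of FIG. 1 can be sketched in a few lines of Python. This is only an illustration of how the signals are regrouped between stages, not the system's actual implementation: the function names are hypothetical, the analysis stage is assumed to be a simple non-overlapping block DFT (a crude stand-in for a real filter bank), the echo canceller is a pass-through placeholder, and the beamformer simply averages across microphones.

```python
import cmath

def analyze(signal, num_bands):
    """Stand-in for an analysis/decomposition module 14 (assumed block DFT).

    Each non-overlapping block of num_bands samples yields one sample per
    sub-band; trailing samples that do not fill a block are dropped.
    """
    bands = [[] for _ in range(num_bands)]
    for start in range(0, len(signal) - num_bands + 1, num_bands):
        block = signal[start:start + num_bands]
        for m in range(num_bands):
            bands[m].append(sum(
                block[n] * cmath.exp(-2j * cmath.pi * m * n / num_bands)
                for n in range(num_bands)))
    return bands

def resynthesize(bands):
    """Inverse of analyze(); stand-in for sub-band resynthesis module 24."""
    num_bands = len(bands)
    out = []
    for k in range(len(bands[0])):
        for n in range(num_bands):
            value = sum(
                bands[m][k] * cmath.exp(2j * cmath.pi * m * n / num_bands)
                for m in range(num_bands)) / num_bands
            out.append(value.real)
    return out

def cancel_echo(mic_band, ref_band):
    """Placeholder sub-band EC submodule: passes its input through unchanged."""
    return list(mic_band)

def beamform(band_across_mics):
    """Placeholder beamformer 20: averages one sub-band across microphones."""
    num_mics = len(band_across_mics)
    return [sum(samples) / num_mics for samples in zip(*band_across_mics)]

def process(mic_signals, reference, num_bands):
    num_mics = len(mic_signals)
    ref_bands = analyze(reference, num_bands)                  # module 14D
    mic_bands = [analyze(x, num_bands) for x in mic_signals]   # modules 14A-C
    # Echo cancellation: one module per microphone, working band by band.
    ec = [[cancel_echo(mic_bands[i][m], ref_bands[m])
           for m in range(num_bands)] for i in range(num_mics)]
    # Beamforming: one module per sub-band, working across microphones.
    z = [beamform([ec[i][m] for i in range(num_mics)])
         for m in range(num_bands)]
    return resynthesize(z)                                     # module 24
```

The point of the sketch is the change of grouping between the two middle stages: the echo-cancellation results are indexed by microphone first and sub-band second, whereas each beamformer gathers a single sub-band from every microphone.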
The signals input by the individual microphones 12 are denoted herein as $x_i(t)$, $i = 1, \ldots, N$, where $N$ is the number of microphones. The echo reference signal is denoted herein as $r(t)$. Both $x_i(t)$ and $r(t)$ are processed by the sub-band analysis/decomposition modules 14, which processing typically includes $D$ times down-sampling. The outputs of the analysis/decomposition modules are denoted herein as $x_{i,m}^{D}(t)$ and $r_{m}^{D}(t)$, $m = 1, \ldots, M$, where $M$ is the number of sub-bands. As indicated above, each microphone's echo cancellation is done independently in a separate echo-cancellation module 18 (e.g., one of modules 18A-C). Each such echo-cancellation module 18, in turn, typically includes $M$ sub-band EC submodules (not shown). The EC signals output from the echo-cancellation modules 18 are denoted herein as $\hat{x}_{i,m}^{D}(t)$, $i = 1, \ldots, N$, $m = 1, \ldots, M$. Following the EC processing 18, the beamforming 20 is done in each sub-band independently. That is, each beamformer module 20 processes a different sub-band across all the EC-processed microphone signals.
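A single sub-band EC submodule, operating on the down-sampled sub-band stream of one microphone and the corresponding sub-band stream of the reference, might look like the following. The text above does not specify the adaptation algorithm, so this sketch assumes a complex-valued normalized least-mean-squares (NLMS) adaptive filter, a common choice for this task; the function name, tap count, and step size are illustrative assumptions.

```python
import random

def nlms_echo_cancel(mic, ref, num_taps=4, step=0.5, eps=1e-8):
    """Assumed NLMS echo canceller for one microphone and one sub-band.

    mic and ref are equal-length sequences of (complex) sub-band samples.
    Returns the error signal, i.e. the microphone stream with the
    estimated echo subtracted.
    """
    weights = [0j] * num_taps   # adaptive estimate of the echo path
    history = [0j] * num_taps   # most recent reference sample first
    out = []
    for x, r in zip(mic, ref):
        history = [complex(r)] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = x - echo_estimate
        power = sum(abs(h) ** 2 for h in history) + eps
        # Normalized gradient step toward the true echo path.
        weights = [w + step * error * h.conjugate() / power
                   for w, h in zip(weights, history)]
        out.append(error)
    return out

# Hypothetical test signal: the microphone hears only a two-tap echo
# of a random reference, with no near-end speech or noise.
rng = random.Random(0)
ref = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(400)]
echo_path = [0.6 + 0.2j, -0.3j]
mic = [sum(echo_path[k] * ref[t - k]
           for k in range(len(echo_path)) if t - k >= 0)
       for t in range(len(ref))]
residual = nlms_echo_cancel(mic, ref)
```

In this noiseless example the residual decays toward zero as the filter converges. In a full system, one such submodule would run for each of the M sub-bands of each of the N microphones.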
Each sub-band's beamforming can be done as if in the time domain, i.e., using filter-and-sum. Another option is to first conduct a Fast Fourier Transform (FFT) analysis in each sub-band and then do beamforming in each bin, followed by inverse Fast Fourier Transform (iFFT) processing, so that a sub-band signal stream is again obtained. The outputs of the beamforming modules 20, designated herein as $z_m(t)$, $m = 1, \ldots, M$, are input into the sub-band resynthesis module 24, which generates the system's output signal 25, designated herein as $y(t)$.
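The filter-and-sum option for one sub-band can be illustrated as follows: each microphone's EC-processed sub-band stream is passed through its own short FIR filter, and the filtered streams are summed to form that sub-band's beamformer output. The function name and the example filter taps are hypothetical; how the taps are designed or adapted is outside the scope of this sketch.

```python
def filter_and_sum(mic_streams, filters):
    """Filter-and-sum beamformer for a single sub-band.

    mic_streams: one equal-length sample sequence per microphone.
    filters:     one list of FIR taps per microphone (tap k applies to
                 the sample k steps in the past).
    """
    num_samples = len(mic_streams[0])
    out = []
    for t in range(num_samples):
        acc = 0j
        for stream, taps in zip(mic_streams, filters):
            for k, w in enumerate(taps):
                if t - k >= 0:
                    acc += w * stream[t - k]
        out.append(acc)
    return out

# Illustrative case: the second microphone receives the source one sample
# later than the first, so the first is delayed by one tap before averaging.
source = [1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j]
mic_a = source
mic_b = [0j] + source[:-1]
beam = filter_and_sum([mic_a, mic_b], [[0.0, 0.5], [0.5]])
```

Here the two filters time-align the microphones, so `beam` is the source delayed by one sample. With single-tap filters the same function reduces to a weighted sum across microphones.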