Natural language processing systems include various modules and components for receiving input from a user (e.g., audio, text, etc.) and determining what the user meant. In some implementations, a natural language processing system includes an automatic speech recognition (“ASR”) module that receives audio input of a user utterance and generates one or more likely transcriptions of the utterance. Automatic speech recognition modules typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which subword units (e.g. phonemes or triphones) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine the most likely transcription of the utterance based on the hypotheses generated using the acoustic model and lexical features of the language in which the utterance is spoken.
Many devices configured to obtain audio data of user utterances include both a loudspeaker and a microphone. The loudspeaker is used to play audio signals, such as speech from a remote source during a telephone call, audio content presented from local storage or streamed from a network etc. The microphone is used to capture audio signals from a local source, such as a user speaking voice commands or other utterances. An acoustic echo occurs when the remote signal emitted by the loudspeaker is captured by the microphone, after undergoing reflections in the local environment.
An acoustic echo canceller (“AEC”) may be used to remove acoustic echo from an audio signal captured by a microphone in order to facilitate improved communication. The AEC typically filters the microphone signal by determining an estimate of the acoustic echo (e.g., the remote audio signal emitted from the loudspeaker and reflected in the local environment). The AEC can then subtract the estimate from the microphone signal to produce an approximation of the true local signal (e.g., the user's utterance). The estimate is obtained by applying a transformation to a reference signal that corresponds to the remote signal emitted from the loudspeaker. The transformation is typically implemented using an adaptive algorithm. Adaptive transformation relies on a feedback loop, which continuously adjusts a set of coefficients that are used to calculate the estimated echo from the far-end signal. Different environments produce different acoustic echoes from the same loudspeaker signal, and any change in the local environment may change the way that echoes are produced. By using a feedback loop to continuously adjust the coefficients, an AEC to can adapt its echo estimates to the local environment in which it operates.