Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “digital agents,” “chatbots,” “interactive personal assistants,” “intelligent personal assistants,” “assistant applications,” “conversational agents,” etc.). For example, humans (which when they interact with automated assistants may be referred to as “users”) may provide commands and/or requests to an automated assistant using spoken natural language input (i.e., utterances), which may in some cases be converted into text (or other semantic representation) and then processed. Also, for example, users can additionally or alternatively provide requests by providing textual (e.g., typed) natural language input. An automated assistant responds to a request by providing responsive user interface output, which can include audible and/or visual user interface output.
As mentioned above, many automated assistants are configured to be interacted with via spoken utterances. Spoken utterances are received at a client device via one or more microphones of the client device. For example, each microphone of the client device generates a corresponding audio signal that varies over time in dependence on sound(s) detected at the microphone. The audio signal(s) received via the microphone(s) (and/or frequency domain representations thereof) can be processed (at the client device and/or remote server device(s)) for one or more purposes, such as automatic speech recognition (e.g., converting audio signal(s) to text, phone(s), phoneme(s), and/or other semantic representation).
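As a purely illustrative sketch of what a "frequency domain representation" of a microphone audio signal can look like, the following computes a short-time Fourier transform with NumPy. The function name `stft_frames`, the 512-sample frame length, the 256-sample hop, and the Hann window are assumptions chosen for illustration and are not taken from this disclosure.

```python
import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    """Return a (num_frames, frame_len // 2 + 1) complex spectrogram.

    Hypothetical helper: frame_len, hop, and the Hann window are
    illustrative choices, not values specified by the source text.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Slice overlapping frames and taper each one before the transform.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Real-input FFT along the time axis of each frame.
    return np.fft.rfft(frames, axis=1)

# Example: a 1 kHz tone sampled at 16 kHz for one second.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
spectrogram = stft_frames(np.sin(2 * np.pi * 1000 * t))
```

Each row of `spectrogram` describes one short time slice of the signal, which is the kind of representation that downstream pre-processing or ASR component(s) may consume in place of, or in addition to, the raw audio signal.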
Each client device via which a user interacts with an automated assistant includes an assistant interface. The assistant interface provides, to a user of the client device, an interface for interacting with the automated assistant (e.g., it receives spoken and/or typed input from the user, and provides appropriate audible and/or graphical responses), and interfaces with one or more additional components that implement the automated assistant (e.g., remote server device(s) that process user inputs and generate appropriate responses).
Audio signal(s) that are received via microphone(s) of a client device (and/or frequency domain representations thereof) are often pre-processed prior to being further processed by automatic speech recognition (ASR) component(s) and/or other component(s). Pre-processing of the audio signal(s) (and/or frequency domain representations thereof) can include, for example, dereverberation. Dereverberation processes audio data in an attempt to reduce or eliminate reverberation(s) present in the audio data.
Reverberation(s) are reflection(s) of a sound that are created when the sound is reflected off sound-reflective surface(s) of object(s) (e.g., walls, furniture, people). Reverberation(s) can be present in an audio signal because a corresponding microphone can receive not only the wave-front arriving directly from the source of the sound (e.g., a human speaker) by the shortest path, but can also receive longer path reflection(s) of that wave-front from surrounding object(s). The longer path reflection(s) of a wave-front are time-delayed relative to the wave-front that arrives by the shortest path, and are typically reduced in amplitude relative to the wave-front that arrives by the shortest path. The reflection(s) of a sound in an audio signal generated by a microphone will be dependent on, for example, the size of the room (or other environment), the position and nature of sound-reflective surface(s), the position of the source of the sound, the position of the corresponding microphone, etc.
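The relationship described above, in which a microphone receives the shortest-path wave-front plus time-delayed, amplitude-reduced reflection(s), can be illustrated with a minimal sketch. The function name `add_reflections` and the specific delays and gains are hypothetical and chosen only for illustration; real environments produce far denser and frequency-dependent reflections.

```python
import numpy as np

def add_reflections(direct, reflections):
    """Mix a direct-path signal with delayed, attenuated copies of itself.

    direct: 1-D array holding the shortest-path signal.
    reflections: iterable of (delay_in_samples, gain) pairs, one per
        longer-path reflection (hypothetical values for illustration).
    The output is truncated to the length of the input.
    """
    direct = np.asarray(direct, dtype=float)
    out = direct.copy()
    for delay, gain in reflections:
        if delay <= 0 or delay >= len(direct):
            continue  # reflection falls outside the modeled window
        # A reflection is the same sound, arriving later and weaker.
        out[delay:] += gain * direct[:-delay]
    return out

# A single click followed by two progressively weaker echoes.
click = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
reverberant = add_reflections(click, [(2, 0.5), (4, 0.25)])
# reverberant is [1.0, 0.0, 0.5, 0.0, 0.25]
```

In this toy model, changing the delays corresponds to changing the path lengths (e.g., room size or the positions of the source and microphone), and changing the gains corresponds to changing how reflective the surface(s) are.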
Although techniques exist for pre-processing audio data to mitigate reverberation (mitigation of reverberation is also referred to herein as “dereverberation”), such techniques can suffer from one or more drawbacks. For example, in pre-processing audio data that includes a spoken utterance, some techniques can require audio data for the entire spoken utterance to be obtained before dereverberation can be performed on the audio data. For instance, some techniques rely on value(s) for dereverberation (e.g., so-called “tap” values), where the value(s) cannot be calculated until the audio data for the entire spoken utterance is received. Waiting for audio data for the entire spoken utterance to be obtained can lead to latency in dereverberation, and a resulting latency in use of dereverberated audio data for ASR and/or other purposes, thereby also causing latency in generating a response from an automated assistant (which can rely on the ASR in generating the response). As another example, some techniques can experience a degradation in performance when a source of audio (e.g., a human speaker) is non-stationary during the course of a spoken utterance. With such techniques, this and/or other factor(s) can lead to a loss of meaningful audio in dereverberation, resulting in a dereverberated audio signal that cannot be properly processed by ASR component(s) and/or other component(s) to ascertain the true meaning of a spoken utterance. This can lead to poor performance of the ASR and, as a result, poor performance of one or more other automated assistant components that rely on output of the ASR.
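To illustrate why tap value(s) computed over an entire utterance introduce latency, the following is a simplified sketch of a batch approach: a time-domain delayed linear prediction whose taps are solved by least squares over the whole signal. It is an assumption-laden toy and not the method of any particular existing technique; the names `estimate_taps_batch` and `dereverberate_batch`, the single-channel time-domain formulation, and the unweighted least-squares solve are all illustrative choices. Practical techniques often operate per frequency bin and use additional weighting.

```python
import numpy as np

def _delayed_past(x, delay, num_taps):
    """Matrix whose row for time t holds x[t-delay], ..., x[t-delay-num_taps+1]."""
    n = len(x)
    start = delay + num_taps - 1  # first t with a full history available
    cols = [x[start - delay - k : n - delay - k] for k in range(num_taps)]
    return np.stack(cols, axis=1), start

def estimate_taps_batch(x, delay, num_taps):
    """Least-squares taps predicting x[t] from its delayed past.

    The solve uses every sample of x, so no tap value is available
    until the complete signal has been received.
    """
    x = np.asarray(x, dtype=float)
    past, start = _delayed_past(x, delay, num_taps)
    taps, *_ = np.linalg.lstsq(past, x[start:], rcond=None)
    return taps

def dereverberate_batch(x, delay, num_taps):
    """Subtract the predicted late reflections from x (illustrative only)."""
    x = np.asarray(x, dtype=float)
    taps = estimate_taps_batch(x, delay, num_taps)
    past, start = _delayed_past(x, delay, num_taps)
    y = x.copy()
    y[start:] -= past @ taps
    return y, taps

# Synthetic check: white-noise "source" plus one echo 3 samples later.
rng = np.random.default_rng(0)
source = rng.standard_normal(4000)
reverberant = source.copy()
reverberant[3:] += 0.5 * source[:-3]
cleaned, taps = dereverberate_batch(reverberant, delay=3, num_taps=10)
```

Because `estimate_taps_batch` consumes the full array before returning, dereverberated output for even the first samples is delayed until the utterance ends. The taps are also fixed for the whole signal, which hints at why a non-stationary source, whose reflections change over the course of the utterance, can degrade such an approach.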