Technical Field
The disclosed embodiments are directed to audio signal processing and, specifically, to systems, devices, and methods for enhancing the quality of an audio signal using sub-band deep neural network (DNN) systems.
Description of the Related Art
Currently, many devices utilize speech enhancement to process audio signals. Examples of such devices include “personal assistants” such as APPLE'S SIRI, GOOGLE HOME, AMAZON ALEXA, and various other devices. These devices include microphone elements that capture analog audio signals and attempt to perform speech recognition to convert human speech into data structures that a microprocessor can process. In general, these devices attempt to improve speech quality of a degraded audio signal by reducing the effect of background noise and compensating for low signal-to-noise ratios (SNRs).
Environmental noise and room reverberation are key factors in speech recognition and audio processing. These factors often severely degrade the functionality of a device, in some cases rendering the device inoperable. Current techniques for addressing these factors still result in less-than-ideal outputs. Compared to additive and multiplicative noise, reverberation poses different challenges in audio enhancement because a longer time span of an audio signal may be affected by reverberation. Thus, current techniques for addressing additive and multiplicative noise are incapable of adequately handling reverberation in audio signals.
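By way of illustration only, the following minimal sketch shows why reverberation affects a longer time span than sample-wise noise. It assumes that reverberation can be modeled as convolution of the clean signal with a room impulse response, which is a common model and is not stated in this disclosure. The names `synthetic_rir` and `reverberate`, the exponential decay shape, and all numeric values are hypothetical.

```python
import numpy as np

def synthetic_rir(length, decay, seed=0):
    """Hypothetical room impulse response: noise under an exponential decay."""
    rng = np.random.default_rng(seed)
    n = np.arange(length)
    rir = rng.standard_normal(length) * np.exp(-decay * n)
    rir[0] = 1.0  # direct path arrives unattenuated (assumption)
    return rir

def reverberate(clean, rir):
    """Assumed model: reverberant signal = clean signal convolved with the RIR."""
    return np.convolve(clean, rir)

# A single click at sample 10 in an otherwise silent signal.
clean = np.zeros(100)
clean[10] = 1.0

rir = synthetic_rir(length=50, decay=0.05)
reverberant = reverberate(clean, rir)

# Samples that carry energy after reverberation: the click is smeared
# across the length of the impulse response instead of staying at one sample.
affected = np.flatnonzero(np.abs(reverberant) > 1e-12)
print("first affected sample:", affected.min())
print("last affected sample:", affected.max())
```

Under this assumed model, a single sample of the clean signal influences many subsequent samples of the captured signal, whereas additive noise perturbs each sample independently of the signal's history.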
Currently, systems utilize various techniques to address the shortcomings of additive and multiplicative noise reduction. Some systems have utilized a minimum mean square error (MMSE) algorithm to reduce speech reverberation. However, MMSE results in a distorted output signal. Other systems have utilized weighted prediction error (WPE) algorithms based on linear prediction to estimate late reverberation under the assumption that the algorithms can linearly predict late reverberation based on previous signals. While this technique is adequate for multi-channel signals, the technique fails to accurately predict reverberation in single-channel cases.
The MMSE and WPE techniques are based on statistical assumptions of a room's acoustic model. Other techniques have used DNNs to predict a clean signal from a degraded signal. DNNs learn complex non-linear mappings using a training input set and a known output set. The use of DNNs to predict clean signals improves upon the MMSE and WPE approaches but is still lacking. Specifically, all DNN-based approaches utilize a full spectrum band as an input to a DNN. For example, some systems have relied upon determining an ideal ratio mask (IRM) as a DNN training target. Thus, these systems generate a complex IRM as a DNN output.
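For context, a ratio mask used as a training target can be sketched as follows. The disclosure does not define the mask, so this sketch assumes the commonly used real-valued form, in which each time-frequency bin holds the square root of the ratio of speech power to the sum of speech and noise power. It does not show the complex IRM mentioned above. The function name `ideal_ratio_mask`, the `eps` guard, and the example values are hypothetical.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Assumed real-valued IRM: sqrt(S / (S + N)) for each time-frequency bin.

    speech_power, noise_power: arrays of shape (frames, frequency_bins)
    holding power spectra of the clean speech and of the interference.
    eps avoids division by zero in bins where both powers are zero.
    """
    return np.sqrt(speech_power / (speech_power + noise_power + eps))

# Toy power spectra: 2 frames by 2 frequency bins (illustrative values only).
speech_power = np.array([[4.0, 0.0],
                         [1.0, 9.0]])
noise_power = np.array([[4.0, 1.0],
                        [3.0, 0.0]])

# The mask spans the full spectrum band of every frame, as in the
# full-band DNN approaches described above.
mask = ideal_ratio_mask(speech_power, noise_power)
print(mask)
```

Mask values lie between 0 and 1: values near 1 mark bins dominated by speech, and values near 0 mark bins dominated by interference. In a full-band approach, a DNN would be trained to output such a mask across all frequency bins at once.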
Thus, current systems exhibit a technical deficiency in processing degraded audio signals. Specifically, current technical solutions fail to produce clean audio signals from degraded audio signals using MMSE and WPE algorithms, among others. Further, these systems are unable to remove the effects of reverberation in a single-channel audio signal.