Audio event detection (AED) aims to identify the presence of a particular type of sound data within an audio signal. For example, AED may be used to identify the presence of the sound of a microwave oven running in a region of an audio signal. AED may also include distinguishing among various types of sound data within an audio signal. For example, AED may be used to classify sounds such as, for example, silence, noise, speech, a microwave oven running, or a train passing.
Speech activity detection (SAD), a special case of AED, aims to distinguish between speech and non-speech (e.g., silence, noise, music, etc.) regions within audio signals. SAD is frequently used as a preprocessing step in a number of applications such as, for example, speaker recognition and diarization, language recognition, and speech recognition. SAD is also used to assist humans in analyzing recorded speech for applications such as forensics, enhancing speech signals, and improving compression of audio streams before transmission.
A wide spectrum of approaches exists to address SAD. Such approaches range from very simple systems such as energy-based classifiers to extremely complex techniques such as deep neural networks. Although SAD has been performed for some time now, recent studies on real-life data have shown that state-of-the-art SAD and AED techniques lack generalization power.