Unimodal single-channel audio denoising and source separation are long-studied problems. They are especially difficult when the noise is very intense (overwhelming the signal) and non-stationary (structured). This setting is often referred to as the cocktail party problem, and it is particularly challenging when only a single sensor (microphone) is available. In audio-video (AV) studies, source separation assumes that all the audio sources are visible in the field of view, e.g., a couple of speakers are seen while they speak. AV analysis, in general, is an emerging topic, prompting studies in a range of interesting tasks. Some vision methods have been adapted to unimodal audio analysis.
In audio denoising, noise is commonly assumed to be stationary. Nevertheless, some unimodal source separation techniques successfully separate non-stationary sources. Music and speech signals have inherently different statistics. Thus, many algorithms are specific to one or the other, while some are oriented to both. In such methods, sparse representations of audio are used.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.