Audio sound source separation is the task of separating the different constituent sources within an audio mixture (the audio mixture comprising sound from a number of sources mixed in a sound field). Currently, most approaches to this problem operate ‘offline’, meaning that the entire audio mixture is available at the time of separation (generally in the form of a digital recording), rather than in ‘real-time’, where sources are separated as new audio data enter the system. In the cocktail party situation, the presence of multiple competing talkers can make it difficult to follow the information transmitted by a single source, but successful sound source separation can present the listener with the information from only a single talker at a time.
In order for sound source separation to be useful in real communication situations, it should be performed in real-time, or at very low latency. If a significant processing delay occurs between audio being spoken and audio being separated, the listener may be perturbed by the asynchrony between the talker's mouth movement and the corresponding audio, and may receive less benefit from possible lip-reading. Therefore, a sound source separation approach which operates at low latency (e.g. less than 20 ms between an audio sample entering and leaving the system) is advantageous. Current (additive mixture model based) sound source separation approaches rely on fairly long analysis frames (typically longer than 50 ms), which, if implemented directly, would violate the requirement for low latency.
In this context, we consider only what we refer to as ‘data latency’, in that it is assumed that the actual processing algorithms can be executed in time, given the correct implementation and computational power.
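The relation between frame length and data latency can be illustrated with a short calculation. The following is a minimal sketch in Python, assuming that data latency equals the time needed to collect one full analysis frame, and assuming a sample rate of 16 kHz; the function names and the sample rate are illustrative assumptions and are not taken from the text above.

```python
def data_latency_ms(frame_length_samples: int, sample_rate_hz: int) -> float:
    """Time needed to collect one full analysis frame, in milliseconds.

    Assumption: data latency is taken to be the frame duration only;
    computation time is ignored, as in the text.
    """
    return 1000.0 * frame_length_samples / sample_rate_hz


def max_frame_length(latency_budget_ms: float, sample_rate_hz: int) -> int:
    """Largest frame length (in samples) whose duration does not exceed the budget."""
    return int(latency_budget_ms * sample_rate_hz / 1000.0)


SAMPLE_RATE_HZ = 16000  # assumed sample rate, chosen for illustration only

print(data_latency_ms(1024, SAMPLE_RATE_HZ))   # 64.0 (a 64 ms frame)
print(data_latency_ms(256, SAMPLE_RATE_HZ))    # 16.0 (within a 20 ms budget)
print(max_frame_length(20.0, SAMPLE_RATE_HZ))  # 320
```

Under these assumptions, a frame of 1024 samples alone accounts for 64 ms of delay, whereas a frame of at most 320 samples stays within a 20 ms budget.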
A number of solutions to the problem of separating a two-talker mixture exist.
Some studies into real-time Nonnegative Matrix Factorization (NMF) have provided good results, but do not address window sizes small enough to produce the desired latency performance for hearing aid applications (<20 ms). Likewise, the Probabilistic Latent Component Analysis (PLCA) approach claims real-time performance, but operates on frames of length 64 ms, which does not satisfy the latency requirements of hearing aid users.
Until now, however, most NMF-based algorithms have been designed to run ‘offline’, i.e. the whole mixture signal to be separated/enhanced is available to the processing algorithm at once.
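To make the ‘offline’ setting concrete, the following is a minimal sketch of a generic NMF with the standard multiplicative updates for the Euclidean cost, written in Python with NumPy. It is not the method of any particular study referred to above. The function name nmf_offline, the number of components, the iteration count and the random toy matrix standing in for a magnitude spectrogram are all assumptions made for illustration. The point to note is that the complete matrix V, covering every frame of the mixture, must be available before the factorization starts.

```python
import numpy as np


def nmf_offline(V, n_components, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V (bins x frames) as V ~= W @ H.

    Offline: every frame of V is used in every update, so the whole
    mixture must be available before processing begins.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps   # nonnegative basis spectra
    H = rng.random((n_components, n_frames)) + eps  # nonnegative activations
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy nonnegative matrix standing in for a magnitude spectrogram (assumption).
rng = np.random.default_rng(1)
V = rng.random((64, 4)) @ rng.random((4, 100))

W, H = nmf_offline(V, n_components=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, rel_err)
```

A low-latency variant would instead have to estimate the activations for each short incoming frame as it arrives, which is where the frame-length constraint discussed above becomes relevant.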
Although some attempts to provide real-time solutions have been reported, there is a need for a solution that gives satisfactory results in a hearing device during normal operation.