In many situations, it is desirable to selectively listen to one of several audio sources that are interfering with each other. This source separation problem is often referred to as the “cocktail party problem”, since it can arise in that context for people having conversations in the presence of interfering talk. In signal processing, the source separation problem is often formulated as a problem of deriving an optimal estimate (e.g., a maximum likelihood estimate) of the original source signals given the received signals exhibiting interference. Multiple receivers are typically employed.
Although the theoretical framework of maximum likelihood (ML) estimation is well known, direct application of ML estimation to the general audio source separation problem typically encounters insuperable computational difficulties. In particular, reverberations typical of acoustic environments result in convolutive mixing of the interfering audio signals, in which each receiver observes filtered (delayed and echoed) versions of the sources, as opposed to the significantly simpler case of instantaneous mixing, in which each receiver observes a weighted sum of the current source samples. Accordingly, much work in the art has focused on simplifying the mathematical ML model (e.g., by making various approximations and/or simplifications) in order to obtain a computationally tractable ML optimization. Although such an ML approach is typically not optimal when the relevant simplifying assumptions do not hold, the resulting practical performance may be sufficient. Various simplified ML approaches have therefore been investigated in the art.
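The distinction between instantaneous and convolutive mixing can be illustrated with a minimal sketch. It assumes two sources and two receivers, white-noise stand-ins for the audio signals, and a single artificial echo in place of a real room response. The function names, the mixing matrix `A`, and the impulse responses `H` are hypothetical choices made for this illustration and do not come from any of the cited works.

```python
import numpy as np

def mix_instantaneous(sources, A):
    """Each receiver sees a weighted sum of the current source samples.

    sources: (n_sources, n_samples), A: (n_receivers, n_sources).
    """
    return A @ sources

def mix_convolutive(sources, H):
    """Each receiver sees a sum of filtered sources.

    H: (n_receivers, n_sources, n_taps), one impulse response per
    source/receiver pair (an assumed stand-in for room reverberation).
    """
    n_receivers, n_sources, n_taps = H.shape
    n_samples = sources.shape[1]
    out = np.zeros((n_receivers, n_samples + n_taps - 1))
    for i in range(n_receivers):
        for j in range(n_sources):
            out[i] += np.convolve(H[i, j], sources[j])
    return out

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1000))   # illustrative white-noise sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                 # assumed mixing gains

# Instantaneous case: if A is known, one matrix solve undoes the mixing.
x_inst = mix_instantaneous(sources, A)
recovered = np.linalg.solve(A, x_inst)

# Convolutive case: direct path plus one echo five samples later.
H = np.zeros((2, 2, 8))
H[:, :, 0] = A
H[:, :, 5] = 0.4 * A
x_conv = mix_convolutive(sources, H)

# The same matrix solve no longer recovers the sources.
naive = np.linalg.solve(A, x_conv[:, :sources.shape[1]])
```

In the instantaneous case a single matrix describes the whole mixing process, whereas in the convolutive case a full filter must be described for every source and receiver pair, which gives some sense of why the unsimplified problem is so much harder.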
For example, instantaneous mixing is considered in articles by Cardoso (IEEE Signal Processing Letters, v4, pp 112-114, 1997), and by Bell and Sejnowski (Neural Computation, v7, pp 1129-1159, 1995). Instantaneous mixing is also considered by Attias (Neural Computation, v11, pp 803-851, 1999), in connection with a more general source model than in the Cardoso or Bell articles.
A white (i.e., frequency independent) source model for convolutive mixing is considered by Lee et al. (Advances in Neural Information Processing Systems, v9, pp 758-764), and a filtered white source model for convolutive mixing is considered by Attias and Schreiner (Neural Computation, v10, pp 1373-1424, 1998). Convolutive mixing for more general source models is considered by Acero et al. (Proc. Intl. Conf. on Spoken Language Processing, v4, pp 532-535, 2000), by Parra and Spence (IEEE Trans. on Speech and Audio Processing, v8, pp 320-327, 2000), and by Attias (Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, 2003).
Various other source separation techniques have also been proposed. In U.S. Pat. No. 5,208,786, source separation based on requiring a near-zero cross-correlation between reconstructed signals is considered. In U.S. Pat. Nos. 5,694,474, 6,023,514, 6,978,159, and 7,088,831, estimates of the relative propagation delay between each source and each detector are employed to aid source separation. Source separation via wavelet analysis is considered in U.S. Pat. No. 6,182,018. Analysis of the pitch of a source signal to aid source separation is considered in U.S. 2005/0195990.
Conventional source separation approaches (both ML methods and non-ML methods) have not provided a complete solution to the source separation problem to date. Approaches that are computationally tractable tend to provide inadequate separation performance, while approaches that can provide good separation performance tend to be computationally intractable. Accordingly, it would be an advance in the art to provide audio source separation having an improved combination of separation performance and computational tractability.