The present invention, in some embodiments thereof, relates to a method and apparatus for isolation of audio and like sources and, more particularly, but not exclusively, to the use of cross-modal association and/or visual localization for the same.
The term multi-modal signal processing covers many areas of application. Herein we describe recent relevant studies conducted in the specific field of audio-visual analysis. Studies in this field have addressed many different tasks. Speech analysis is the most common, since it is an essential tool in many human-computer interfaces. For instance, speech recognition in noisy environments can use lip images in addition to speech sounds, which improves recognition performance [6, 65]. Other audio-visual tasks include source separation based on vision [16, 27, 61] and video event detection [66]. Such integration of different modalities is backed by evidence that biological systems also fuse cross-sensory information to enhance their ability to understand their surroundings [22, 24].
Additional background art includes:
[2] Z. Barzelay and Y. Y. Schechner. Harmony in motion. Proc. IEEE CVPR (2007).
[3] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler. A tutorial on onset detection in music signals. IEEE Trans. Speech and Audio Process., 5:1035-1047 (2005).
[5] S. Birchfield. An implementation of the Kanade-Lucas-Tomasi feature tracker. Available at www.ces.clemson.edu/stb/klt/.
[6] C. Bregler and Y. Konig. Eigenlips for robust speech recognition. Proc. IEEE ICASSP, vol. 2, pp. 667-672 (1994).
[10] D. Chazan, Y. Stettiner, and D. Malah. Optimal multi-pitch estimation using the EM algorithm for co-channel speech separation. Proc. IEEE ICASSP, vol. 2, pp. 728-731 (1993).
[12] J. Chen, T. Mukai, Y. Takeuchi, T. Matsumoto, H. Kudo, T. Yamamura, and N. Ohnishi. Relating audio-visual events caused by multiple movements: in the case of entire object movement. Proc. Inf. Fusion, pp. 213-219 (2002).
[13] T. Choudhury, J. Rehg, V. Pavlovic, and A. Pentland. Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection. Proc. ICPR, vol. 3, pp. 789-794 (2002).
[16] T. Darrell, J. W. Fisher, P. A. Viola, and W. T. Freeman. Audio-visual segmentation and the cocktail party effect. Proc. ICMI, pp. 1611-3349 (2000).
[27] J. Hershey and M. Casey. Audio-visual sound separation via hidden Markov models. Proc. NIPS, pp. 1173-1180 (2001).
[28] J. Hershey and J. R. Movellan. Audio vision: Using audio-visual synchrony to locate sounds. Proc. NIPS, pp. 813-819 (1999).
[34] Y. Ke, D. Hoiem, and R. Sukthankar. Computer vision for music identification. Proc. IEEE CVPR, vol. 1, pp. 597-604 (2005).
[35] E. Kidron, Y. Y. Schechner, and M. Elad. Pixels that sound. Proc. IEEE CVPR, vol. 1, pp. 88-95 (2005).
[37] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. Proc. IEEE ICASSP, vol. 6, pp. 3089-3092 (1999).
[43] G. Monaci and P. Vandergheynst. Audiovisual gestalts. Proc. IEEE Worksh. Percept. Org. in Comp. Vis. (2006).
[48] T. W. Parsons. Separation of speech from interfering speech by means of harmonic selection. Journal of the Acoustical Society of America, 60:911-918 (1976).
Cliffs, N.J.: Prentice-Hall (1978).
[53] S. Rajaram, A. Nefian, and T. Huang. Bayesian separation of audio-visual speech sources. Proc. IEEE ICASSP, vol. 5, pp. 657-660 (2004).
Spatio-temporal Analysis. ACM Multimedia (2003).
[55] S. Ravulapalli and S. Sarkar. Association of sound to motion in video using perceptual organization. Proc. IEEE ICPR, pp. 1216-1219 (2006).
[57] S. T. Roweis. One microphone source separation. Proc. NIPS, pp. 793-799 (2001).
[58] Y. Rui and P. Anandan. Segmenting visual actions based on spatio-temporal motion patterns. Proc. IEEE CVPR, vol. 1, pp. 13-15 (2000).
[60] J. Shi and C. Tomasi. Good features to track. Proc. IEEE CVPR, pp. 593-600 (1994).
[61] P. Smaragdis and M. Casey. Audio/visual independent components. Proc. ICA, pp. 709-714 (2003).
[63] T. Syeda-Mahmood. Segmenting actions in velocity curve space. Proc. ICPR, vol. 4 (2002).
[64] C. Tomasi and T. Kanade. Detection and tracking of point features. Carnegie Mellon University Technical Report CMU-CS-91-132 (April 1991).
[65] M. J. Tomlinson, M. J. Russell, and N. M. Brooke. Integrating audio and visual information to provide highly robust speech recognition. Proc. IEEE ICASSP, vol. 2, pp. 821-824 (1996).
[66] Y. Wang, Z. Liu, and J. C. Huang. Multimedia content analysis-using both audio and visual clues. IEEE Signal Processing Magazine, 17:12-36 (2004).
[69] O. Yilmaz and S. Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Sig. Process., 52:1830-1847 (2004).