1. Field of the Invention
The invention relates to a method and apparatus for target detection, and more particularly, to a method and apparatus that can detect, localize, and track multiple target objects observed by audio and video sensors, where the objects can be concurrent in time but separate in space.
2. Description of the Related Art
Generally, when attempting to detect a target, existing apparatuses and methods rely on either visual or audio signals. For audio tracking, time-delay estimates (TDE) are used. However, even with a weighting function derived from a maximum likelihood approach and a phase transform to cope with ambient noise and reverberation, TDE-based techniques remain vulnerable to contamination from explicit directional noise.
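By way of illustration only, time-delay estimation with a phase-transform weighting can be sketched as follows. The sketch assumes two single-channel microphone signals sampled at a common rate and uses the generalized cross-correlation with phase transform (GCC-PHAT), which is one common form of phase-transform weighting. The function name `gcc_phat_delay` and its parameters are hypothetical and are not taken from any cited reference.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref`.

    Hypothetical helper: a minimal GCC-PHAT sketch, not a method
    disclosed in the cited references.
    """
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12       # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)        # generalized cross-correlation
    max_shift = n // 2
    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

# Usage: a white-noise reference and a copy delayed by 5 samples.
rng = np.random.default_rng(0)
fs = 16000
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref[:-5]))
delay = gcc_phat_delay(sig, ref, fs)
print(round(delay * fs))  # estimated delay in samples
```

Because the estimate is taken from a single correlation peak, a strong directional interferer can produce a competing peak, which is consistent with the vulnerability noted above.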
As for video tracking, object detection can be performed by comparing images using the Hausdorff distance, as described in D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, “Comparing Images using the Hausdorff Distance under Translation,” in Proc. IEEE Int. Conf. CVPR, 1992, pp. 654-656. This method is simple and robust under scaling and translation, but consumes considerable time because all of the candidate images of various scales must be compared.
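By way of illustration only, the Hausdorff distance between two sets of edge points, and a brute-force search over candidate translations, can be sketched as follows. The sketch assumes that each image has already been reduced to an array of 2-D point coordinates. The function names and the exhaustive search are hypothetical simplifications and do not reproduce the algorithm of the cited reference.

```python
import numpy as np

def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # pairwise distances
    return d.min(axis=1).max()

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets `a` and `b`."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def best_translation(model, image, shifts):
    """Exhaustively test candidate translations (hypothetical helper)."""
    best = min(shifts, key=lambda t: hausdorff(model + np.asarray(t), image))
    return best, hausdorff(model + np.asarray(best), image)

# Usage: the image is the model translated by (2, 3).
model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
image = model + np.array([2.0, 3.0])
shifts = [(dx, dy) for dx in range(-4, 5) for dy in range(-4, 5)]
print(best_translation(model, image, shifts))
```

Every candidate translation requires a full pairwise distance computation, and repeating the search for each candidate scale multiplies that cost, which reflects the time consumption noted above.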
Additionally, there is a further problem in detecting and separating targets where overlapping speech or sounds emanate from different targets. Overlapping speech occupies a central position in segmenting audio into speaker turns, as set forth in E. Shriberg, A. Stolcke, and D. Baron, “Observations on Overlap: Findings and Implications for Automatic Processing of Multi-party Conversation,” in Proc. Eurospeech, 2001. Results on the segmentation of overlapping speech with a microphone array have been reported using binaural blind signal separation, dual-speaker hidden Markov models, and a speech/silence ratio incorporating Gaussian distributions to model speaker locations with time delay estimates. Examples of these results are set forth in C. Choi, “Real-time Binaural Blind Source Separation,” in Proc. Int. Symp. ICA and BSS, pp. 567-572, 2003; G. Lathoud and I. A. McCowan, “Location based Speaker Segmentation,” in Proc. ICASSP, 2003; and G. Lathoud, I. A. McCowan, and D. C. Moore, “Segmenting Multiple Concurrent Speakers using Microphone Arrays,” in Proc. Eurospeech, 2003. Speaker tracking using a panoramic image from a five-video-stream input and a microphone array is reported in R. Cutler et al., “Distributed Meetings: A Meeting Capture and Broadcasting System,” in Proc. ACM Int. Conf. Multimedia, 2002, and Y. Chen and Y. Rui, “Real-time Speaker Tracking using Particle Filter Sensor Fusion,” Proc. of the IEEE, vol. 92, no. 3, pp. 485-494, 2004. These methods are the two extremes of concurrent speaker segmentation: one approach depends solely on audio information, while the other approach depends mostly on video.
However, neither approach effectively uses both video and audio inputs to separate overlapped speech. Further, the method disclosed by Y. Chen and Y. Rui uses a great deal of memory, since all of the received audio data is recorded, and it does not use the video and audio inputs to separate each speech from among multiple concurrent speeches so that each separated speech is identified as coming from a particular speaker.