Single-modality systems for localizing and tracking moving objects based only on either audio or visual cues are known. These single-modality systems are not suitable for multimedia applications such as teleconferencing, multimedia kiosks and interactive games, where accurate localizing and tracking of an object or audio-visual source, typically a person, is important.
For example, in Rashid, R. F., "Towards A System For The Interpretation Of Moving Light Displays", 2 IEEE Transactions on Pattern Analysis and Machine Intelligence, 574-581 (1980), a method is described for interpreting moving light displays (MLD). In general, Rashid teaches segmenting out from MLD images individual points corresponding to moving people. The individual points are grouped together to form clusters based on, inter alia, the positions and velocities of the individual points; the formed clusters represent individual objects. Tracking is performed by matching points between consecutive frames based on the relative distances between the locations of points in the current frame and the predicted locations of points from the previous frame. The predicted position is based on the average velocity of the point in the previous frame, and the relative distance is calculated using a Euclidean function.
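The frame-to-frame matching step described above can be illustrated with a minimal sketch. It is not Rashid's implementation: the function names, the one-frame constant-velocity prediction, and the greedy closest-pair assignment are all assumptions made for illustration, with the Euclidean distance taken between each predicted position and each point observed in the current frame.

```python
import math


def predict(position, velocity):
    """Predict a point's next position from its average velocity.

    Assumption: velocity is expressed in pixels per frame, so one frame
    of motion is simply added to the previous position.
    """
    return (position[0] + velocity[0], position[1] + velocity[1])


def match_points(prev_points, prev_velocities, curr_points):
    """Match previous-frame points to current-frame points.

    Returns a dict mapping an index into prev_points to an index into
    curr_points. Pairs are accepted greedily in order of increasing
    Euclidean distance between predicted and observed positions
    (a hypothetical assignment rule, chosen for simplicity).
    """
    predictions = [predict(p, v) for p, v in zip(prev_points, prev_velocities)]

    # Every (distance, previous index, current index) candidate, closest first.
    candidates = sorted(
        (math.dist(pred, cur), i, j)
        for i, pred in enumerate(predictions)
        for j, cur in enumerate(curr_points)
    )

    matches = {}
    used_current = set()
    for _, i, j in candidates:
        if i not in matches and j not in used_current:
            matches[i] = j
            used_current.add(j)
    return matches
```

With two points moving in different directions, `match_points([(0, 0), (10, 10)], [(1, 0), (0, -1)], [(10, 9.2), (1.1, 0.1)])` pairs each previous point with the observation nearest its predicted position, regardless of the order in which the current-frame points are listed.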
The technique described by Rashid has several drawbacks. Specifically, the MLD system requires several frames before a good object separation is obtained, and no criterion is provided for determining when satisfactory object separation has occurred. In addition, no mechanism is provided for propagating the generated clusters to prior and subsequent frames for continuity in the motion representation.
In another camera-only tracking system described by Rossi, M. and Bozzoli, A., in "Tracking And Counting Moving People", Proceedings Of The Second IEEE International Conference On Image Processing, 212-16 (1994), a vertically mounted camera is employed for tracking and counting moving people. This system operates under the assumption that people enter a scene along either the top or bottom of the image where alerting zones are positioned for detecting people moving into the scene. A major drawback of this system, however, is that in reality people can also appear in a scene, inter alia, from behind another object or from behind an already-identified person. In other words, people may be wholly or partially occluded upon initially entering a scene and such persons would not be identified by this system. The problem of identifying occluded persons is also present in the system described by Rohr, K., in "Towards Model Based Recognition Of Human Movements In Image Sequences", 59 Computer Vision, Graphics And Image Processing: Image Understanding, 94-115 (1994).
In addition, the systems described by Smith, S. M., and Brady, J. M., in "A Scene Segmenter: Visual Tracking of Moving Vehicles", 7 Engineering Applications Of Artificial Intelligence 191-204 (1994); and in "ASSET-2: Real-Time Motion Segmentation And Shape Tracking", 17 IEEE Transactions On Pattern Analysis And Machine Intelligence, 814-20 (1995), are designed specifically for tracking objects such as moving vehicles, and accordingly identify features representing corners or abrupt changes on the boundaries of the vehicles. This approach relies on, and indeed requires, the tracked objects being rigid, which permits the use of constant-velocity or constant-acceleration models. Because people are not rigid bodies, this technique is unsuitable for tracking them.
Localization systems based only on microphones are also inadequate because they are highly susceptible to multipath interference. In a reverberant environment, the microphones receive both direct-path acoustic waves (arriving directly from the audio source) and indirect-path acoustic waves that echo or reverberate off large surfaces. This problem is further exacerbated if multiple speakers talk simultaneously.
Prior art multimedia conferencing systems employing a single camera and a plurality of microphones are also known. However, in these systems, the microphones are merely used to guide the field of view of a camera toward the speaker seated at a conference table or otherwise predeterminedly positioned relative to the microphones.
For example, U.S. Pat. No. 4,264,928 to Schober discloses a conference video system having a microphone disposed at each of the conference seats arranged in a row, and an automatically adjustable mirror disposed above these seats for aiming the camera's field of view toward the speaker. The system uses the times at which two adjacent microphones receive a speaker's voice to generate the requisite control signals used for driving a servomotor to position the mirror. This system, however, does not localize or track a plurality of speakers who may speak simultaneously and move about.
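As a general illustration of how arrival times at two adjacent microphones can indicate a speaker's direction, the following sketch converts a time difference into a bearing. It is not Schober's control circuit: the far-field (plane-wave) model, the assumed speed of sound, and the function name `bearing_from_delay` are all assumptions introduced here for illustration.

```python
import math

# Assumed speed of sound in room-temperature air, in metres per second.
SPEED_OF_SOUND = 343.0


def bearing_from_delay(delay_s, mic_spacing_m, speed=SPEED_OF_SOUND):
    """Estimate a source bearing from the arrival-time difference.

    delay_s is the arrival time at the second microphone minus the
    arrival time at the first. Under a far-field assumption the extra
    path length is speed * delay_s, so the bearing measured from the
    broadside direction of the microphone pair is
    asin(speed * delay_s / mic_spacing_m), returned in degrees.
    """
    ratio = speed * delay_s / mic_spacing_m
    # Clamp so that measurement noise cannot push asin out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A zero delay corresponds to a source directly in front of the pair, and the sign of the delay indicates which of the two microphones is nearer the speaker. In a real room, the reverberation noted above would corrupt the measured delay, which this sketch does not model.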
U.S. Pat. No. 5,686,957 to Baker discloses an automatic, voice-directional video camera image steering system for teleconferencing. The system uses a video camera with a hemispheric lens disposed at the center of the conference table to capture a panoramic but distorted video scene around the table. To determine the direction of a speaker relative to the camera lens, the system employs an array of microphones disposed on the table and around the hemispheric lens. An audio detection circuit connected to the microphones provides information concerning the general direction of the speaker so that the system can select and display the appropriate image segment containing the speaker in the proper viewing aspect ratio using view-warping techniques. Manual camera movement or automated mechanical camera movement such as, for example, panning and zooming is thereby eliminated. However, this system, like that of Schober, also does not track or localize a speaker.
It is therefore desirable to provide a robust localization and tracking method and system that overcome the aforementioned deficiencies of the prior art systems through the integrated use of audio and visual cues, and that localize and track a plurality of objects, typically people, some of whom may at times move outside a camera's field of view, speak at the same time, or remain silent for a period.