1. Technical Field
The invention is related to systems and processes for tracking an object state over time using sensor fusion techniques, and more particularly to such a system and process having a two-level, closed-loop, particle filter sensor fusion architecture.
2. Background Art
Sensor fusion for object tracking has become an active research topic during the past few years. How to do it in a robust and principled way, however, is still an open problem. The problem is of particular interest in the context of tracking the location of a speaker. Distributed meetings and lectures have been gaining in popularity [4, 14]. A key technology component in these systems is a reliable speaker tracking module. For instance, if the system knows the speaker location, it can dynamically point a camera so that the speaker stays within the view of a remote audience. There are commercial video conferencing systems that provide speaker tracking using audio-based sound source localization (SSL). While tracking the speaker using SSL can provide a much richer experience to remote audiences than using a static camera, there is significant room for improvement. Essentially, SSL techniques are good at detecting a speaker, but do not perform well for tracking, especially when the person of interest is not constantly talking.
A more reliable speaker tracking technique involves the fusion of high-performance audio-based SSL with vision-based tracking techniques to establish and track the location of a speaker. This type of sensor fusion is reported in [2] and [13]. It is noted that the term “sensor” is used herein in a generalized way. It represents a logical sensor instead of a physical sensor. For example, both vision-based contour and color tracking techniques would be considered logical sensors for the purposes of the present tracking system and process, but are based on the same physical sensor—i.e., a video camera. In addition, sensors can perform different tasks depending on the complexity of the sensor algorithms. For example, some sensors perform tracking and are called trackers, while others merely perform verification (e.g., computing a state likelihood) and are called verifiers.
In general, there are two existing paradigms for sensor fusion: bottom-up and top-down. Both paradigms have a fuser and multiple sensors. The bottom-up paradigm starts from the sensors. Each sensor has a tracker that tries to solve an inverse problem, namely estimating the unknown state based on the sensory data. To make the inverse problem tractable, assumptions are typically made in the trackers. For example, system linearity and Gaussianity are assumed in conventional Kalman-type trackers. These assumptions significantly reduce the problem complexity, so the trackers can run in real time. Once the individual tracking results are available, relatively simple distributed sensor networks [1] or graphical models [15] are used to perform the sensor fusion task. While the assumptions make the problem tractable, they inherently hinder the robustness of the bottom-up techniques.
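By way of illustration only, the kind of tracker used in the bottom-up paradigm can be sketched as a one-dimensional Kalman filter. The sketch assumes a linear random-walk motion model and Gaussian noise, which are exactly the simplifying assumptions discussed above. The names Gaussian, kalman_predict, kalman_update and track, and all noise parameters, are hypothetical and are not taken from any cited reference.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Belief about a scalar state, e.g. a speaker's horizontal position."""
    mean: float
    var: float


def kalman_predict(belief, process_var):
    # Assumed random-walk motion model: the mean is unchanged and the
    # uncertainty grows by the (hypothetical) process noise variance.
    return Gaussian(belief.mean, belief.var + process_var)


def kalman_update(belief, measurement, measurement_var):
    # Linear-Gaussian update: blend prediction and measurement
    # according to their relative uncertainties.
    gain = belief.var / (belief.var + measurement_var)
    mean = belief.mean + gain * (measurement - belief.mean)
    var = (1.0 - gain) * belief.var
    return Gaussian(mean, var)


def track(measurements, prior, process_var, measurement_var):
    # Run predict/update once per measurement and keep every estimate.
    belief = prior
    estimates = []
    for z in measurements:
        belief = kalman_predict(belief, process_var)
        belief = kalman_update(belief, z, measurement_var)
        estimates.append(belief)
    return estimates
```

Each step costs only a few arithmetic operations, which is why such trackers run in real time. The belief is always a single Gaussian, however, so the sketch cannot represent a multi-modal state distribution, such as two plausible speaker locations.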
The top-down paradigm, on the other hand, emphasizes the top. Namely, it uses intelligent fusers but simple sensors (e.g., verifiers) [17, 9]. This paradigm solves the forward problem, i.e., evaluating a given hypothesis using the sensory data. First, the fuser generates a set of hypotheses (also called particles) to cover the possible state space. All the hypotheses are then sent down to the verifiers. The verifiers compute the likelihood of the hypotheses and report back to the fuser. The fuser then uses the weighted hypotheses to estimate the distribution of the object state. Note that it is usually much easier to verify a given hypothesis than to solve the inverse tracking problem (as in the bottom-up paradigm). Therefore, more complex object models (e.g., non-linear and non-Gaussian models) can be used in the top-down paradigm. This in turn results in more robust tracking. This paradigm, however, is inefficient. For example, because the sensors have verifiers instead of trackers, they do not help the fuser to generate good hypotheses. The hypotheses are semi-blindly generated [17], and some can represent low-likelihood regions, thus lowering efficiency [10]. Further, in order to cover the state space sufficiently well, a large number of hypotheses is needed, and this requires extensive computing power.
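By way of illustration only, the hypothesize-and-verify loop of the top-down paradigm can be sketched as follows for a scalar state. The sketch makes several assumptions not stated above: hypotheses are drawn uniformly (that is, blindly) over a known interval, each verifier returns an unnormalized Gaussian likelihood, and the verifiers are treated as conditionally independent so that their likelihoods can be multiplied. All function names and parameters are hypothetical.

```python
import math
import random


def generate_hypotheses(n, low, high, rng):
    # Fuser: draw n hypotheses (particles) blindly over the state space.
    return [rng.uniform(low, high) for _ in range(n)]


def make_gaussian_verifier(observation, sigma):
    # Verifier: scores a hypothesis against one logical sensor's
    # observation with an assumed unnormalized Gaussian likelihood.
    def verify(hypothesis):
        return math.exp(-0.5 * ((hypothesis - observation) / sigma) ** 2)
    return verify


def fuse(hypotheses, verifiers):
    # Fuser: multiply the likelihoods reported by all verifiers
    # (independence assumption), then normalize into weights.
    weights = []
    for h in hypotheses:
        w = 1.0
        for verify in verifiers:
            w *= verify(h)
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        # Every hypothesis was rejected; fall back to uniform weights.
        return [1.0 / len(hypotheses)] * len(hypotheses)
    return [w / total for w in weights]


def estimate_state(hypotheses, weights):
    # Weighted mean of the hypotheses as a point estimate of the state.
    return sum(h * w for h, w in zip(hypotheses, weights))


if __name__ == "__main__":
    rng = random.Random(0)
    particles = generate_hypotheses(1000, 0.0, 10.0, rng)
    verifiers = [make_gaussian_verifier(3.0, 0.5),
                 make_gaussian_verifier(3.4, 0.5)]
    weights = fuse(particles, verifiers)
    print(estimate_state(particles, weights))
```

Because the hypotheses are drawn without guidance from the sensors, most of them fall in low-likelihood regions and receive negligible weight. This is the inefficiency noted above, and it worsens as the dimensionality of the state space increases.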
Thus, the bottom-up paradigm can provide fast tracking results, but at the expense of simplifying assumptions. On the other hand, the top-down paradigm does not require simplifying assumptions but needs extensive computation because the hypotheses can be very poor. Furthermore, a common drawback of both paradigms is that they are open-loop systems. For example, in the bottom-up paradigm, the fuser does not go back to the tracker to verify how reliable the tracking results are. Similarly, in the top-down paradigm, the sensors do not provide cues to the fuser to help generate more effective hypotheses. The present speaker tracking system and process provides a novel sensor fusion framework that utilizes the strengths of both paradigms while avoiding their limitations.
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. Multiple references will be identified by a pair of brackets containing more than one designator, for example, [2, 3]. A listing of references including the publications corresponding to each designator can be found at the end of the Detailed Description section.