1. Field of the Invention
The present invention relates to a system and method for visually tracking objects by fusing results of multiple sensing modalities of a model, and in particular to a model such as a Bayesian network, that can be trained offline from data collected from a sensor, and wherein dependencies considered in the model can be restructured with Bayesian learning methods that identify new dependencies.
2. Related Art
Applications of real-time vision-based object detection and tracking are becoming increasingly important for providing new classes of services to users based on an assessment of the presence, position, and trajectory of objects. Research on computer-based motion analysis of digital video scenes centers on the goal of detecting and tracking objects of interest, typically via the analysis of the content of a sequence of images. Each image is defined by plural objects, which are typically nebulous collections of pixels that satisfy some property. Each object can occupy a region or regions within each image and can change its relative location throughout subsequent images and the video scene. These objects are considered moving objects, which form motion within a video scene.
Facial objects of a human head, such as mouth, eyes, nose, etc., can be types of moving objects within a video scene. It is very desirable to automatically track movement of these facial objects because successful digital motion analysis of facial movement has numerous applications in real world environments. For example, one application includes facial expression analysis for automatically converting facial expressions into computer readable input for performing computer operations and for making decisions based on human emotions derived from the facial expressions. Another application is for digital speech recognition and "lip reading" for automatically recognizing human speech without requiring human vocal input or for receiving the speech as computer instructions. Another application is the visual identification of the nature of the ongoing activity of one or more individuals so as to provide context-sensitive informational display, assistance, and communications.
However, current real-time tracking systems, which depend on various visual processing modalities, such as color, motion, and edge information, are often confused by waving hands or changing illumination. Also, specific visual processing modalities may work well in certain situations but fail dramatically in others, depending on the nature of the scene being processed. Current visual modalities, used singly, are neither consistent enough to detect all heads nor discriminating enough to detect heads robustly. Color, for example, changes with shifts in illumination. Moreover, "skin color" is not restricted to skin.
As such, in the past, a variety of techniques have been investigated to unify the results of sets of sensors. Recent techniques have attempted to perform real-time head tracking by combining multiple visual cues. One previous technique used variations of a probabilistic data association filter to combine color and edge data for tracking a variety of objects. Another previous technique used priors from color data to bias estimation based on edge data within their framework. Another technique uses edge and color data. Head position estimates are made by comparing match scores based on image gradients and color histograms. The estimate from the more reliable modality is returned. Another technique heuristically integrates color data, range data, and frontal face detection for tracking.
Methods employing dynamic models such as Bayesian networks have the ability to fuse the results of multiple modalities of visual analysis. The structure of such models can be based on key patterns of dependency including subassemblies of the overall dependency model that relate the inferred reliabilities of each modality to the true state of the world. The parameters of these models can be assessed manually through a reliance on expert knowledge about the probabilistic relationships.
Nevertheless, these systems and techniques do not reliably and effectively combine the results of multiple modes of analysis, nor do they make use of ideal parameters that are derived from a consideration of data that can be collected experimentally. Therefore, what is needed is a system and method for training a dynamic model, such as a Bayesian network, to effectively capture probabilistic dependencies between the true state of the object being tracked and evidence from the tracking modalities. Such a system can be used to enhance a model constructed by an expert, or to eliminate the need for a person to assess the ideal parameters of the Bayesian model.
To overcome the limitations in the related art described above, and to overcome other limitations that will become apparent upon reading and understanding the present application, the present invention is embodied in a system and method for training a dynamic model, such as a Bayesian network, to effectively capture probabilistic dependencies between the true state of an object being tracked and evidence from various tracking modalities. The system and method of the present invention fuse results of multiple sensing modalities to automatically infer the structure of a dynamic model, such as a Bayesian network, to achieve robust digital vision tracking. The model can be trained and structured offline using data collected from a sensor, which may be either vision-based or non-vision-based, in conjunction with position estimates from the sensing modalities. Further, models based on handcrafted structures and probability assessments can also be enhanced by training the models with experimentally derived real-world data.
Automated methods for identifying variable dependencies within the model are employed to discover new structures for the probabilistic dependency models that are more ideal in that they better explain the data. Dependencies considered in the model can be restructured with Bayesian learning methods that identify new dependencies in the model. Further, the model can automatically adapt its position estimates by detecting changes in indicators of reliability of one or more modalities.
In general, context-sensitive accuracies are inferred for fusing the results of multiple vision processing modalities for tracking tasks in order to achieve robust vision tracking, such as head tracking. This is accomplished by fusing together reports from several distinct vision processing procedures. In addition to the reports themselves, the vision processing modalities report evidence relevant to the accuracy of each modality's reports.
Evidence about the operating context of the distinct modalities is considered, and the accuracy of different modalities is inferred from sets of evidence with relevance to identifying the operating regime in which a modality is operating. In other words, observations of evidence about features in the data being analyzed by the modalities, such as a vision scene, are considered in inferring the reliability of a method's report. The reliabilities are used in the Bayesian integration of multiple reports. Offline training of the model increases the accuracy of the inferences of object position that are derived from the model.
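As a rough illustration of how inferred reliabilities might weight the reports of several modalities, the following minimal sketch combines one-dimensional position reports by a reliability-weighted average. This is a simplification offered for intuition only: a weighted average stands in for inference in the full Bayesian network, and the names Report and fuse_reports, as well as the reliability scale between 0 and 1, are assumptions that do not appear in the description above.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One modality's report (hypothetical structure for illustration)."""
    position: float     # position estimate reported by the modality
    reliability: float  # inferred reliability, assumed to lie in [0, 1]


def fuse_reports(reports):
    """Return the reliability-weighted mean of the reported positions.

    Assumption: reliabilities act as non-negative weights. This is a
    stand-in for full Bayesian network inference, not a replacement.
    """
    total = sum(r.reliability for r in reports)
    if total <= 0:
        raise ValueError("at least one report needs positive reliability")
    return sum(r.position * r.reliability for r in reports) / total


if __name__ == "__main__":
    # A reliable color report and a less reliable motion report.
    reports = [Report(position=10.0, reliability=0.75),
               Report(position=20.0, reliability=0.25)]
    print(fuse_reports(reports))
```

In this sketch the fused estimate is pulled toward the report with the higher inferred reliability, so a modality that is operating outside its favorable regime contributes less to the result.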
Specifically, dynamic Bayesian modality-accuracy models are built either manually, or automatically by a system and method in accordance with the present invention. Reports from multiple vision processing modalities of the models are fused together with appropriate weighting to infer an object's position. Bayesian network learning algorithms are used to learn the dependencies among variables to infer the structure of the models as well as to restructure and increase the accuracy of the models through training. Structuring and training of the models may be accomplished by providing sets of training cases that incorporate ground truth data obtained by using a sensor to accurately provide object position information, estimates of position produced by each modality, reliability indicators for each modality, and the "ground-truth reliability." The ground-truth reliability is a measure of the reliability of position information inferred from each modality with respect to the absolute difference between the position data provided by the sensor and the position estimates inferred by each modality.
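The assembly of a training case described above can be sketched as follows. The description states only that the ground-truth reliability is measured with respect to the absolute difference between the sensor position and each modality's estimate; the linear mapping from that difference to a score between 0 and 1, the scale parameter, and the names ground_truth_reliability and make_training_case are all assumptions introduced for illustration.

```python
def ground_truth_reliability(sensor_pos, estimate, scale=50.0):
    """Map the absolute position error to a score in [0, 1].

    Assumption: reliability falls linearly from 1 (no error) to 0
    (error >= scale). The source does not specify this mapping.
    """
    error = abs(sensor_pos - estimate)
    return max(0.0, 1.0 - error / scale)


def make_training_case(sensor_pos, estimates, indicators, scale=50.0):
    """Bundle one training case from the four ingredients named above.

    estimates:  dict of modality name -> position estimate
    indicators: dict of modality name -> reliability indicator value
    """
    return {
        "ground_truth_position": sensor_pos,
        "estimates": dict(estimates),
        "reliability_indicators": dict(indicators),
        "ground_truth_reliability": {
            name: ground_truth_reliability(sensor_pos, est, scale)
            for name, est in estimates.items()
        },
    }


if __name__ == "__main__":
    case = make_training_case(
        sensor_pos=100.0,
        estimates={"color": 75.0, "motion": 150.0},
        indicators={"color": 0.9, "motion": 0.2},
    )
    print(case["ground_truth_reliability"])
```

A collection of such cases, gathered over many frames, is the kind of data set to which a Bayesian network learning algorithm could then be applied.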
The foregoing and still further features and advantages of the present invention as well as a more complete understanding thereof will be made apparent from a study of the following detailed description of the invention in connection with the accompanying drawings and appended claims.