1. Technical Field
The invention is related to a system and process for automatically generating a reliable vision-based tracking system, and more particularly, to a system and process for using information gathered from an initial object tracking system to automatically learn an object model tailored to at least one specific target object to create a tracking system more reliable than the initial object tracking system.
2. Related Art
Most current systems for determining the presence of objects of interest in an image of a scene have involved processing a temporal sequence of color or grayscale images of a scene using a tracking system. Objects are typically recognized, located and/or tracked in these systems using, for example, color-based, edge-based, shape-based, or motion-based tracking schemes to process the images.
While the aforementioned tracking systems are useful, they do have limitations. For example, such object tracking systems typically use a generic object model having parameters that roughly represent an object for which tracking is desired in combination with a tracking function such as, for example, a color-based, edge-based, shape-based, or motion-based tracking function. In general, such object tracking systems use the generic object model and tracking function to probabilistically locate and track at least one object in one or more sequential images.
As the fidelity of the generic object model increases, the accuracy of the tracking function also typically increases. However, it is not generally possible to create a single high fidelity object model that ideally represents each of the many potential derivatives or views of a single object type, such as the faces of different individuals having different skin coloration, facial structure, hair type and style, etc., under any of a number of lighting conditions. Consequently, such tracking systems are prone to error, especially where the actual parameters defining the target object deviate in one or more ways from the parameters defining the generic object model.
In an attempt to address this issue, some work has been done to improve existing object models. For example, in some facial pose tracking work, 3D points on the face are adaptively estimated or learned using Extended Kalman Filters (EKF) [1,6]. In such systems, care must be taken to manually structure the EKF correctly [3], but doing so ensures that as the geometry of the target face is better learned, tracking improves as well.
Other work has focused on learning the textural qualities of target objects for use in tracking those objects. In the domain of facial imagery, there is work in which skin color has been modeled as a parametrized mixture of n Gaussians in some color space [7, 8]. Such work has covered both batch [7] and adaptive [8] learning with much success. These systems typically use an expectation-maximization learning algorithm for learning the parameters, such as skin color, associated with specific target objects.
Although color distributions are a gross quality of object texture, learning localized textures of target objects is also of interest. Consequently, other work has focused on intricate facial geometry and texture, using an array of algorithms to recover fine detail [4] of the textures of a target object. These textures are then used in subsequent tracking of the target object.
Finally, work has been done in learning the dynamic geometry, i.e. the changing configuration (pose or articulation), of a target. The most elementary of such systems use one of the many variations of the Kalman Filter, which "learns" a target's geometric state [2]. In these cases, the value of the learned model is fleeting since few targets ever maintain constant dynamic geometries. Other related systems focus on models of motion. Such systems include learning of multi-state motion models of targets that exhibit a few discrete patterns of motion [5, 9].
However, the aforementioned systems typically require manual intervention in learning or fine-tuning those tracking systems. Consequently, it is difficult or impossible for such systems to quickly respond to the dynamic environment often associated with tracking moving target objects under possibly changing lighting conditions. Therefore, in contrast to the aforementioned systems, what is needed is a system and process for automatically learning a reliable tracking system during tracking without the need for manual intervention and training of the automatically learned tracking system. The system and process according to the present invention resolves the deficiencies of current locating and tracking systems by automatically learning, during tracking, a more reliable tracking system tailored to specific target objects under automatically observed conditions.
It is noted that in the preceding paragraphs, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, "reference [1]" or simply "[1]". Multiple references are identified by a pair of brackets containing more than one designator, for example, [5, 6, 7]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention involves a new system and process for automatically learning an object model for use in a vision-based tracking system. To address the issue of model fidelity with respect to specific target objects, the learned object model is automatically tailored to represent one or more specific target objects, such as, for example, specific aircraft, cars, people, animals, faces, or any other object in a temporal sequence of at least one image. Learning of the object model is accomplished by automatically determining probabilistic relationships between target state estimates produced by an initial generic tracking system and observations gathered from each image. The learned object model is then employed with a final tracking function to produce an improved tracking system more accurate than the initial generic tracking system.
In general, the system and method of the present invention automatically generates a reliable tracking system by using an initial object model in combination with an initial tracking function to process a temporal sequence of images, and a data acquisition function for gathering observations about each image. Further, in one embodiment, these observations are associated with a measure of confidence that represents the belief that the observation is valid. This measure of confidence may be used to weight the observations. Observations gathered by the data acquisition function are relevant to parameters or variables desired for a learned or final object model. These relevant observations may include information such as the color, shape, texture, size, or any other visual or geometric characteristics of a tracked object, and depend on the parameters necessary to support a known final tracking function. These relevant observations are used by a learning function in combination with the output of the initial tracking function for automatically learning an object model automatically tailored to a specific target object.
The final tracking function may be the same as the initial tracking function, or may be entirely different. For example, both the initial and final tracking function may use edge detection methods to locate target objects in an image. Alternately, the initial tracking function may use color-based detection methods while the final tracking function may use shape-based detection methods to locate target objects in an image. Thus, any type or combination of tracking methods may be used for the initial and final tracking functions.
Data output from the initial tracking function, in combination with the observations generated by the data acquisition function, are fed to the learning function. The learning function then processes the data and observations using a conventional learning method to learn a final object model. Such learning methods include, for example, neural networks, Bayesian belief networks (BBN), discrimination functions, decision trees, expectation-maximization on mixtures of Gaussians, probability distribution function (PDF) estimation through moment computation, PDF estimation through histograms, etc. Once the final object model is learned, the parameters defining this final object model are provided to the final tracking function which processes a temporal sequence of one or more images to accurately locate and track one or more target objects in each image.
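By way of illustration only, one of the learning methods listed above, PDF estimation through histograms, might be sketched as follows. The function name `learn_histogram_model`, the scalar observation values, the bin count, and the use of a per-observation weight are assumptions made for this sketch and are not prescribed by the present description.

```python
def learn_histogram_model(weighted_observations, num_bins, value_range):
    """Estimate a PDF over observation values as a normalized, weighted histogram.

    weighted_observations: iterable of (value, weight) pairs (hypothetical format).
    value_range: (low, high) bounds of the observed quantity.
    """
    low, high = value_range
    bin_width = (high - low) / num_bins
    histogram = [0.0] * num_bins
    for value, weight in weighted_observations:
        # Clamp the bin index so values at or beyond the bounds fall in an end bin.
        index = int((value - low) / bin_width)
        index = min(max(index, 0), num_bins - 1)
        histogram[index] += weight
    total = sum(histogram)
    if total == 0:
        raise ValueError("no weighted observations were provided")
    # Normalize so the bins sum to 1 and can be read as probabilities.
    return [count / total for count in histogram]


# Hypothetical intensity observations, each weighted by a confidence value.
observations = [(10, 0.5), (20, 0.5), (200, 1.0)]
model = learn_histogram_model(observations, num_bins=4, value_range=(0, 256))
```

In this sketch the learned object model is simply the list of bin probabilities; any of the other listed learning methods could stand in its place.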
The system and method of the present invention operates in two generic cases. First, the invention may be used to improve the tracking accuracy of a tracking system comprising an initial object model, and identical initial and final tracking functions, by automatically tailoring a final object model for use with the final tracking function to better represent one or more specific target objects in a sequence of at least one image. Second, the invention may be used to improve the accuracy of a tracking system comprising an initial object model, and different initial and final tracking functions, by automatically tailoring a final object model for use with the final tracking function to better represent one or more specific target objects in a sequence of at least one image.
Specifically, the system and method of the present invention includes an initial tracking function that accepts the parameters defining the initial model, in combination with one or more sequential images, and outputs a state estimate for each image. This state estimate is a probability distribution over the entire range of configurations that the target object may undergo, wherein higher probabilities denote a greater likelihood of the particular target object configuration. The target configuration typically contains not only position and orientation information about the target object, but also other parameters relevant to the geometrical configuration of the target object such as, for example, geometric descriptions of the articulation or deformation of non-rigid target objects. Multiple targets may be handled by assigning a separate tracking system to each target (where, for example, each tracking system may focus on a single local peak in the probability distribution), or by allowing separate tracking functions to generate a different probability distribution per image, based on distinct characteristics of each of the targets. In the case where multiple target objects are identified, individual object models are created or refined for each target object by individually processing each target object as described below for the case of a single target object. Alternatively, a single model representing all identified target objects may be created or refined, again, as described below for the case of a single target object.
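By way of illustration only, a state estimate of the kind described above might be represented as a discrete probability distribution over candidate configurations, with each local peak handed to a separate tracking system. The one-dimensional configuration space and the names `normalize_scores` and `local_peaks` are assumptions of this sketch.

```python
def normalize_scores(scores):
    """Convert non-negative per-configuration scores into probabilities summing to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [score / total for score in scores]


def local_peaks(distribution):
    """Return indices of strict local maxima in a 1-D probability distribution."""
    peaks = []
    last = len(distribution) - 1
    for i, p in enumerate(distribution):
        left = distribution[i - 1] if i > 0 else float("-inf")
        right = distribution[i + 1] if i < last else float("-inf")
        if p > left and p > right:
            peaks.append(i)
    return peaks


# Hypothetical tracker scores for six candidate horizontal positions.
state_estimate = normalize_scores([1, 4, 1, 0, 3, 1])
peaks = local_peaks(state_estimate)  # candidate targets, one per local peak
```

A real configuration space would typically also include orientation and articulation parameters; the single position axis here only keeps the example short.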
The data acquisition function collects observations or data from each image that will be useful in developing the final object model. This data acquisition function is specifically designed to collect observations relevant to the parameters required by the tracking function with which the learned object model will be used. Typically, the data acquisition function collects observations from the image over the entire configuration space of the target. However, in alternate embodiments, the region of configuration space over which observations are gathered is limited. Limiting the size of this region tends to reduce processing time. Thus, in one embodiment, the state estimate generated by the initial tracking function is used by the data acquisition function such that observations will be made regarding only those portions of the configuration space having a predefined minimum threshold probability of target object identification. In another embodiment, observations from the data acquisition function are collected in only those regions of the target configuration space which are likely to be occupied by the target based on methods such as, for example, dynamic target prediction. In each embodiment, the observations are then provided to the learning function.
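By way of illustration only, the embodiment in which observations are gathered only where the state estimate meets a predefined minimum threshold probability might be sketched as follows. The function name `acquire_observations`, the image-as-nested-list representation, and the reuse of the state-estimate probability as the confidence of each observation are assumptions of this sketch.

```python
def acquire_observations(image, state_estimate, min_probability):
    """Collect observations only at configurations the initial tracker finds likely.

    image: 2-D list of pixel values (hypothetical representation).
    state_estimate: dict mapping (row, col) to a probability.
    Returns a list of (configuration, observed_value, confidence) tuples.
    """
    observations = []
    for (row, col), probability in state_estimate.items():
        if probability >= min_probability:
            # Assumption: the tracker's probability doubles as the confidence weight.
            observations.append(((row, col), image[row][col], probability))
    return observations


image = [[10, 20],
         [30, 40]]
state_estimate = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.05}
observations = acquire_observations(image, state_estimate, min_probability=0.25)
```

Raising `min_probability` shrinks the region examined and so reduces processing time, at the cost of discarding observations from less certain configurations.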
In one example, the initial and final tracking functions use an identical head-pose tracking method. Thus, the data acquisition function may be designed to return observations such as, for example, eye, nose and mouth texture qualities with respect to an initial elliptical head model. In a second example, the initial tracking function is based on tracking of heads using a tracking function for detecting head shaped ellipses, and the final tracking function is based on tracking of heads based on detection of skin color. In this second example, the data acquisition function may be designed to return observations of color in particular regions of head shaped ellipses located by the initial tracking function.
As discussed previously, the learning function uses one of the aforementioned learning methods to automatically learn and output a final object model using a combination of the state estimates generated by the initial tracking function and the observations generated by the data acquisition function. Further, in one embodiment, the learning function also employs a partial or complete preliminary object model as a baseline to assist the learning function in better learning a probabilistically optimal object model. The preliminary object model is a tentative object model comprised of generic parameters designed to roughly represent an expected target object. The preliminary object model may be a complete or a partial model, or may initially be blank. One example of a partial object model, with respect to head tracking, is the back of the head, which is typically a relatively featureless elliptical shape having a relatively uniform color. The learning function combines this partial model with information learned about the sides and front of the head, based on data input to the learning function from the initial tracking function and the data acquisition function, to learn the final object model. However, while the use of the preliminary object model may allow the learning function to more quickly or more accurately learn a final object model, the use of a preliminary object model is not required for automatically learning an accurate object model.
In general, the learning function uses automated methods for identifying variable probabilistic dependencies between the state estimates, observations, and preliminary object model, if used, to discover new structures for a probabilistic model that is more ideal in that it better explains the data input to the learning function. Consequently, the learning function is able to learn the probabilistic model best fitting all available data. This probabilistic model is then used by the learning function to output the final object model. The variable probabilistic dependencies identified by the learning function tend to become more accurate as more information is provided to the learning function.
The initial tracking function and the data acquisition function preferably process a predetermined number of images before the learning function outputs the final object model. The number of images that must be processed before the learning function outputs a final object model is dependent upon the form of the initial tracking function. For example, where a motion-based initial tracking function is used, at least two sequential images will likely need to be processed by the initial tracking function and the data acquisition function before the learning function can output a learned final object model. However, where the initial tracking function uses color or edge-based detection techniques, the learning function can output a learned final object model after a single image has been processed.
The final object model learned by the learning function is comprised of the parameters required by the final tracking function to locate and track a target object in an image. Thus, the primary use for the final object model is to provide parameters to the final tracking function for use in processing one or more sequential images. However, the final object model may also be used in several other ways to improve overall tracking system accuracy.
First, in one embodiment, the learned final object model may be iteratively fed back into the learning function in place of the generic preliminary object model described above. This effectively provides a positive feedback for weighting parameters likely to be associated with either the target object or background within each image. Similarly, in the aforementioned embodiment where the preliminary object model is not used, the learned object model may also be iteratively provided to the learning function. Essentially, in either case, this iterative feedback process allows the learned object model to be iteratively provided to the learning function as soon as that object model is learned. The learning function then continues to learn and output a learned object model which evolves over time as more information is provided to the learning function. Consequently, over time, iterative feedback of the learned object model into the learning function serves to allow the learning function to learn an increasingly accurate final object model.
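By way of illustration only, the iterative feedback described above might be sketched by blending the previously learned model with the model learned from each new image. The blending rule, the `learning_rate` parameter, and the representation of a model as a list of bin probabilities are assumptions of this sketch; the present description does not prescribe a particular update rule.

```python
def update_model(previous_model, frame_model, learning_rate=0.5):
    """Blend the prior learned model with the model learned from the latest image."""
    if len(previous_model) != len(frame_model):
        raise ValueError("models must have the same number of parameters")
    blended = [
        (1.0 - learning_rate) * prev + learning_rate * new
        for prev, new in zip(previous_model, frame_model)
    ]
    total = sum(blended)
    if total == 0:
        raise ValueError("blended model has no probability mass")
    return [value / total for value in blended]  # keep it a valid distribution


def learn_over_frames(preliminary_model, frame_models, learning_rate=0.5):
    """Feed each learned model back in place of the preliminary model."""
    model = list(preliminary_model)
    for frame_model in frame_models:
        model = update_model(model, frame_model, learning_rate)
    return model


# A generic two-bin preliminary model refined by two hypothetical per-image models.
final_model = learn_over_frames([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]])
```

Where no preliminary object model is used, the first per-image model could serve as the starting point instead.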
Second, in an embodiment where the initial and final tracking functions are identical, the final object model output by the learning function may be used to iteratively replace the initial object model. In this manner, the accuracy of the state estimate generated by the initial tracking function is improved. Consequently, this more accurate state estimate, in combination with the observations generated by the data acquisition function, again allows the learning function to learn an increasingly accurate final object model.
Third, in still another embodiment, the two embodiments described above may be combined, where the initial and final tracking functions are identical, to iteratively replace both the initial object model and the generic preliminary object model with the final object model output by the learning function. In this manner, both the accuracy of the state estimate generated by the initial tracking function and the accuracy of the learning function are improved. Consequently, the more accurate state estimate, in combination with the more accurate learning function, again allows the learning function to learn an increasingly accurate final object model.
Fourth, in a further embodiment, where the initial and final tracking functions are different, the final object model may be used to iteratively replace the initial object model, while the final tracking function is used to replace the initial tracking function. In this manner, both the accuracy of the state estimate generated by the initial tracking function and the accuracy of the learning function are improved. Consequently, the more accurate state estimate, in combination with the more accurate learning function, again allows the learning function to learn an increasingly accurate final object model.
The final tracking function accepts the parameters defining the final object model, in combination with one or more sequential images and outputs either a final state estimate for each image, or simply target object configuration information with respect to each image. As with the state estimate output by the initial tracking function, this final state estimate is a probability distribution over the entire configuration range of the target wherein higher probabilities denote a greater likelihood of target object configuration. As discussed above, the final object model may be iteratively updated, thereby increasing in accuracy. Consequently, the accuracy of the state estimate or position information output by the final tracking function increases over time as the accuracy of the final object model increases.
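By way of illustration only, a color-based final tracking function that accepts a learned model and outputs a final state estimate might be sketched as follows. The function name `track_with_model`, the histogram form of the model, and the per-pixel configuration space are assumptions of this sketch.

```python
def track_with_model(image, model, value_range):
    """Score every pixel against a learned histogram model and normalize the scores.

    image: 2-D list of pixel values (hypothetical representation).
    model: list of bin probabilities for the target's appearance.
    Returns a dict mapping (row, col) to the probability the target is there.
    """
    low, high = value_range
    num_bins = len(model)
    bin_width = (high - low) / num_bins
    scores = {}
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            index = min(max(int((value - low) / bin_width), 0), num_bins - 1)
            scores[(r, c)] = model[index]
    total = sum(scores.values())
    if total == 0:
        raise ValueError("the model assigns zero likelihood to every pixel")
    return {config: score / total for config, score in scores.items()}


image = [[10, 200],
         [100, 210]]
learned_model = [0.0, 0.0, 0.0, 1.0]  # hypothetical: target appears only in the top bin
final_estimate = track_with_model(image, learned_model, value_range=(0, 256))
best_configuration = max(final_estimate, key=final_estimate.get)
```

Reporting only `best_configuration` corresponds to outputting target object configuration information rather than the full final state estimate.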
In a further embodiment of the present invention, the process described above for learning the final object model may be generalized to include learning of any number of subsequent or "final" object models. For example, the final object model and tracking function described above may be used as an initial starting point in combination with a subsequent data acquisition function and a subsequent learning function to learn a subsequent object model. Clearly, this process may be repeated for as many levels as desired to generate a sequence of increasingly accurate tracking systems based on increasingly accurate learned object models.
In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.