There is a large class of applications that depend upon the ability to localize a model of an object in an image, a task known as “registration.” These applications can be roughly categorized into detection, alignment, and tracking problems.
Detection problems involve, for example, finding objects in image databases or finding faces in surveillance video. The model in a detection problem is usually generic, describing a class of objects. For example, in a prior art face detection system, the object model is a neural network template that describes all frontal, upright faces. See Rowley et al., “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), pages 23-38, January 1998. Another example is locating armored vehicles in images for a military targeting system.
An example of an alignment application is mosaicing, in which a single large image is constructed from a series of smaller overlapping images. In this application, each model is simply an image to be added incrementally to the mosaic. The alignment goal is to position each new image so that it is consistent with the current mosaic wherever the two overlap. A description is given in Irani et al., “Mosaic based representations of video sequences and their applications,” Proceedings of Int. Conference on Computer Vision, pages 605-611, Cambridge, Mass., 1995.
Another example is the alignment of plural images obtained from different sensors, e.g. aligning remote-sensed images obtained via normal and infra-red photography, or aligning MRI and SPECT medical images. This allows different regions of an image to be analyzed via multimodal (i.e., vector) measurements instead of scalar pixel intensities. These and other applications are further discussed in the survey on image registration, Brown, “A survey of image registration techniques,” ACM Computing Surveys, 24(4), pages 325-376, 1992.
In tracking applications, the models are typically specific descriptions of an image object that is moving through a video sequence. Examples include tracking people for surveillance or user-interface purposes. In figure tracking for surveillance, a stick-figure model of a person evolves over time, matched to the location of a person in a video sequence. A representative prior method is Cham et al., “A multiple hypothesis approach to figure tracking,” Proceedings Computer Vision and Pattern Recognition, pages 239-245, Fort Collins, Colo., 1999. In user-interface applications, the user's gaze direction or head pose may be tracked to determine their focus-of-attention. A prior method is described in Oliver et al., “LAFTER: Lips and face real time tracker,” Proceedings Computer Vision and Pattern Recognition, pages 123-129, San Juan, PR, Jun. 17-19, 1997.
In each of these application areas, there is a desire to handle increasingly sophisticated object models, which is fueled by the increasing demand for sensing technologies. For example, modern user interfaces may be based on tracking the full-body pose of a user to facilitate gesture recognition. As the complexity of the model increases, the computational cost of registration rises dramatically. A naive registration method such as exhaustive search would result in a slow, inefficient system for a complex object like the human figure. However, a fast and reliable solution would support advanced applications in content-based image and video editing and retrieval, surveillance, advanced user-interfaces, and military targeting systems.
Therefore, there is a need for a registration method which is computationally efficient in the presence of complex object models.
The invention describes a method for efficiently tracking object models in a video or other image sequence.
Accordingly, tracking an object model in a sequence of frames where the object model comprises a plurality of features and is described by a model state, includes both selecting an unregistered feature of the object model and selecting an available frame from the sequence of frames, to minimize a cost function of a subsequent search. A search is performed for a match of the selected model feature to the image in the selected frame in order to register the feature in that frame. The model state is then updated for each available frame. The steps of selecting, searching and updating are repeated.
In an embodiment where at any given time only one frame is available, and where frames are available in sequential order, features of the object model are iteratively registered in the available frame. Each iteration includes the steps of selecting, searching, and updating with respect to the available frame. This step is terminated, and the next frame is acquired. A state prior is predicted for the next frame, using a most recent state update. Finally, the steps of iteratively registering, terminating, acquiring and predicting are repeated. Upon each repetition, features are registered responsive to the state prior predicted by the previous repetition.
Iteratively registering features can include selecting an unregistered feature of the object model to minimize a cost function of a subsequent search. A search is performed for a match of the selected model feature to the image to register the feature. The model state is updated. Finally, the steps of selecting, searching and updating are repeated.
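The select, search, update cycle can be sketched as a greedy loop. The Python below is only an illustration under stated assumptions: the names `register_features`, `toy_cost`, `toy_search` and `toy_update`, the three-feature chain, and the factor by which a registration shrinks the next feature's search cost are all hypothetical and are not taken from the described method.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]  # hypothetical state: one search-cost value per feature


def register_features(
    features: List[str],
    cost: Callable[[str, State], float],
    search: Callable[[str, State], Optional[float]],
    update: Callable[[State, str, float], State],
    state: State,
) -> Tuple[State, List[str]]:
    """Repeat select / search / update until every feature has been tried."""
    unregistered = list(features)
    order: List[str] = []
    while unregistered:
        # Select: the unregistered feature whose next search is cheapest.
        chosen = min(unregistered, key=lambda f: cost(f, state))
        # Search: try to match the chosen feature given the current state.
        match = search(chosen, state)
        # Update: fold the match (if any) back into the model state.
        if match is not None:
            state = update(state, chosen, match)
        unregistered.remove(chosen)
        order.append(chosen)
    return state, order


# Toy model (assumption): registering a feature sharpens the next one in a chain.
CHILD = {"torso": "arm", "arm": "hand"}


def toy_cost(feature: str, state: State) -> float:
    return state[feature]


def toy_search(feature: str, state: State) -> Optional[float]:
    return 0.0  # pretend a match is always found


def toy_update(state: State, feature: str, match: float) -> State:
    new_state = dict(state)
    child = CHILD.get(feature)
    if child is not None:
        new_state[child] = state[child] * 0.25  # assumed shrink factor
    return new_state


final_state, order = register_features(
    ["torso", "arm", "hand"], toy_cost, toy_search, toy_update,
    {"torso": 2.0, "arm": 6.0, "hand": 5.0},
)
print(order)
```

With these toy numbers the arm initially costs more to search than the hand, but registering the torso first lowers the arm's cost, so the arm is selected second. This is the point of recomputing costs after each state update rather than fixing the order in advance.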
A list of model features to be matched is maintained. Each listed model feature is associated with an indicator which provides an indication as to whether the respective model feature is available for matching. A feature is marked as unavailable when it is matched. All features are marked as available upon the acquisition of a new frame.
Determining when to advance to a next frame may be based on, for example, the number of unmatched model features, or the amount of time elapsed while iteratively registering features for a current frame.
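A minimal sketch of the feature list with availability indicators, together with one possible frame-advance test, might look as follows. The class name `FeatureList`, the function `should_advance`, and the default limits (zero unmatched features, a time budget of one thirtieth of a second) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureList:
    """Model features, each with a flag saying whether it may still be matched."""
    names: List[str]
    available: Dict[str, bool] = field(init=False)

    def __post_init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # On acquisition of a new frame, every feature becomes available again.
        self.available = {name: True for name in self.names}

    def mark_matched(self, name: str) -> None:
        # A matched feature is no longer available in the current frame.
        self.available[name] = False

    def unmatched_count(self) -> int:
        return sum(self.available.values())


def should_advance(
    features: FeatureList,
    elapsed_s: float,
    max_unmatched: int = 0,              # assumed limit
    time_budget_s: float = 1.0 / 30.0,   # assumed per-frame budget
) -> bool:
    """Advance when few enough features remain or the time budget is spent."""
    return (features.unmatched_count() <= max_unmatched
            or elapsed_s >= time_budget_s)
```

Either criterion alone would also be consistent with the text; combining them with a logical OR is a design choice of this sketch.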
In a particular embodiment of the present invention, a list of <feature, frame> pairs which have been matched is maintained.
In at least one embodiment, all frames of the sequence of frames are available.
In one embodiment, for each available frame in the sequence, features are extracted from the frame, and searching for a match employs feature-to-feature matching.
In another embodiment, searching for a match employs feature-to-image matching.
In one feature-to-image matching embodiment, each available frame in the sequence is preprocessed, and the number of image regions to search is restricted. Preprocessing may include identifying regions of at least one predetermined color, for example a skin color, such that restricting the number of image regions to search comprises searching only the identified regions.
Alternatively, preprocessing may include examining the local spatial-frequency content of the frame's image, and identifying regions in which to search based on the local spatial-frequency content.
All steps may be performed off-line.
A search window may be defined which specifies a range of frames from which a feature can be selected. The search window may include all available frames, or it may include a subset, such as five frames, including the most recently acquired frame.
In one embodiment, the feature associated with a lowest cost is selected.
Alternatively, any feature which is associated with a cost which is less than some threshold may be selected. For each unregistered feature of each available frame, a cost is determined of search operations required to find a match with at least a predetermined probability, until a feature is found which has an associated cost less than the threshold, and that feature is selected. If no feature is found which has an associated cost less than the threshold, then a feature with the lowest determined cost may be selected.
To select a feature, a list of features is maintained. A minimum cost, such as −1, is assigned to a feature which has an associated cost less than the predetermined threshold. The list is then ordered according to the determined cost, such that a feature with the lowest determined cost is listed at the top of the list.
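The threshold-based selection and the cost-ordered list can be sketched as below. The function names `select_feature` and `order_by_cost` and the example costs are hypothetical; the floor value of -1 follows the example given in the text.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple


def select_feature(
    candidates: Iterable[Hashable],
    cost_fn: Callable[[Hashable], float],
    threshold: float,
) -> Tuple[Optional[Hashable], float, int]:
    """Return (feature, cost, number of costs evaluated).

    Costs are evaluated one at a time and evaluation stops at the first
    feature cheaper than the threshold. If none qualifies, the feature with
    the lowest evaluated cost is returned instead.
    """
    best, best_cost, evaluated = None, float("inf"), 0
    for candidate in candidates:
        cost = cost_fn(candidate)
        evaluated += 1
        if cost < threshold:
            return candidate, cost, evaluated
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost, evaluated


def order_by_cost(
    costs: Dict[Hashable, float], threshold: float, floor: float = -1.0
) -> List[Hashable]:
    """Assign the floor cost to below-threshold features, then sort ascending."""
    keyed = {f: (floor if c < threshold else c) for f, c in costs.items()}
    return sorted(keyed, key=keyed.get)


costs = {"a": 9.0, "b": 3.0, "c": 1.0}
print(select_feature(costs, costs.get, threshold=4.0))
print(order_by_cost(costs, threshold=4.0))
```

The early exit trades optimality for speed: in the example, "b" is accepted after two cost evaluations even though "c" is cheaper, because "b" is already below the threshold.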
After searching for a match of the selected feature and updating the model, the cost is recalculated only for features affected by the state update.
The threshold may be, for example, a predetermined threshold, or it may be an adaptive threshold.
An update window may be defined which specifies a range of frames for which the model state is updated. The update window may be centered around a frame in which the most recent matching occurred, or alternatively, may span several multiples of a dominant time constant, for example, between two and seven multiples.
In one embodiment of the present invention, all steps are performed on-line. The frames are provided by a source such as a video source, e.g., a video camera. A signal may be provided to the video source to acquire a next frame. Frames may be provided from the video source at a fixed rate, for example, 30 frames per second.
Upon the acquisition of a new frame from the source, a determination is made as to whether to use the new frame. If it is determined to use the new frame, a new state vector is added to a state sequence, and initialized based on a previous set of measurements.
In one embodiment, the sequence of frames is a video sequence, and features may be attributes of an object appearance.
However, the image in each frame is not necessarily a video or even a picture-based image. In one embodiment, for example, the sequence of frames is an audio sequence, and features may be elements of a speech signal. Thus, the “image” is an audio image.
In yet another embodiment, the sequence of frames is a sequence of genetic data, and features may be biological markers. Here, the “image” may be considered to be the genetic data.
In at least one embodiment, the cost function is based on the feature's basin of attraction, and may be further based on the complexity of searching at each basin of attraction.
Preferably, searching is performed in a region of high probability of a match. A search region may be based on a projected state probability distribution.
Searching is based on maximizing a comparison function.
Selecting and searching are preferably responsive to a propagated state probability distribution. The state probability distribution is preferably projected into feature space.
In at least one embodiment, selecting includes determining, for each unregistered feature, the number of search operations required to find a match with at least a predetermined probability, and selecting a feature requiring the least number of search operations.
Determining the number of search operations for a feature may include determining search regions within a feature space, where each region has an associated probability density. The number of required search operations is then computed based on the determined search regions.
Determining search regions may include finding search regions within the feature space such that each region's associated probability density exceeds a predetermined threshold. The probabilities associated with each of the found search regions are summed to form a total probability. While the total probability is less than a predetermined probability, the threshold is lowered and the steps of finding search regions and summing the probabilities are repeated.
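The threshold-lowering procedure can be illustrated on a one-dimensional, discretized feature space. This is a sketch under assumptions: the density is given as per-bin probabilities, the threshold is lowered by a fixed factor (`shrink`, hypothetical), and adjacent selected bins are merged into one region.

```python
from typing import List, Sequence, Tuple


def find_search_regions(
    probs: Sequence[float], target: float, shrink: float = 0.5
) -> Tuple[List[Tuple[int, int]], float]:
    """Return (regions, total probability) covering at least `target` mass.

    Each region is an inclusive (first_bin, last_bin) pair.
    """
    if not 0.0 < shrink < 1.0:
        raise ValueError("shrink must lie strictly between 0 and 1")
    if not 0.0 < target <= sum(probs):
        raise ValueError("target must be positive and at most the total mass")

    threshold = max(probs)
    while True:
        # Find bins whose probability exceeds the current threshold.
        selected = [i for i, p in enumerate(probs) if p > threshold]
        # Sum the probability captured by those bins.
        total = sum(probs[i] for i in selected)
        if total >= target:
            break
        threshold *= shrink  # not enough mass yet: lower the threshold

    # Merge runs of adjacent bins into contiguous search regions.
    regions: List[Tuple[int, int]] = []
    for i in selected:
        if regions and regions[-1][1] == i - 1:
            regions[-1] = (regions[-1][0], i)
        else:
            regions.append((i, i))
    return regions, total


regions, total = find_search_regions(
    [0.05, 0.4, 0.3, 0.0, 0.05, 0.2], target=0.85
)
print(regions, total)
```

For a multimodal density such as the one in the example, the result is two disjoint regions, which is why the number of search operations is computed per region and then combined.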
Searching may include feature-to-feature matching, where the number of search operations is the number of target features located within each search region. The number of target features located within the search region may be based on Mahalanobis distances to potential target features.
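As an illustration of counting candidate target features with Mahalanobis distances, the sketch below works in a two-dimensional feature space. The gate of three standard deviations and the function names are assumptions, not values given in the text.

```python
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]
Matrix2 = Sequence[Sequence[float]]


def mahalanobis_sq(x: Point, mean: Point, cov: Matrix2) -> float:
    """Squared Mahalanobis distance of x from mean under a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Uses the closed-form inverse (1/det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def count_candidates(
    targets: Iterable[Point], mean: Point, cov: Matrix2, gate: float = 3.0
) -> int:
    """Count targets inside the gate; each one costs a search operation."""
    limit = gate * gate
    return sum(1 for t in targets if mahalanobis_sq(t, mean, cov) <= limit)


cov = [[4.0, 0.0], [0.0, 1.0]]
targets = [(2.0, 0.0), (0.0, 3.0), (6.1, 0.0), (0.0, 3.5)]
print(count_candidates(targets, (0.0, 0.0), cov))
```

A feature whose predicted location is tightly constrained has a small gate region and therefore few candidates, which makes it cheap to match.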
Target features may be approximately uniformly distributed, and the number of features may be proportional to the search region's size. The features are then ranked according to the sizes of the associated search regions.
Searching may alternatively include feature-to-image matching, wherein computing the number of required search operations includes, for each search region, dividing the region into minimally-overlapping volumes having a same size and shape as a basin of attraction associated with the feature, and counting the number of such volumes required to cover the regions.
Counting volumes may be approximated by obtaining eigenvalues and eigenvectors to a covariance matrix associated with the feature search region, calculating a basin of attraction span for each eigenvector direction, and approximating the count responsive to the eigenvalues and the spans.
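One way to read this approximation is sketched below for a two-dimensional feature space. Several choices here are assumptions and not stated in the text: the search region is taken to extend `n_sigma` standard deviations along each eigenvector, the basin of attraction is modeled as an axis-aligned ellipse with given semi-axes, and the count is the product over eigenvector directions of the region extent divided by the basin span, rounded up.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]


def symmetric_eig_2x2(cov: Sequence[Sequence[float]]) -> List[Tuple[float, Vec]]:
    """Eigenvalues and unit eigenvectors of a symmetric 2x2 matrix."""
    (a, b), (_, d) = cov
    mean, diff = (a + d) / 2.0, (a - d) / 2.0
    radius = math.hypot(diff, b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # angle of the major axis
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    # Clamp tiny negative values caused by rounding.
    return [(max(mean + radius, 0.0), v1), (max(mean - radius, 0.0), v2)]


def basin_span(direction: Vec, semi_axes: Vec) -> float:
    """Width of an axis-aligned elliptical basin measured along `direction`."""
    ux, uy = direction
    rx, ry = semi_axes
    return 2.0 / math.sqrt((ux / rx) ** 2 + (uy / ry) ** 2)


def approx_volume_count(
    cov: Sequence[Sequence[float]], semi_axes: Vec, n_sigma: float = 2.0
) -> int:
    """Approximate number of basin-sized volumes needed to cover the region."""
    count = 1
    for eigenvalue, eigenvector in symmetric_eig_2x2(cov):
        extent = 2.0 * n_sigma * math.sqrt(eigenvalue)  # region width on this axis
        span = basin_span(eigenvector, semi_axes)
        count *= max(1, math.ceil(extent / span))
    return count


print(approx_volume_count([[4.0, 0.0], [0.0, 1.0]], semi_axes=(0.5, 0.5)))
```

The product form avoids explicitly tiling the region, so the cost of evaluating a feature depends only on the dimension of the feature space and not on the size of the region.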
Model states may be updated according to a propagated state probability distribution. Furthermore, the propagation of the probability distribution may be based on successive registered features.
The state probability model may have a Gaussian distribution, and may be propagated using a Kalman filter update step.
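For a Gaussian state, the standard Kalman measurement update can be written in a few lines. The scalar form below is a simplification for illustration; the function name `kalman_update` and the example numbers are hypothetical, and a full tracker would use the matrix form over the whole model state.

```python
from typing import Tuple


def kalman_update(
    mean: float, var: float, z: float, meas_var: float, h: float = 1.0
) -> Tuple[float, float]:
    """Scalar Kalman measurement update.

    mean, var : prior state estimate and its variance
    z         : measurement (the registered feature position)
    meas_var  : measurement noise variance
    h         : measurement model, z = h * state + noise
    """
    innovation = z - h * mean             # measurement minus prediction
    innovation_var = h * var * h + meas_var
    gain = var * h / innovation_var       # Kalman gain
    new_mean = mean + gain * innovation
    new_var = (1.0 - gain * h) * var      # variance never grows after an update
    return new_mean, new_var


print(kalman_update(mean=0.0, var=4.0, z=2.0, meas_var=4.0))
```

Each registered feature reduces the state variance, which in turn shrinks the search regions of the features that remain unregistered.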
In one embodiment, the step of repeating continues only until a predetermined level of certainty in an estimate of the model is achieved, such that some of the available features are not registered.
A training set of registration tasks may be provided, in which case, for each registration task, an optimal feature ordering is determined by performing the steps of selecting, searching, updating and repeating. Responsive to the optimal feature orderings, a fixed ordering is determined. Finally, the object model is registered in the image using the fixed ordering.
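The text does not specify how the per-task orderings are combined into a fixed ordering. One plausible choice, shown purely as an assumption, is to sort features by their mean rank across the training orderings; the function name `fixed_ordering` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def fixed_ordering(orderings: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Combine per-task feature orderings into one ordering by mean rank."""
    rank_sum: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for ordering in orderings:
        for rank, feature in enumerate(ordering):
            rank_sum[feature] += rank
            counts[feature] += 1
    # Lower mean rank means the feature tended to be registered earlier.
    # Ties are broken by feature name so the result is deterministic.
    return sorted(rank_sum, key=lambda f: (rank_sum[f] / counts[f], str(f)))


training = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(fixed_ordering(training))
```

A fixed ordering removes the cost of feature selection at run time, at the price of no longer adapting the order to the current state distribution.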
A system for tracking an object model in a sequence of frames includes a feature selection module which selects an unregistered feature of the object model and an available frame from the sequence, to minimize a cost function of a subsequent search, a search module which searches for a match of the selected model feature to the image to register the feature, and an update module which updates the model state for each available frame based on the match found by the search module.
An embodiment in which, at any given time, only one frame is available, frames being available in sequential order, further includes an acquisition module for acquiring sequence frames, and a process control module which signals the feature selection module to terminate, and which signals the acquisition module to make available a next frame.
In each feature matching cycle, the feature with the smallest matching ambiguity among all model features in all frames maintained in a video store is selected. This process is known as “spatiotemporal feature selection.”
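Spatiotemporal feature selection amounts to a minimum over all unregistered (frame, feature) pairs. The sketch below assumes the matching ambiguities have already been computed and stored in a dictionary; the name `select_pair` and the example values are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Pair = Tuple[int, str]  # (frame index, feature name)


def select_pair(
    ambiguity: Dict[Pair, float], registered: Set[Pair]
) -> Optional[Pair]:
    """Pick the unregistered (frame, feature) pair with the least ambiguity."""
    candidates = [pair for pair in ambiguity if pair not in registered]
    if not candidates:
        return None  # everything in the video store is already registered
    return min(candidates, key=lambda pair: ambiguity[pair])


ambiguity = {
    (0, "head"): 3.0, (0, "hand"): 7.0,
    (1, "head"): 1.5, (1, "hand"): 6.0,
}
print(select_pair(ambiguity, registered=set()))
```

Note that the selected pair need not lie in the earliest frame: here the head in frame 1 is chosen before anything in frame 0 because it is the least ambiguous match available.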
After registration of the selected feature, the matching ambiguities of unregistered features in neighboring past and future frames may be changed through smoothing dynamics and may have to be recomputed for the following feature selection cycle.
When computing the matching ambiguities for all features in each feature selection cycle is too costly, “threshold” and “invariance” heuristics are used to update matching ambiguities.
In one embodiment, the video store may contain only the frame corresponding to the current time instance. In this case, the framework used is based on iterated sequential feature selection. A process control module determines when the next frame should be acquired into the video store.
In a second embodiment, the video store may contain the entire video sequence. In this case, tracking is off-line and the spatiotemporal feature selection is performed across all video frames for maximum tracking efficiency. The state sequence update module computes the model states in all frames simultaneously, and also applies smoothing according to the dynamic model.
In a third embodiment, the video store contains a small number of previous frames and the current frame. Here, tracking is done online. The spatiotemporal feature selection spans only a subset of the entire video sequence at each feature matching cycle. The tradeoff is the reduced efficiency of tracking.
In addition, when directly matching features to images, efficiency can be further enhanced by using a preprocessing step which determines which portions of a video frame should be searched.
This application is related to U.S. application Ser. No. 09/466,975, filed Dec. 20, 1999, and Ser. No. 09/466,970 filed Dec. 20, 1999, the entire teachings of which are incorporated herein by reference.