Many existing approaches for tracking an object throughout multiple video sequences recorded by different cameras operate by matching spatial, temporal, and appearance characteristics of the object. Spatial characteristics may include the position of an object and the size of an object. Temporal characteristics may include the time elapsed between an object leaving a field of view of one camera and the object appearing in a field of view of another camera. Appearance characteristics may include a summary of the luminance or chrominance values of image elements comprising the object. Such a summary may be presented, for example, as a histogram. A summary of an appearance of an object is known as a signature.
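By way of illustration only, a histogram-based signature might be sketched as follows. The function names `luminance_signature` and `histogram_intersection`, the assumption of 8-bit luminance values, the bin count, and the use of histogram intersection as a similarity measure are all hypothetical choices made for this example and are not prescribed by the approaches described above.

```python
from typing import List, Sequence


def luminance_signature(luma: Sequence[int], bins: int = 8) -> List[float]:
    """Summarise 8-bit luminance values (0..255) as a normalised histogram.

    Assumption: the object's image elements have already been segmented
    and their luminance values collected into `luma`.
    """
    if bins <= 0 or 256 % bins != 0:
        raise ValueError("bins must be a positive divisor of 256")
    width = 256 // bins
    counts = [0] * bins
    for value in luma:
        if not 0 <= value <= 255:
            raise ValueError("luminance values must lie in 0..255")
        counts[value // width] += 1
    total = len(luma)
    if total == 0:
        return [0.0] * bins
    # Normalise so that objects of different sizes remain comparable.
    return [count / total for count in counts]


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1] between two normalised histograms."""
    return sum(min(x, y) for x, y in zip(a, b))
```

Normalising by the number of image elements is one possible design choice; it keeps the signature independent of the apparent size of the object, which is treated above as a separate spatial characteristic.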
One approach for tracking an object using multiple cameras performs tracking independently for a field of view associated with each respective camera. Then, corresponding tracks from each camera representing the same real-world object are determined based on spatio-temporal correspondences. In one method, a spatio-temporal correspondence is maintained between a first location where an object exits the field of view of a first camera and a second location where the object enters the field of view of a second camera after some transit time. Pairs of tracks satisfying this spatio-temporal constraint are considered to correspond to the same real-world object. In multiple-camera tracking, if a pair of tracks satisfies the spatio-temporal constraint, matching signatures can be used to confirm that the tracks correspond. Otherwise, if the signature match is sufficiently strong, a new spatio-temporal constraint can be proposed.
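A spatio-temporal constraint of the kind described above might be sketched as follows. The class names `TrackEndpoint` and `SpatioTemporalConstraint`, the pixel-radius tolerance, and the transit-time window are hypothetical and are introduced only for illustration; the approaches described above do not specify how the constraint is represented or tested.

```python
from dataclasses import dataclass
from math import hypot
from typing import Tuple


@dataclass(frozen=True)
class TrackEndpoint:
    """Where and when a track left, or entered, a camera's field of view."""
    camera_id: str
    location: Tuple[float, float]  # image coordinates in pixels (assumed)
    time_s: float


@dataclass(frozen=True)
class SpatioTemporalConstraint:
    """Links an exit location in one camera to an entry location in another."""
    exit_camera: str
    exit_location: Tuple[float, float]
    entry_camera: str
    entry_location: Tuple[float, float]
    min_transit_s: float
    max_transit_s: float
    radius_px: float  # hypothetical tolerance around each location

    def _near(self, a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        return hypot(a[0] - b[0], a[1] - b[1]) <= self.radius_px

    def is_satisfied(self, exit_point: TrackEndpoint,
                     entry_point: TrackEndpoint) -> bool:
        """True if the two track endpoints are consistent with this constraint."""
        if exit_point.camera_id != self.exit_camera:
            return False
        if entry_point.camera_id != self.entry_camera:
            return False
        transit = entry_point.time_s - exit_point.time_s
        if not self.min_transit_s <= transit <= self.max_transit_s:
            return False
        return (self._near(exit_point.location, self.exit_location)
                and self._near(entry_point.location, self.entry_location))
```

In such a sketch, a pair of tracks for which `is_satisfied` returns true would be a candidate correspondence, which a signature comparison could then confirm.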
In the last-seen signature method, corresponding tracks are determined based on the last-seen signature of an object. Thus, only a single signature is required to store the representation of a given track. However, the last-seen signature method cannot compensate for differences in pose due to the position of the camera relative to the orientation of the tracked object. Consider an example where the principal axes of a first and a second camera are orthogonal to each other and a person leaves the field of view of the first camera and later enters the field of view of the second camera. As the person exits the field of view of the first camera, the last-seen signature of the track may represent a profile view of the person, but as the person enters the field of view of the second camera, a frontal view of the person is seen, producing a different signature. This can lead to errors in determining corresponding tracks and is, therefore, a disadvantage in using the last-seen signature method.
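The pose problem described above can be illustrated with a minimal sketch. The class `LastSeenTrack`, the L1 distance measure, the threshold value, and the three-bin signatures are hypothetical and are chosen only to make the example concrete.

```python
from typing import List, Optional, Sequence


class LastSeenTrack:
    """Stores a single signature per track: the most recently observed one."""

    def __init__(self, track_id: int) -> None:
        self.track_id = track_id
        self.signature: Optional[List[float]] = None

    def observe(self, signature: Sequence[float]) -> None:
        # Each new observation overwrites the previous signature.
        self.signature = list(signature)


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def tracks_correspond(first: LastSeenTrack, second: LastSeenTrack,
                      max_distance: float = 0.5) -> bool:
    """Compare last-seen signatures; `max_distance` is an assumed threshold."""
    if first.signature is None or second.signature is None:
        return False
    return l1_distance(first.signature, second.signature) <= max_distance


# Hypothetical signatures of the same person seen from two poses.
frontal = [0.1, 0.3, 0.6]
profile = [0.8, 0.2, 0.0]

camera_one_track = LastSeenTrack(1)
camera_one_track.observe(frontal)
camera_one_track.observe(profile)   # profile view as the person exits

camera_two_track = LastSeenTrack(2)
camera_two_track.observe(frontal)   # frontal view as the person enters

same_object = tracks_correspond(camera_one_track, camera_two_track)
```

With these assumed values, `same_object` is false even though both tracks belong to the same person, because the earlier frontal signature in the first camera was overwritten by the profile signature.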
An alternative approach for determining corresponding tracks is the exemplar method. The exemplar method associates a predetermined set of signature exemplars with each track. Ideally, each exemplar represents a different appearance of the object. Exemplars may be determined by performing clustering of previously-seen signatures of an object. For example, one exemplar may represent a frontal view of a person, whilst another exemplar may represent one or more profile views of the person, and yet another exemplar may represent the rear view of the person. A disadvantage of the exemplar method is that the number of exemplars may have a fixed upper bound, and therefore the correct choice of the number of exemplars is critical to determining a representative set of appearances. For example, incorrectly selecting the number of exemplars may lead to two distinct clusters of appearances being represented by one exemplar which is not actually representative of either cluster. A further disadvantage is computational cost: when matching a given signature with a set of exemplars, the given signature is compared to each exemplar in turn, which can be computationally expensive.
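The exemplar method and its sensitivity to the number of exemplars might be sketched as follows. The use of k-means clustering with a farthest-point initialisation, the squared Euclidean distance, and the names `build_exemplars` and `match_distance` are assumptions made for this example; the exemplar method described above does not specify a particular clustering algorithm or distance measure.

```python
from typing import List, Sequence

Signature = Sequence[float]


def squared_distance(a: Signature, b: Signature) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_signature(members: Sequence[Signature]) -> List[float]:
    count = len(members)
    return [sum(column) / count for column in zip(*members)]


def build_exemplars(signatures: Sequence[Signature], k: int,
                    iterations: int = 10) -> List[List[float]]:
    """Cluster previously-seen signatures into k exemplars (simple k-means)."""
    if not 1 <= k <= len(signatures):
        raise ValueError("k must be between 1 and the number of signatures")
    # Deterministic farthest-point initialisation (an assumed choice).
    exemplars = [list(signatures[0])]
    while len(exemplars) < k:
        farthest = max(
            signatures,
            key=lambda s: min(squared_distance(s, e) for e in exemplars),
        )
        exemplars.append(list(farthest))
    for _ in range(iterations):
        clusters: List[List[Signature]] = [[] for _ in exemplars]
        for signature in signatures:
            nearest = min(
                range(k),
                key=lambda i: squared_distance(signature, exemplars[i]),
            )
            clusters[nearest].append(signature)
        # An empty cluster keeps its previous exemplar.
        exemplars = [
            mean_signature(cluster) if cluster else exemplar
            for cluster, exemplar in zip(clusters, exemplars)
        ]
    return exemplars


def match_distance(signature: Signature,
                   exemplars: Sequence[Signature]) -> float:
    """Compare the signature to each exemplar in turn; cost grows with k."""
    return min(squared_distance(signature, e) for e in exemplars)
```

For instance, given two hypothetical frontal signatures `[0.9, 0.1]` and `[0.8, 0.2]` and two hypothetical profile signatures `[0.1, 0.9]` and `[0.2, 0.8]`, calling `build_exemplars` with `k=2` yields one exemplar near each group, whereas `k=1` yields a single exemplar near `[0.5, 0.5]`, which resembles neither view.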
Thus, a need exists to provide an improved system and method for multi-camera object tracking.