Public venues such as shopping centres, parking lots and train stations are increasingly subject to surveillance by large-scale networks of video cameras for applications such as security, safety, traffic management, and business analytics. In these applications, the surveillance system often captures close-up images of objects, such as humans, animals, or inanimate objects, in the area or persistently tracks the movement of a suspicious object. In order to persistently track an object, a camera in the surveillance system follows the movement of the object on site. When the object is about to move out of the physical viewing limit of the camera, a second camera in the same network is assigned responsibility to track the object. The change in responsibility from the first camera to the second camera is often referred to as a “handoff” process. The handoff process typically happens between cameras with overlapping fields of view. If the fields of view of the cameras do not overlap, either spatially or temporally, a similar process called “object re-identification” may be performed. A key task in handoff or re-identification is to perform rapid and robust object matching from images of the objects captured by the two cameras.
Object matching from different camera viewpoints is difficult because different cameras often operate under different lighting conditions. Moreover, different objects may have a similar visual appearance, or the same object (e.g., a person or a subject) can have a different pose and posture across viewpoints.
One image processing method performs appearance-based object matching, which involves determining visual features of a query object from a first view, determining the same type of visual features of a candidate object from a second view, and then computing the difference between the two sets of visual features. If the difference is smaller than a threshold, the query object and the candidate object are said to match. Otherwise, the query object and the candidate object do not match.
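The threshold test described above can be sketched as follows. This is a minimal illustration only: the source does not name a feature type or a difference measure, so the Euclidean distance, the function names `feature_distance` and `is_match`, and the example threshold are all assumptions.

```python
import math


def feature_distance(query, candidate):
    """Euclidean distance between two feature vectors of the same type.

    The distance measure is an assumption; any measure of the difference
    between the two sets of visual features could be substituted.
    """
    if len(query) != len(candidate):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(query, candidate)))


def is_match(query, candidate, threshold):
    """Declare a match when the feature difference is below the threshold."""
    return feature_distance(query, candidate) < threshold


# Hypothetical three-bin colour histograms from two camera views.
query_features = [0.20, 0.50, 0.30]
similar_candidate = [0.25, 0.45, 0.30]
different_candidate = [0.70, 0.10, 0.20]

print(is_match(query_features, similar_candidate, threshold=0.1))
print(is_match(query_features, different_candidate, threshold=0.1))
```

In practice the threshold would have to be tuned for the chosen features and camera pair, because lighting differences between views shift the distances.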
The visual features of an object can be computed from a single frame (i.e., image) or they can be computed from multiple frames. In single-shot object re-identification, a single image of the candidate object is matched to a single image of the query object. In multiple-versus-single object re-identification, multiple images of the candidate object are available to be matched against a single image of the query object. The multiple images of an object are often captured by tracking the object across multiple video frames. Similarly, single-versus-multiple object re-identification involves matching a single candidate image with multiple images of the query object. Finally, multiple-versus-multiple object re-identification involves having multiple images of both the candidate and the query objects.
Having multiple images of an object of interest can be advantageous over having only one view of the object, especially if the appearance of the object changes while the object is moving. For example, a pedestrian may look different at different time instances during a walking cycle. Another example where multiple images are advantageous is when an object changes orientation during motion. In this case, the front and side views of the object may appear different. However, comparing multiple views of the candidate object to multiple views of the query object can also lead to more computation. Moreover, in the presence of outlier frames, such as frames affected by temporary occlusion, multiple-versus-multiple object re-identification can lead to more confusion.
To reduce object matching computation, one method combines the visual features computed from multiple frames into a single set of visual features. One way to combine features from multiple frames is to accumulate them, whereby each feature is averaged towards its frequently occurring value across multiple frames. The averaging operation also has a de-noising effect. However, the accumulated features can also be corrupted by a small number of outlier instances, whose feature values are far beyond the normal expected range. For an object with a consistent appearance across multiple frames, the de-noising effect yields diminishing returns after a small number of frames (e.g., 4 or 5 frames) are averaged.
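The accumulation step and its sensitivity to outliers can be illustrated with a short sketch. The element-wise mean used here is one simple reading of "accumulate"; the function name `accumulate_features` and the feature values are hypothetical.

```python
def accumulate_features(frames):
    """Combine per-frame feature vectors into one vector by element-wise mean."""
    if not frames:
        raise ValueError("at least one frame is required")
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


# Four frames of a consistently appearing object (hypothetical values).
clean_frames = [
    [0.50, 0.20],
    [0.52, 0.18],
    [0.48, 0.22],
    [0.50, 0.20],
]

# One outlier frame, for example from a temporary occlusion.
outlier_frame = [5.00, 3.00]

print(accumulate_features(clean_frames))
print(accumulate_features(clean_frames + [outlier_frame]))
```

With the clean frames alone, the accumulated features stay close to the per-frame values. Adding the single outlier frame moves both accumulated values far away from every clean frame, which is the corruption described above.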
Another known method uses a subset of frames instead of a whole image sequence of tracked objects. Landmark frames are detected using motion activity around the object of interest. Short video fragments around landmark frames are then used for object re-identification. Assuming the object of interest is a pedestrian and each image in the sequence is cropped to a tight bounding box around the tracked person, motion activity in the bottom half of each image correlates well with the person's walking cycle. A trough in the motion activity around the leg area corresponds to the time instance when the legs are furthest apart. A peak in the motion activity of the leg area corresponds to the time instance when the legs are momentarily together. Matching objects around landmark frames ensures that the objects are in the same pose, which increases the chance of a correct matching score. However, due to background motion and dynamic occlusion, motion activity often does not peak and trough at the desired time instances. As a result, motion-based landmark frame selection is often not reliable in a crowded scene.
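A rough sketch of motion-based landmark frame selection follows. The source does not say how motion activity is measured, so the mean absolute frame difference over the bottom half of each crop is an assumption, as are the function names `leg_motion_activity` and `local_extrema`. Frames are assumed to be equally sized greyscale crops stored as lists of pixel rows.

```python
def leg_motion_activity(frames):
    """Mean absolute difference between consecutive crops, bottom half only.

    Returns one value per consecutive frame pair, so the result has
    len(frames) - 1 entries.
    """
    activity = []
    for previous, current in zip(frames, frames[1:]):
        start = len(current) // 2  # first row of the bottom half
        total = 0
        count = 0
        for row_prev, row_curr in zip(previous[start:], current[start:]):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
        activity.append(total / count if count else 0.0)
    return activity


def local_extrema(signal):
    """Return (peaks, troughs) as index lists of strict local extrema."""
    peaks, troughs = [], []
    for i in range(1, len(signal) - 1):
        if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]:
            peaks.append(i)
        elif signal[i] < signal[i - 1] and signal[i] < signal[i + 1]:
            troughs.append(i)
    return peaks, troughs


# Hypothetical activity signal over a short track.
peaks, troughs = local_extrema([1, 3, 1, 0, 2, 5, 2])
print(peaks, troughs)
```

Because the measure sums every pixel change in the lower half of the crop, motion from the background or from an occluding passer-by contributes as much as the legs do, which is why the extrema become unreliable in a crowded scene.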
Yet another known method uses object tilt to cluster frames with a similar viewpoint from the camera. Multiple appearances of the object are obtained from the corresponding viewpoint clusters for re-identification purposes. An upright pedestrian, for example, is often captured as tilted by a high-mounted downward-looking camera if the person is not on the principal axis of the camera. When the person suddenly turns, his or her apparent tilt changes and so does his or her viewpoint from the camera. By assuming the object appearance is similar at the same viewpoint (i.e., orientation) from the camera, one can cluster frames based on the person's tilt. However, this indirect clustering method is limited to high-mounted downward-looking surveillance cameras only.
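Tilt-based frame clustering can be sketched as below. The source does not specify the clustering algorithm or how tilt is estimated, so this sketch assumes a per-frame tilt angle in degrees is already available and uses fixed-width binning as a simple stand-in for clustering. The function name `cluster_by_tilt` and the bin width are hypothetical.

```python
import math
from collections import defaultdict


def cluster_by_tilt(tilts_deg, bin_width_deg=10.0):
    """Group frame indices whose apparent tilt falls in the same angular bin.

    Fixed-width binning is an assumed stand-in for the clustering step.
    Returns a dict mapping bin number to the list of frame indices in it.
    """
    if bin_width_deg <= 0:
        raise ValueError("bin width must be positive")
    clusters = defaultdict(list)
    for index, tilt in enumerate(tilts_deg):
        clusters[math.floor(tilt / bin_width_deg)].append(index)
    return dict(clusters)


# Hypothetical per-frame tilt angles of a tracked pedestrian, in degrees.
tilts = [2.0, 4.5, 13.0, 3.1, 14.2, -6.0]
print(cluster_by_tilt(tilts))
```

Each resulting group would then supply one appearance of the object for matching, on the assumption stated above that appearance is similar at the same viewpoint.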
Accordingly, there exists a need for an improved object re-identification method.