Person tracking is one of the fundamental problems in computer vision. There has been extensive work on tracking humans and other objects using visible-light video cameras, also referred to as red, green, blue (RGB) cameras. Despite much progress, human tracking remains a largely unsolved problem due to factors such as changing appearances, occlusions, motion of the camera and object, illumination variation, and background clutter. To deal with appearance ambiguities, a variety of methods exist that are based on techniques such as sparse representation, template selection and update, subspace-based tracking, and feature descriptors.
A fundamentally different approach to appearance ambiguities is to use multiple sensing modalities. One option for multimodal person tracking is to use a thermal infrared (IR) camera in combination with an RGB camera. However, widespread adoption of thermal imaging has been hampered by the prohibitively high cost of thermal infrared cameras. Herein, we use the term infrared and the abbreviation IR to refer solely to thermal infrared signals, and not to near-infrared (NIR) signals. We use the term RGB camera to refer to any video camera that operates in the visible range of the electromagnetic spectrum, which encompasses not only color cameras but also monochrome or grayscale cameras.
Information fusion across different modalities can be performed at various levels. For example, a low-level fusion approach can combine RGB and IR information at the pixel level, before features are determined. However, a large difference between the spatial and temporal resolutions of the RGB camera and the IR sensor precludes fusing low-level information. In a high-level fusion approach, a global decision might be reached after applying completely independent tracking in the two modalities.
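The difference between the two fusion levels can be illustrated with a minimal sketch. The function names, the blending weight, and the score-averaging rule below are assumptions chosen for illustration and are not taken from any of the systems discussed herein.

```python
def fuse_pixels(rgb_gray, ir, w_ir=0.5):
    """Low-level fusion: per-pixel weighted blend of two registered images.

    Both inputs are lists of rows of intensities. The blend is only defined
    when the two images have identical resolution (an assumption of this sketch).
    """
    same_shape = len(rgb_gray) == len(ir) and all(
        len(a) == len(b) for a, b in zip(rgb_gray, ir)
    )
    if not same_shape:
        raise ValueError("low-level fusion needs images of identical resolution")
    return [
        [(1 - w_ir) * a + w_ir * b for a, b in zip(row_rgb, row_ir)]
        for row_rgb, row_ir in zip(rgb_gray, ir)
    ]


def fuse_decisions(rgb_score, ir_score, threshold=0.5):
    """High-level fusion: each modality is processed independently and only
    the final confidence scores are combined (here, by a simple average)."""
    return (rgb_score + ir_score) / 2 >= threshold
```

Note that `fuse_pixels` fails as soon as the resolutions differ, whereas `fuse_decisions` never inspects the raw images and is therefore unaffected by a resolution mismatch.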
We now describe prior-art approaches to tracking using three types of setups: an RGB camera alone (RGB camera-only tracking), an IR sensor alone, or a combination of both the IR sensor and the RGB camera (RGB+IR).
RGB Camera-Only Tracking
We now describe three basic approaches to RGB camera-only tracking. In the first approach, known as visual tracking, a single object to be tracked is manually marked in the first image of a video sequence. Then, the appearance of the object and background in the first image, along with the subsequent video images, is used to track the object over the course of the sequence. However, visual tracking methods do not initialize tracks automatically, which is a problem for many real-world applications. Furthermore, visual tracking methods typically track only one object at a time, and tend to drift away from the target object over long sequences.
A second approach for RGB camera-only tracking, the “tracking-by-detection” approach, provides a more complete solution for multi-person tracking. Tracking-by-detection methods rely on a person detector to detect people in images, then use appearance and other cues to combine these detections into tracks. Such methods often use a relatively slow (not real-time) person detector and combine tracks in an offline process.
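The step of combining detections into tracks can be sketched as follows. This is a minimal illustration that assumes a person detector has already produced bounding boxes for every frame, and that links boxes using spatial overlap only; the appearance cues mentioned above are omitted, and the names `iou`, `link_detections`, and `min_iou` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames, min_iou=0.3):
    """Greedily link per-frame detections into tracks.

    frames: list (one entry per frame) of lists of boxes.
    Returns a list of tracks, each a list of (frame_index, box).
    """
    tracks = []
    active = []  # indices of tracks that were extended in the previous frame
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        next_active = []
        for ti in active:
            last_box = tracks[ti][-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                tracks[ti].append((t, best))
                unmatched.remove(best)
                next_active.append(ti)
        # Any detection left over starts a new track.
        for box in unmatched:
            tracks.append([(t, box)])
            next_active.append(len(tracks) - 1)
        active = next_active
    return tracks
```

Because all frames are available before linking begins, this sketch corresponds to the offline setting; a track simply ends when no sufficiently overlapping detection is found in the next frame.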
An alternative paradigm for RGB camera-only tracking integrates detection and tracking more tightly with an online procedure. Examples of this third paradigm include the “detect-and-track” approach, which uses a background model to detect candidate objects for tracking and couples detection and tracking in a feedback loop.
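One way such a feedback loop might look is sketched below. The sketch assumes a running-average background model and assumes that the feedback consists of excluding pixels currently claimed by a track from the background update; both choices, and all names and parameter values, are illustrative assumptions rather than a description of any particular published method.

```python
def detect_foreground(frame, background, thresh=25):
    """Mark pixels that differ from the background model by more than thresh."""
    return [
        [abs(p - b) > thresh for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def update_background(frame, background, tracked_mask, alpha=0.1):
    """Running-average background update with tracker feedback.

    Pixels where tracked_mask is True are left unchanged, so that a person
    who stands still is not gradually absorbed into the background.
    """
    return [
        [b if m else (1 - alpha) * b + alpha * p
         for p, b, m in zip(frame_row, bg_row, mask_row)]
        for frame_row, bg_row, mask_row in zip(frame, background, tracked_mask)
    ]
```

In an online loop, the mask returned by `detect_foreground` (or a mask derived from the current track positions) would be passed to `update_background` before the next frame is processed.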
IR-Only Tracking
Thermal IR imaging offers advantages in differentiating people from background by virtue of temperature differences. The simplest approach, which is widely adopted, uses intensity thresholding and shape analysis to detect and track people. Features traditionally used in RGB images, such as histograms of oriented gradients (HoG), and other invariant features, have been adapted to IR images for person detection. Background modeling in infrared can be combined with grouping analysis to perform long-term occupancy analysis.
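The thresholding approach can be sketched as follows. The sketch assumes a thermal image given as rows of temperature-like values, uses 4-connected region growing, and applies a minimum-area test as a very simple stand-in for shape analysis; the function name and parameter values are hypothetical.

```python
from collections import deque


def find_warm_blobs(image, thresh, min_area=2):
    """Threshold a thermal image and return connected warm regions.

    Returns a list of dicts with the pixel count and the bounding box
    (x_min, y_min, x_max, y_max) of each region of at least min_area pixels.
    """
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or image[y][x] < thresh:
                continue
            # Grow a 4-connected region from this warm seed pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and image[ny][nx] >= thresh):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                blobs.append({
                    "area": len(pixels),
                    "bbox": (min(xs), min(ys), max(xs), max(ys)),
                })
    return blobs
```

A practical system would typically add further shape tests, such as the aspect ratio of the bounding box, to reject warm objects that are not people.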
Tracking Using RGB+IR
Prior-art approaches differ in the level at which information from the IR and RGB streams is combined. Leykin and Hammoud, “Pedestrian tracking by fusion of thermal-visible surveillance videos,” Machine Vision and Applications, 2008, describe a system that combines RGB and IR information at a low level. Their system tracks pedestrians using input from RGB and thermal IR cameras to build a combined background model.
In contrast, the system of Davis et al., “Fusion-Based Background-Subtraction using Contour Saliency,” CVPR Workshop 2005, merges RGB and IR information at a mid level. Their system uses thermal and visible imagery for persistent object detection in urban settings. Statistical background subtraction in the thermal domain is used to identify initial regions of interest (ROIs). Color and intensity information is then used within these areas to obtain the corresponding regions of interest in the visible domain. Within each region, input and background gradient information are combined to form a contour saliency map.
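Statistical background subtraction of the general kind used for the thermal ROI step can be sketched as below. This is a generic per-pixel mean and standard deviation model written for illustration; it is not the formulation of Davis et al., and the threshold `k`, the floor `min_sd`, and the function names are assumptions.

```python
from statistics import mean, pstdev


def fit_background(frames):
    """Estimate a per-pixel mean and standard deviation from background frames."""
    h, w = len(frames[0]), len(frames[0][0])
    mu = [[mean([f[y][x] for f in frames]) for x in range(w)] for y in range(h)]
    sd = [[pstdev([f[y][x] for f in frames]) for x in range(w)] for y in range(h)]
    return mu, sd


def roi_mask(frame, mu, sd, k=3.0, min_sd=1.0):
    """Flag pixels that deviate from the background mean by more than k
    standard deviations. min_sd keeps perfectly static pixels from
    triggering on negligible noise."""
    return [
        [abs(frame[y][x] - mu[y][x]) > k * max(sd[y][x], min_sd)
         for x in range(len(frame[0]))]
        for y in range(len(frame))
    ]
```

The resulting mask marks candidate regions only; the subsequent use of visible-domain information and gradient-based contour saliency is not covered by this sketch.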
In yet another approach, the system of Zhao et al., “Human Segmentation by Fusing Visible-light and Thermal Imagery,” ICCV Workshop 2009, first tracks blobs independently in the output of the IR camera and the output of the RGB camera, and then merges the information at a high level to obtain a combined tracker.
In each of these prior-art approaches to tracking using RGB and IR cameras, the IR camera has approximately the same high frame rate as the RGB camera.
U.S. Pat. No. 4,636,774 uses a motion sensor to turn lights ON and OFF. However, that method cannot distinguish motion of people from other motions in the room, nor can it determine the number of people in a room.
U.S. Pat. No. 8,634,961 uses a visible light camera mounted on a fan to detect people and accordingly turn the fan ON and OFF.
U.S. Pat. No. 5,331,825 uses an infrared camera to detect people in a room and accordingly control an air conditioning system.