Public venues such as shopping centres, parking lots and train stations are increasingly subjected to surveillance with large-scale networks of video cameras. Application domains of large-scale video surveillance include security, safety, traffic management and business analytics. One of the key tasks in large-scale video surveillance is face verification, that is, matching faces captured by different cameras at different times and locations. Face verification is often required to match faces seen from different viewpoints or in different poses, because cameras at different locations often have different view angles and persons appearing in a camera's field of view may have different head poses. Matching faces from different camera viewpoints or in different poses is difficult. Face verification systems are generally developed for visible-to-visible face verification, which matches faces captured by visible cameras. A visible camera usually forms an image using visible light (in the wavelength range of 0.35 μm to 0.74 μm) with a charge coupled device (CCD) or a CMOS sensor. A visible image often has three colour channels. The image quality of a visible camera is heavily dependent on the illumination condition of the scene being imaged. A visible camera may fail to produce a good quality image in an environment with little or no illumination.
Cross-modality face verification matches faces captured by different sensor modalities at different times and locations. One example is thermal-to-visible face verification, that is, matching a face captured by a visible camera to a face captured by a thermal infrared (IR) camera. A thermal infrared camera forms an image using infrared radiation (in the wavelength range of 3 μm to 14 μm). A thermal infrared image has a single channel indicating the temperature distribution of the scene being imaged. A thermal infrared camera is able to work in a completely dark environment, regardless of the amount of ambient light present. Another example is depth-to-visible face verification, which matches a face captured by a visible camera to a face captured by a range imaging camera, which produces depth information on the scene being imaged. A time-of-flight (ToF) camera is a type of range imaging camera and produces depth information by measuring the time-of-flight of a light signal between the camera and the scene being imaged, based on the known speed of light. The image quality of a range imaging camera does not depend on illumination conditions. One application scenario of cross-modality face verification is a network of surveillance cameras with non-overlapping fields of view in a wide area installation such as an airport or train station. Suppose a person has performed a suspicious act, such as leaving an unattended bag, in darkness where only a thermal infrared camera or a time-of-flight camera is able to detect the person's face. The thermal infrared face image or depth face image is then used as a query to find this person in the views of visible cameras in the camera network. Compared to visible-to-visible face verification, cross-modality face verification is more challenging due to the large modality gap, that is, the very different sensor characteristics of the different sensor modalities, in addition to the different viewpoints or poses of faces.
Moreover, a thermal infrared image or a depth image is usually of considerably lower resolution than a visible image. Such a resolution gap makes cross-modality face verification even more challenging.
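The time-of-flight principle described above can be illustrated with a minimal sketch. It assumes that the camera measures the round-trip travel time of the light signal (camera to scene and back), so the depth is half the distance travelled. The function name `tof_depth_m` is hypothetical and used only for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact value by SI definition


def tof_depth_m(round_trip_time_s: float) -> float:
    """Depth of a scene point from the round-trip travel time of a light signal.

    Assumption: the measured time covers the path camera -> scene -> camera,
    so the one-way distance is half of the total distance travelled.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


# Hypothetical example: a round trip of 20 nanoseconds.
depth = tof_depth_m(20e-9)
```

Because light travels roughly 0.3 m per nanosecond, depth resolution at the scale of a face requires timing precision well below a nanosecond.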
One image processing method for thermal-to-visible face verification uses a partial least-squares discriminant analysis for modelling the modality difference in a latent space where the correlation between features extracted from thermal face images and features extracted from visible face images is maximised.
Another image processing method for thermal-to-visible face verification learns two dictionaries, each dictionary containing numerous basis atoms and providing a sparse representation for features extracted from a sensor modality. Each feature vector extracted from a sensor modality can be compactly represented as a weighted linear combination of basis atoms from a dictionary. The relationship between two sensor modalities is modelled by the difference between the weights for one dictionary and the weights for the other dictionary in dictionary learning.
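The idea of representing a feature vector as a sparse weighted combination of basis atoms can be sketched as follows. This is an illustration only: the dictionary is assumed to be already learned, and the weights are found with a simple greedy orthogonal matching pursuit, which is a technique chosen here for brevity and not necessarily the one used by the method described above. The names `sparse_weights`, `dictionary` and `n_nonzero` are hypothetical.

```python
import numpy as np


def sparse_weights(dictionary: np.ndarray, feature: np.ndarray, n_nonzero: int) -> np.ndarray:
    """Greedy orthogonal matching pursuit (illustrative, not the source's method).

    dictionary: (d, K) array whose columns are basis atoms.
    feature:    (d,) feature vector to be represented.
    Returns a (K,) weight vector with at most n_nonzero non-zero entries such
    that dictionary @ weights approximates feature.
    """
    residual = feature.astype(float).copy()
    support: list[int] = []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with what is still unexplained.
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Refit the weights of all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(dictionary[:, support], feature, rcond=None)
        residual = feature - dictionary[:, support] @ coef
    weights = np.zeros(dictionary.shape[1])
    weights[support] = coef
    return weights


# Hypothetical example: a feature built from two atoms of a 5-atom dictionary.
dictionary = np.eye(5)
feature = 2.0 * dictionary[:, 1] - 3.0 * dictionary[:, 3]
weights = sparse_weights(dictionary, feature, n_nonzero=2)
reconstruction = dictionary @ weights
```

In a two-dictionary setting such as the one described above, one such weight vector would be computed per modality, and the relationship between the modalities would be expressed in terms of the two weight vectors.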
In another image processing method for thermal-to-visible face verification, a feed-forward deep neural network is used to directly learn a non-linear mapping between two sensor modalities to bridge the modality gap while preserving the identity information. The objective function for finding the non-linear mapping is designed to minimise the perceptual difference between visible and thermal face images in the least mean square sense. The input to the deep neural network is an image captured by one sensor modality while the output of the deep neural network is an image in the other sensor modality.
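A least mean square objective of this kind can be sketched as below. The sketch assumes a single hidden layer with a tanh activation and a pixel-wise mean squared error as a stand-in for the perceptual difference; the layer sizes, the activation and the names `forward` and `mse_loss` are illustrative assumptions, not details of the method described above. Training (adjusting the weights to reduce the loss) is omitted.

```python
import numpy as np


def forward(x: np.ndarray, w1: np.ndarray, b1: np.ndarray,
            w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """One-hidden-layer feed-forward mapping (assumed architecture)."""
    hidden = np.tanh(x @ w1 + b1)  # non-linear hidden representation
    return hidden @ w2 + b2        # predicted image in the other modality


def mse_loss(predicted: np.ndarray, target: np.ndarray) -> float:
    """Mean of squared element-wise differences between two images."""
    return float(np.mean((predicted - target) ** 2))


# Hypothetical example with flattened 8x8 images and 16 hidden units.
rng = np.random.default_rng(0)
n_pixels, n_hidden = 64, 16
w1 = rng.standard_normal((n_pixels, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
w2 = rng.standard_normal((n_hidden, n_pixels)) * 0.1
b2 = np.zeros(n_pixels)

input_image = rng.random(n_pixels)   # image from one modality (flattened)
target_image = rng.random(n_pixels)  # same face in the other modality (flattened)
loss = mse_loss(forward(input_image, w1, b1, w2, b2), target_image)
```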
The above-mentioned image processing methods require face alignment to geometrically transform all the face images captured by different sensor modalities to a canonical frontal representation based on numerous facial landmarks located on the eyes, nose, and lips. The performance of automatic landmark localisation in visible face images is often dependent on the poses of faces and illumination conditions. If a pose of a face is very different from the frontal representation or the illumination is low in a visible face image, landmark localisation may produce erroneous facial landmarks. Automatic landmark localisation often cannot perform well on thermal face images mainly because some facial regions such as eyes and lips on a person's face may have approximately the same temperature distribution. These facial regions may have approximately the same intensity in the thermal face image so that facial landmarks detected in these facial regions are often inaccurate. Inaccurate facial landmarks introduce errors in transformed face images in the canonical frontal representation and consequently deteriorate the performance of face verification.
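The dependence of face alignment on landmark accuracy can be illustrated with a minimal sketch. It assumes that alignment is performed by a least-squares affine transform that maps detected landmarks onto canonical frontal landmark positions; the affine model and the names `estimate_affine` and `apply_affine` are illustrative assumptions, not details of the methods described above.

```python
import numpy as np


def estimate_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 2x3 affine matrix mapping src landmarks (N, 2) to dst (N, 2)."""
    ones = np.ones((src.shape[0], 1))
    homogeneous = np.hstack([src, ones])                       # (N, 3)
    params, *_ = np.linalg.lstsq(homogeneous, dst, rcond=None)  # (3, 2)
    return params.T                                            # (2, 3)


def apply_affine(matrix: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply a 2x3 affine matrix to (N, 2) points."""
    return points @ matrix[:, :2].T + matrix[:, 2]


# Hypothetical canonical positions: two eyes, nose tip, two mouth corners.
canonical = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0],
                      [35.0, 80.0], [65.0, 80.0]])
# Hypothetical detected landmarks: the canonical layout scaled and shifted.
detected = canonical * 0.5 + np.array([10.0, 20.0])

matrix = estimate_affine(detected, canonical)
aligned = apply_affine(matrix, detected)

# Perturbing one detected landmark changes the estimated transform, and
# therefore the position of every other point in the aligned face.
noisy = detected.copy()
noisy[0] += np.array([4.0, -3.0])
noisy_matrix = estimate_affine(noisy, canonical)
shift_of_other_points = apply_affine(noisy_matrix, detected[1:]) - aligned[1:]
```

Because the transform is fitted jointly to all landmarks, an error in a single landmark is spread across the whole aligned face image.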
A need exists to address problems relating to matching objects using different camera modalities and/or poses.