In recent years, 3D cameras based on the Time-of-Flight (ToF) principle have become commercially available. Unlike 2D cameras, which provide only a gray-scale or colour image of the scene, they measure, for each pixel, the radial distance from a point in the scene to the camera. On the other hand, ToF cameras have a much lower resolution than common 2D cameras, and the range measurement is affected by noise. Therefore, many research and development activities target fusing the data of a 2D and a 3D camera in order to profit from the respective strengths of the different sensor technologies. In the context of the present document, data fusion designates fusion of raw data, i.e. a low-level procedure, as opposed to higher fusion levels in which the fusion deals with post-processed data (feature or decision fusion). A possible application is, e.g., image matting (separation of background and foreground). In this case, the background and/or the foreground of a 2D image may be identified based on the range information of the 3D image (see, e.g., [1]). Other research activities target enhancing the accuracy and resolution of a 3D camera by fusing the range data with a high-resolution 2D image (see, e.g., [2] and [3]).
Raw data fusion requires accurate pixel alignment between the recorded data of the individual sensors. This alignment, also called data matching, comprises mapping the two individual data sets onto a common image coordinate grid, which is defined with respect to a unified reference frame. The relationship between the individual sensor reference frames and the unified reference frame (which may coincide with one of the sensor reference frames) determines in this case the mapping of the two data sets onto the common image grid, i.e. the data matching.
A particular problem occurs if the reference frames of the two sensors are not co-centric, i.e. if the two cameras are displaced with respect to each other, which is typically the case. Due to the relative displacement of the two cameras, the locations of the projections of a 3D point of the scene onto the individual sensors differ by a shift that is known in the field of stereo vision as binocular disparity. This disparity shift depends on the distance from the imaged point in the scene to the camera. The correspondence between the pixels of the 2D and the 3D camera is therefore not a fixed relationship but depends on the objects in the scene. Thus, the mapping of the data onto the common grid depends on the distances in the scene and has to be re-calculated whenever the scene changes, which is typically the case for every frame of data acquisition.
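The dependence of the disparity on distance can be illustrated with a minimal sketch. It assumes an idealised, rectified pair of pinhole cameras with equal focal length and a purely lateral displacement, for which the disparity is the focal length times the baseline divided by the depth. The function name `disparity_px` and its parameters are hypothetical and serve only as illustration; they do not describe a particular camera system.

```python
def disparity_px(depth_m: float, focal_px: float, baseline_m: float) -> float:
    """Disparity (in pixels) of a scene point seen by two displaced cameras.

    Assumption: idealised rectified pinhole pair, i.e. parallel optical axes,
    identical focal length (in pixels) and a purely lateral baseline.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    # Closer points shift more between the two images than distant ones.
    return focal_px * baseline_m / depth_m


if __name__ == "__main__":
    # Hypothetical values: 500 px focal length, 5 cm baseline.
    for depth in (0.5, 1.0, 2.0, 5.0):
        print(depth, disparity_px(depth, focal_px=500.0, baseline_m=0.05))
```

Under these assumptions the shift shrinks as the point moves away from the cameras, which is why a single fixed pixel-to-pixel mapping cannot serve a scene containing objects at different distances.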
In stereo vision, the problem is known as the correspondence problem. Its solution provides a so-called disparity map, which allows the calculation of the distances of object points [6,7]. The detection of corresponding points is typically performed by feature matching or correlation analysis of two stereo images. These methods are numerically demanding and may fail in case of shadow effects, unstructured scenes, or periodic patterns.
The matching of 2D camera data with data from a 3D sensor also requires dealing with the correspondence problem. Besides the fact that stereo vision techniques are numerically demanding, their application is rendered difficult, if not impossible, when the resolutions and the types of data of the two sensor data sets are different. This is the case, however, for a sensor system comprising a low-resolution 3D sensor and a high-resolution 2D camera system.
It should be noted that in stereo vision the correspondence problem is solved (by feature matching or correlation analysis) in order to determine the distances of the corresponding points in the scene. In the case of data fusion of a 2D and a 3D image of the same scene, the aim is not to extract the distances based on the disparity map. Rather, as the data captured by the 3D camera contain distance information on the scene, the disparities between the projections on the different sensors can be estimated from that information. The disparity map can then be used to identify corresponding pixels in the two images.
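One way to picture how range data yield pixel correspondences is the following sketch, which is an illustration under stated assumptions and not necessarily the procedure of the present document. It assumes pinhole models for both cameras with known intrinsic matrices, a known rotation and translation from the 3D camera frame to the 2D camera frame, and no lens distortion. The function name `tof_to_2d_pixels` and all parameter names are hypothetical.

```python
import numpy as np


def tof_to_2d_pixels(radial, K_tof, K_cam, R, t):
    """Map every ToF pixel to its corresponding position in the 2D image.

    radial : (H, W) radial distances measured by the 3D camera, in metres.
    K_tof, K_cam : (3, 3) pinhole intrinsic matrices (assumed known).
    R, t : rotation (3, 3) and translation (3,) from the ToF camera frame
           to the 2D camera frame (assumed known from calibration).
    Returns an (H, W, 2) array of (u, v) coordinates in the 2D image.
    """
    h, w = radial.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)

    # Viewing ray of each ToF pixel, normalised to unit length because the
    # camera measures radial distance rather than depth along the axis.
    rays = pix @ np.linalg.inv(K_tof).T
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    pts_tof = rays * radial[..., None]   # 3D points in the ToF frame
    pts_cam = pts_tof @ R.T + t          # same points in the 2D camera frame
    proj = pts_cam @ K_cam.T             # perspective projection
    return proj[..., :2] / proj[..., 2:3]


if __name__ == "__main__":
    # Hypothetical setup: identical intrinsics, 5 cm lateral displacement.
    K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
    uv = tof_to_2d_pixels(np.full((5, 5), 2.0), K, K,
                          np.eye(3), np.array([0.05, 0.0, 0.0]))
    print(uv[2, 2])  # position of the central ToF pixel in the 2D image
```

The difference between the returned coordinates and the original pixel grid plays the role of the disparity map, here obtained without any feature matching or correlation analysis. Because it depends on the measured distances, it would have to be recomputed for every frame.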