The present disclosure relates to a method and system for determining at least one property related to at least part of a real environment, comprising receiving image information of an image of a part of a real environment captured by a camera.
Computer vision methods that involve analysis of images are often used, for example, in navigation, object recognition, 3D reconstruction, camera pose estimation, and Augmented Reality applications, to name a few. Whenever a camera pose estimation, object recognition, object tracking, Simultaneous Localization and Mapping (SLAM) or Structure-from-Motion (SfM) algorithm is used in a dynamic environment where at least one real object is moving, the accuracy of the algorithm is often reduced significantly, with frequent tracking failures, despite the robust optimization techniques employed in the algorithms themselves. This is because many such computer vision algorithms assume a static environment in which the only moving object is the camera itself, whose pose may be tracked. This assumption is often violated, given that in many scenarios various moving objects could be present in the camera viewing frustum.
In such cases, the accuracy of camera pose tracking is reduced and, depending on the properties of the moving objects in the scene, tracking could be lost entirely (especially when the moving objects move in different directions). Furthermore, visual object recognition methods may fail if the object to be recognized is (partially) occluded by other objects, whether or not those objects move (e.g. failure may occur because the visual appearance of the occluding objects is taken as an input to the object recognition method).
In the case of localization, tracking and mapping approaches, image features originating from unreliable objects are commonly dealt with by using various robust optimization techniques. For instance, the camera pose can be optimized using a set of matches between 2D and 3D points. The derivative of the re-projection error of the matches with respect to the pose is readily available in the literature. The solution for the camera pose can be computed using the least squares method, but this technique is known to be very sensitive to outliers. In order to minimize the effect of outliers, one can use iteratively re-weighted least squares (IRLS), with M-estimator functions for weighting the re-projection error. There are also other approaches for dealing with outliers, such as RANSAC, least median of squares, etc. However, all of the mentioned approaches have certain limitations. For example, M-estimators can deal with outliers only up to a certain outlier/inlier ratio. In the case of RANSAC, if a number of objects are moving independently in the scene, there is a risk that the camera pose will not be estimated with regard to the desired object or environment, but with regard to a different object (e.g. a moving object that corresponds to an unreliable object).
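The re-weighting idea described above can be illustrated with a minimal sketch. It assumes a toy translation-only motion model instead of a full camera pose, and a Huber M-estimator with a hypothetical threshold `k`; the function names `huber_weight` and `irls_translation` are illustrative and do not come from any cited reference.

```python
import math


def huber_weight(residual_norm, k=1.0):
    """Huber M-estimator weight: 1 inside the threshold, k / |r| outside it."""
    return 1.0 if residual_norm <= k else k / residual_norm


def irls_translation(src, dst, k=1.0, iterations=20):
    """Estimate a 2D translation t such that dst is approximately src + t.

    Each iteration solves a weighted least-squares problem whose weights are
    recomputed from the residuals of the previous estimate (IRLS).
    Assumption: a translation-only model stands in for a full pose.
    """
    tx, ty = 0.0, 0.0
    for _ in range(iterations):
        weight_sum = wx = wy = 0.0
        for (sx, sy), (dx, dy) in zip(src, dst):
            # Residual of this match under the current estimate.
            rx, ry = dx - sx - tx, dy - sy - ty
            w = huber_weight(math.hypot(rx, ry), k)
            weight_sum += w
            wx += w * (dx - sx)
            wy += w * (dy - sy)
        # Closed-form weighted least-squares solution for a translation.
        tx, ty = wx / weight_sum, wy / weight_sum
    return tx, ty
```

With nine matches displaced by (2, -1) and a single match displaced by (50, 50), an unweighted mean of the displacements gives 6.8 in x, whereas the re-weighted estimate stays close to (2, -1). As noted above, this robustness only holds up to a limited share of outliers.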
The current state of the art includes many algorithms for detection and segmentation of dynamic (i.e. moving) objects in the scene. However, such approaches are usually computationally expensive and rely on motion segmentation and/or optical flow techniques. In general, a large number of frames is necessary to perform reliable moving object detection using such techniques. Further, there are methods for compressing video streams which commonly divide a scene into layers based on their depth or dynamic characteristics; see, e.g., the work by Adelson and Wang in reference [2]. These methods can also be used for detection and segmentation of independently moving objects in the scene. Further, there are a number of localization and mapping approaches that are designed for deployment in dynamic environments. These approaches are often based on the Structure-from-Motion algorithm, or are filter-based, e.g. based on the Kalman filter or the particle filter. The downside of dynamic SLAM approaches is increased complexity and computational cost. Further, dynamic SLAM approaches usually require a large number of frames to achieve reliable segmentation of moving objects in the scene.
Das et al. in reference [3] propose a method for detecting objects based on their surface temperature profiles. The idea implies static objects observed within the environment. Reference [3] does not envision the detection of independently moving objects for which a temperature profile description is given, nor the use of this information to aid either camera pose tracking or image recognition algorithms.
Adelson and Wang in reference [2] propose an algorithm for video compression based on segmenting an image into layers, each with a uniform affine motion. The algorithm utilizes an optical flow algorithm for estimating pixel-wise motion. Afterwards, image segments with uniform motion are extracted utilizing the k-means algorithm.
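The clustering step can be sketched as follows. This is a plain k-means (Lloyd's algorithm) over motion parameter vectors, assuming that one affine motion hypothesis (a, b, c, d, e, f) has already been estimated per image patch; the function name `kmeans`, the initial centers and the example values are hypothetical and are not taken from reference [2].

```python
def kmeans(points, centers, iterations=10):
    """Plain Lloyd's k-means on equal-length tuples.

    `centers` holds the initial cluster centers (an assumption of this
    sketch; reference [2] may initialize differently).
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(
                range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


# Hypothetical per-patch affine motions: three nearly static patches and
# three patches sharing a common translation of roughly (5, -3).
motions = [
    (1.0, 0.0, 0.0, 0.0, 1.0, 0.0),
    (1.0, 0.0, 0.1, 0.0, 1.0, -0.1),
    (1.0, 0.0, -0.1, 0.0, 1.0, 0.1),
    (1.0, 0.0, 5.0, 0.0, 1.0, -3.0),
    (1.0, 0.0, 5.1, 0.0, 1.0, -2.9),
    (1.0, 0.0, 4.9, 0.0, 1.0, -3.1),
]
labels, centers = kmeans(motions, centers=[motions[0], motions[3]])
```

Patches that receive the same label would then be grouped into one motion layer.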
Han and Bhanu in reference [6] propose a method for infrared and visible light image registration based on human silhouette extraction and matching. It is assumed that the imaging rig consists of two stationary cameras. Initially, the image background is modeled, assuming a normal distribution for each pixel in both the infrared and the visible light domain, which later enables simple human detection by a deviation from the modeled background.
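A per-pixel normal background model of this kind can be sketched as follows for a single image channel. This is a minimal illustration under stated assumptions: the function names, the 3-sigma threshold `k` and the noise floor `min_std` are hypothetical choices and are not taken from reference [6].

```python
import math


def fit_background(frames):
    """Per-pixel mean and standard deviation over equally sized 2D frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    mean = [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]
    std = [[math.sqrt(sum((f[r][c] - mean[r][c]) ** 2 for f in frames) / n)
            for c in range(cols)]
           for r in range(rows)]
    return mean, std


def foreground_mask(frame, mean, std, k=3.0, min_std=1.0):
    """Flag pixels deviating from the background mean by more than k sigma.

    `min_std` is an assumed noise floor that keeps pixels with almost no
    observed variation from being flagged by tiny intensity changes.
    """
    return [[abs(frame[r][c] - mean[r][c]) > k * max(std[r][c], min_std)
             for c in range(len(frame[0]))]
            for r in range(len(frame))]
```

With a stationary camera, `fit_background` would be run on frames showing only the background, and `foreground_mask` would then be applied to each new frame of the same channel.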
Hyung et al. in reference [7] propose a method for clustering 3D feature points into static and dynamic maps, and subsequently tracking a robot's position based only on the static cluster. Feature tracking is performed using the Joint Probabilistic Data Association Filter. Features are clustered based on their positions and angular velocities.
Del-Blanco et al. in reference [4] propose target detection and ego-motion estimation using forward-looking infrared (FLIR) imagery, with an emphasis on airborne applications. Initially, edges are extracted from the FLIR images using the Canny algorithm. Then, forward-backward tracking of the extracted edges is performed to extract reliable image features and their frame-to-frame displacements. Ego-motion, i.e. camera motion, is computed using the RANSAC and Least Median of Squares algorithms with a restrictive affine motion model. Once the camera motion is computed, the determined set of outliers is further clustered into separate targets based on feature connectivity.
Fablet et al. in reference [5] propose a cloud segmentation algorithm for infrared images. An affine motion model is estimated using a modified optical flow equation optimized via IRLS with M-estimators. The actual segmentation is achieved using Markov Random Field modeling.
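The ego-motion step can be illustrated with a minimal RANSAC sketch that fits an affine motion model to feature displacements and separates inliers (features consistent with the camera motion) from outliers (candidate targets). The function names, the threshold, the iteration count and the fixed random seed are assumptions for illustration; the sketch omits the Least Median of Squares variant and the subsequent clustering of outliers described in reference [4].

```python
import random


def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule; None if near-singular."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(rows)
    if abs(d) < 1e-9:
        return None  # The three sample points are (nearly) collinear.
    solution = []
    for col in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][col] = rhs[i]
        solution.append(det(m) / d)
    return solution


def fit_affine(src, dst):
    """Exact affine map u = a*x + b*y + c, v = d*x + e*y + f from 3 matches."""
    rows = [(x, y, 1.0) for x, y in src]
    abc = solve3(rows, [u for u, _ in dst])
    def_ = solve3(rows, [v for _, v in dst])
    if abc is None or def_ is None:
        return None
    return tuple(abc + def_)


def ransac_affine(src, dst, threshold=0.5, iterations=200, seed=0):
    """Return (inlier indices, outlier indices) of the best affine model."""
    rng = random.Random(seed)
    best = set()
    for _ in range(iterations):
        idx = rng.sample(range(len(src)), 3)
        model = fit_affine([src[i] for i in idx], [dst[i] for i in idx])
        if model is None:
            continue
        a, b, c, d, e, f = model
        inliers = set()
        for i, ((x, y), (u, v)) in enumerate(zip(src, dst)):
            err = ((a * x + b * y + c - u) ** 2
                   + (d * x + e * y + f - v) ** 2) ** 0.5
            if err < threshold:
                inliers.add(i)
        if len(inliers) > len(best):
            best = inliers
    return best, set(range(len(src))) - best
```

The returned outlier set corresponds to the features that are not explained by the estimated camera motion; grouping them into separate targets would be a further step.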
Tan et al. in reference [12] propose a modified PTAM (see reference [8]) approach for handling moving objects in the scene. Occluded points are detected using a heuristic algorithm that takes into account changes in the feature appearance and the geometric relation to the neighboring feature points. Points that are not found at their expected position and are not occluded are assumed to be outliers and are excluded from further localization and mapping. Further, the authors propose bin-based sampling and sample evaluation for RANSAC, where the bin fidelity is estimated based on the inlier/outlier ratio. This approach to excluding image features corresponding to moving objects is custom-built for PTAM-based tracking and mapping algorithms only.
A similar method is proposed by Shimamura et al. in reference [10]. In a freely moving camera scenario, outliers are detected by a robust pose optimization algorithm. Once the outliers are extracted, they are filtered to exclude outliers originating from repetitive textures or from a lack of texture. Afterwards, the optical flow vectors of the outliers are clustered using the expectation-maximization (EM) algorithm for parameter fitting of a Gaussian mixture model. The first problem with this approach is that it assumes that the number of outliers, i.e. points belonging to a moving object, is lower than the number of inliers. Further, the number of moving objects in the scene has to be known in order to initialize the EM algorithm.
Zou and Tan in reference [14] propose a collaborative approach to SLAM in dynamic environments by assuming a number of freely moving cameras. Pose estimation is performed by simultaneously optimizing the poses of all cameras and the 3D coordinates of dynamic points. In this manner, the poses of cameras observing largely dynamic parts of the scene can be optimized with regard to the cameras observing mostly static parts of the scene.