Accurate motion estimation of an object, such as a vehicle or robot, from a video of the nearby environment, e.g., a road or garage, acquired by a camera mounted on the vehicle is an important problem in vehicle and robot navigation. Most conventional methods use either a camera model (monocular or stereo) or a motion model (planar or non-planar). To determine a relative motion of the vehicle with respect to the environment from a sequence of images, methods that use a minimal number of feature correspondences in a hypothesize-and-test framework, such as random sample consensus (RANSAC), produce accurate results in the presence of outliers.
Dense depth estimation from video sequences acquired by a vehicle-mounted camera can be extremely useful for safety applications, such as detecting people and obstacles near moving vehicles, particularly when the vehicle is backing up in constricted environments such as garages, loading docks, driveways, parking lots, and roads generally.
Minimal Solutions
Nistér's well-known five-point method in a RANSAC framework is the preferred method for motion estimation in the presence of outliers. In the case of relative motion between two cameras, there are six degrees of freedom (DOF) in the motion parameters: three DOF for rotation and three DOF for translation. For conventional cameras with a single center of projection, only five parameters can be determined, i.e., the translation can only be determined up to a scale. Accordingly, a minimum of five feature correspondences is needed to determine the motion parameters.
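The scale ambiguity can be illustrated with the epipolar constraint x2^T [t]x R x1 = 0, which is linear in the translation t. The following is a minimal sketch in Python and is not the five-point method itself. The helper names, the chosen rotation, the convention X2 = R X1 + t, and the sample point are assumptions made only for illustration.

```python
import math

def rot_y(theta):
    """Rotation by theta (radians) about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def epipolar_residual(R, t, x1, x2):
    """Evaluate x2^T [t]_x R x1, which is zero for a correct motion (R, t)."""
    return dot(x2, cross(t, mat_vec(R, x1)))

def project(X):
    """Pinhole projection to normalized homogeneous image coordinates."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

# Hypothetical ground-truth motion and scene point (illustrative values only).
R = rot_y(0.1)
t = [0.3, -0.1, 1.0]
X1 = [1.0, 0.5, 6.0]
X2 = [a + b for a, b in zip(mat_vec(R, X1), t)]  # X2 = R X1 + t
x1, x2 = project(X1), project(X2)

res_true = epipolar_residual(R, t, x1, x2)
res_scaled = epipolar_residual(R, [3.0 * v for v in t], x1, x2)
print(res_true, res_scaled)  # both are zero up to rounding
```

Because the residual vanishes for t and for 3t alike, the image measurements alone cannot fix the length of the translation, which leaves five recoverable parameters.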
The feature correspondences can be obtained using Harris corners, a Kanade-Lucas-Tomasi tracker (KLT), or a scale-invariant feature transform (SIFT), for example. Usually, minimal approaches lead to a finite number of solutions for the motion, and the correct motion is selected based on physical constraints or additional point correspondences.
Minimal solutions are known for several calibration and 3D reconstruction problems: auto-calibration of radial distortion, the perspective three-point problem, the five-point relative pose problem, the six-point focal length problem, the six-point generalized camera problem, the nine-point problem for estimating para-catadioptric fundamental matrices, the nine-point radial distortion problem, point-to-plane registration using six correspondences, pose estimation for stereo setups using either points or lines, and pose estimation for monocular setups using both points and lines.
Restricted Motion Models
The relative motion of the camera is usually constrained by the associated application. For example, a camera mounted on a vehicle does not generally have all 6 DOF. If the traveling surface is planar, the camera has only three DOF (two DOF of translation and one DOF of rotation).
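The effect of the planar constraint on the epipolar geometry can be sketched as follows. This is an illustrative example under assumed conventions: the y axis is taken as the normal of the traveling surface, the rotation is about y, the translation lies in the x-z plane, and the translation is fixed to unit length because a single camera does not observe its scale. The function name planar_essential is hypothetical.

```python
import math

def planar_essential(theta, phi):
    """Essential matrix E = [t]_x R for rotation theta about the y axis and
    unit translation at angle phi from the z axis within the x-z plane."""
    c, s = math.cos(theta), math.sin(theta)
    R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    tx, tz = math.sin(phi), math.cos(phi)
    # Skew-symmetric matrix of t = (tx, 0, tz).
    t_cross = [[0.0, -tz, 0.0], [tz, 0.0, -tx], [0.0, tx, 0.0]]
    return [[sum(t_cross[i][k] * R[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

E = planar_essential(0.2, 0.5)
for row in E:
    print(["%.4f" % v for v in row])
```

Under these assumptions the entries (1,1), (1,3), (2,2), (3,1), and (3,3) of E vanish, so far fewer unknowns remain than in the general case.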
Scaramuzza et al. have shown that motion can be parameterized using only one parameter for a certain class of vehicles, bicycles, and robots. Thus a 1-point method can be used. The underlying idea is that there exists an instantaneous center of rotation (ICR), and the vehicle follows a circular course around the ICR.
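A one-parameter motion of this kind admits a closed-form solution from a single correspondence. The following sketch is derived under assumed conventions and is not necessarily the formulation used by Scaramuzza et al.: the y axis is normal to the ground plane, points transform as X2 = R X1 + t, and, for a circular arc, the translation direction makes an angle of theta/2 with the optical axis, where theta is the rotation angle. With these assumptions the epipolar constraint reduces to tan(theta/2) = (u2 v1 - u1 v2) / (v1 + v2). The function name and sample values are hypothetical.

```python
import math

def one_point_rotation(x1, x2):
    """Rotation angle theta from one correspondence of normalized image
    coordinates (u, v), assuming circular planar motion about the y axis.
    Correspondences with v1 + v2 == 0 are degenerate for this formula."""
    (u1, v1), (u2, v2) = x1, x2
    num = u2 * v1 - u1 * v2
    den = v1 + v2
    if den < 0.0:  # keep theta / 2 within [-pi/2, pi/2]
        num, den = -num, -den
    return 2.0 * math.atan2(num, den)

# Synthetic check: move a hypothetical scene point with a known circular motion.
theta_true, rho = 0.2, 1.5  # rotation angle (rad) and chord length
c, s = math.cos(theta_true), math.sin(theta_true)
X1 = (1.0, 0.5, 6.0)
X2 = (c * X1[0] + s * X1[2] + rho * math.sin(theta_true / 2.0),
      X1[1],
      -s * X1[0] + c * X1[2] + rho * math.cos(theta_true / 2.0))
x1 = (X1[0] / X1[2], X1[1] / X1[2])
x2 = (X2[0] / X2[2], X2[1] / X2[2])

theta_est = one_point_rotation(x1, x2)
print(theta_true, theta_est)
```

Because each hypothesis needs only one correspondence, the angle could also be computed for every correspondence and the consensus value selected, for example by voting over the estimates.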
When an inertial measurement unit (IMU) is available, two measurement angles can be obtained using a gravity vector. The remaining unknowns are three parameters (1 DOF of rotation and 2 DOF of translation), which can be solved by a three-point motion estimation method using a quartic equation. This motion estimation method can be useful for cameras in hand-held digital devices, such as cellular telephones.
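One way to use the gravity vector is to rotate each camera frame so that the measured gravity direction coincides with a fixed axis, which removes two rotational unknowns and leaves only the rotation about that axis. The sketch below is an assumption-laden illustration, not the three-point method itself: it takes the y axis as the target direction, uses the Rodrigues formula for the rotation between two unit vectors, and the function name and sample reading are hypothetical.

```python
import math

def rotation_aligning_gravity(g):
    """Return a 3x3 rotation R such that R applied to g / |g| gives (0, 1, 0)."""
    n = math.sqrt(sum(x * x for x in g))
    a = [x / n for x in g]
    c = a[1]  # cosine of the angle between a and the y axis
    if c < -1.0 + 1e-12:  # gravity along -y: rotate 180 degrees about x
        return [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
    v = [-a[2], 0.0, a[0]]  # a x (0, 1, 0), the unnormalized rotation axis
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    # Rodrigues formula for rotating unit vector a onto unit vector b.
    return [[(1.0 if i == j else 0.0) + K[i][j] + K2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]

g = [0.5, 9.7, -1.2]  # hypothetical accelerometer reading in the camera frame
R = rotation_aligning_gravity(g)
norm = math.sqrt(sum(x * x for x in g))
aligned = [sum(R[i][j] * g[j] / norm for j in range(3)) for i in range(3)]
print(aligned)  # approximately (0, 1, 0)
```

After both frames are aligned in this way, the relative rotation between them is a rotation about the y axis only, which corresponds to the single remaining rotational DOF.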
Another method uses a 2-point motion estimation method for planar motion sequences. This is applicable to indoor robot ego-motion estimation when the camera mounted on the robot moves on a plane. The number of degrees of freedom is three (1 DOF of rotation and 2 DOF of translation). However, the relative motion can be recovered only up to a scale. In the RANSAC framework, the number of iterations required is usually smaller when the number of points required to determine the motion decreases. Given the complexity of the equations, that method determines the solutions iteratively with a Newton-Raphson algorithm, which is time-consuming and not amenable to real-time applications.
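The dependence of the iteration count on the sample size can be made concrete with the standard RANSAC estimate N = log(1 - p) / log(1 - w^s), where s is the number of points per hypothesis, w is the inlier ratio, and p is the desired probability of drawing at least one all-inlier sample. The following is a minimal sketch; the function name and the values of w and p are assumptions chosen for illustration.

```python
import math

def ransac_iterations(sample_size, inlier_ratio, confidence=0.99):
    """Number of random samples needed so that, with probability `confidence`,
    at least one sample of `sample_size` points contains only inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    p_good = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# Compare 5-point, 3-point, 2-point, and 1-point solvers at 50% inliers.
counts = {s: ransac_iterations(s, 0.5) for s in (5, 3, 2, 1)}
for s, n in counts.items():
    print(s, n)
```

The count grows roughly exponentially with the sample size for a fixed inlier ratio, which is why solvers that need fewer points are attractive when outliers are frequent.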
Simultaneous Localization and Mapping (SLAM)
SLAM uses a motion model to smooth the trajectory of the camera and to constrain the search area for feature correspondences for 3D environment reconstruction. SLAM is a method for fusing inertial measurements with visual feature observations. The current camera pose, as well as the 3D positions of visual landmarks, are jointly estimated. SLAM-based methods account for the correlations that exist between the pose of the camera and the 3D positions of the observed features. However, properly treating these correlations is computationally expensive, and thus performing vision-based SLAM in environments with thousands of features is problematic for real-time applications.
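The cost of tracking these correlations can be illustrated by counting the entries of a joint covariance matrix. The sketch below assumes a filter-style formulation that maintains one dense covariance over a 6-DOF camera pose and all 3D landmarks. That formulation, the dimensions, and the function name are assumptions for illustration, and other SLAM formulations scale differently.

```python
def joint_covariance_entries(num_landmarks, pose_dim=6, landmark_dim=3):
    """Entries in a dense covariance over one camera pose and all landmarks."""
    n = pose_dim + landmark_dim * num_landmarks
    return n * n

for m in (10, 100, 1000, 5000):
    print(m, joint_covariance_entries(m))
```

Under this assumption the storage, and the work needed to update it, grows quadratically with the number of landmarks.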