In the field of autonomous navigation and computer vision, free space is defined as an area around a moving object, e.g., in front of or behind a vehicle, boat, or robot, in which the object can manoeuvre without colliding with other objects. Free space is also called drivable space.
Using accurate maps and localization systems, autonomous navigation provides incremental directions that guide the moving object from point A to point B without colliding with any obstacles along its path. This requires knowing the information that is critical for avoiding obstacles, together with a cost-effective way to obtain that information.
The most critical information for autonomous navigation is the free space. It is well known that free space can be estimated using stereo cameras; for example, a stereo camera system can estimate a ground plane and the obstacles above it. The concept of occupancy grids is closely related to free space estimation. An occupancy grid is a two-dimensional (2D) grid in which every cell models the occupancy evidence for the corresponding part of the environment. It is typically estimated using a three-dimensional (3D) sensor that measures distances on a planar slice of the environment, such as a scanning LIDAR or an array of ultrasound sensors.
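One common way to maintain such a grid, though not one prescribed here, is to store log-odds per cell and fuse each planar range scan ray by ray: cells that a beam passes through receive free evidence, and the cell where the beam returns receives occupied evidence. The Python sketch below assumes a simple inverse sensor model with illustrative probabilities 0.7 and 0.3, a sensor position in metres with bearings expressed in the grid frame, and a grid indexed as `grid[i][j]` with `i` along x and `j` along y. The names `update_grid`, `bresenham`, and `occupancy_probability` are hypothetical.

```python
import math

# Illustrative inverse sensor model (assumed values, not from the source):
# a beam endpoint is occupied with p = 0.7; cells the beam crosses are occupied with p = 0.3.
L_OCC = math.log(0.7 / 0.3)
L_FREE = math.log(0.3 / 0.7)


def bresenham(x0, y0, x1, y1):
    """Return the integer grid cells on the segment (x0, y0)-(x1, y1), endpoints included."""
    cells = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        cells.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return cells
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def update_grid(logodds, sensor_xy, scan, cell_size):
    """Fuse one planar range scan into a log-odds grid, in place.

    logodds[i][j] covers x in [i, i+1) * cell_size and y in [j, j+1) * cell_size.
    scan is a list of (bearing_rad, range_m) pairs measured from sensor_xy.
    """
    px, py = sensor_xy
    start = (math.floor(px / cell_size), math.floor(py / cell_size))
    nx, ny = len(logodds), len(logodds[0])
    for bearing, rng in scan:
        ex = px + rng * math.cos(bearing)
        ey = py + rng * math.sin(bearing)
        ray = bresenham(*start, math.floor(ex / cell_size), math.floor(ey / cell_size))
        for k, (i, j) in enumerate(ray):
            if not (0 <= i < nx and 0 <= j < ny):
                break  # stop once the ray leaves the mapped area
            # Free evidence along the beam, occupied evidence at the return.
            logodds[i][j] += L_OCC if k == len(ray) - 1 else L_FREE


def occupancy_probability(l):
    """Convert a cell's log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Working in log-odds turns each Bayesian update into an addition, and a cell that no beam has touched stays at log-odds 0, i.e., probability 0.5 (unknown). The cells whose probability falls well below 0.5 form the estimated free space.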
A stixel world representation has been used for the free space estimation problem. The stixel world is a simplified model of the world consisting of a ground plane and a set of vertical sticks on the ground that represent the obstacles. The model can compactly represent an image using two curves: a first curve runs on the ground plane and encloses the largest free space in front of the camera, and a second curve indicates the height (vertical coordinates) of all the vertical obstacles at the boundary of the free space. The stixel world can be determined from depth maps obtained with stereo cameras. Several algorithms compute depth maps from stereo images, such as the semi-global stereo matching (SGM) method. Stixels can also be determined from stereo images using dynamic programming (DP), without explicitly estimating depth maps. All of these techniques determine depth, either implicitly or explicitly, using a stereoscopic or 3D sensor.
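The column-wise dynamic programming idea can be illustrated with a simplified free space boundary: choose one boundary row per image column so as to minimize a per-pixel cost plus a penalty on jumps between adjacent columns. This is a Viterbi-style sketch, not the stixel algorithm itself; the cost matrix `cost[row][col]` (for example, derived from disparity or appearance evidence) is assumed to be given, and `free_space_boundary` and `smooth` are hypothetical names.

```python
def free_space_boundary(cost, smooth):
    """Pick one boundary row per column by dynamic programming.

    Minimizes sum_c cost[r_c][c] + smooth * sum_c |r_c - r_(c-1)|, where cost[row][col]
    is an assumed per-pixel cost of the boundary passing through that pixel.
    Runs in O(W * H^2) for an H x W cost matrix.
    """
    rows, cols = len(cost), len(cost[0])
    acc = [cost[r][0] for r in range(rows)]   # best total cost ending at row r in column 0
    back = [[0] * rows for _ in range(cols)]  # back[c][r]: best row in column c-1
    for c in range(1, cols):
        new_acc = []
        for r in range(rows):
            # Best predecessor row, charging a penalty proportional to the jump size.
            p = min(range(rows), key=lambda q: acc[q] + smooth * abs(r - q))
            back[c][r] = p
            new_acc.append(cost[r][c] + acc[p] + smooth * abs(r - p))
        acc = new_acc
    r = min(range(rows), key=acc.__getitem__)  # cheapest row in the last column
    path = [r]
    for c in range(cols - 1, 0, -1):  # backtrack to column 0
        r = back[c][r]
        path.append(r)
    return path[::-1]
```

With `smooth = 0` each column simply takes its cheapest row; increasing `smooth` lets strong evidence in neighbouring columns override a noisy column. Because the penalty is the absolute row difference, the inner minimization could be replaced by a linear-time distance transform to reach O(W * H), at the cost of a less transparent sketch.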
To reduce system complexity and cost, it is desirable to determine the free space from a sequence of images, i.e., a video, acquired by a monocular camera mounted on the moving object. Using monocular videos instead of stereo videos raises several challenges. Unlike many other segmentation problems, free space estimation cannot rely solely on color or edges. For example, videos of roads often contain strong gradients from crosswalks and lane markings. On water, there are often reflections of nearby boats, buildings, or the sky. Features based on homography, which assume a planar road, can be inaccurate on non-flat roads. Furthermore, moving objects in the scene pose additional challenges for monocular free space estimation.
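To see why homography-based features depend on a planar road, consider the homography that a road plane induces between two frames. Assuming the plane satisfies n^T X = d in the first camera's coordinates and the camera motion is X2 = R X1 + t, image points on that plane are related by H = K (R + t n^T / d) K^-1, while points off the plane are not. The sketch below uses illustrative intrinsics, camera height, and motion, hypothetical helper names, and pure-Python 3x3 arithmetic to measure the transfer error for a point on the assumed plane and for a point on a road surface 0.2 m above it.

```python
def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def plane_homography(K, K_inv, R, t, n, d):
    """H = K (R + t n^T / d) K^-1 for the plane n^T X = d and motion X2 = R X1 + t."""
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(K, M), K_inv)


def apply(M, v):
    """3x3 matrix times 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def dehomogenize(x):
    return (x[0] / x[2], x[1] / x[2])


def project(K, X):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    return dehomogenize(apply(K, X))


def transfer_error(H, uv1, uv2):
    """Pixel distance between H applied to uv1 and the observed uv2."""
    u, v = dehomogenize(apply(H, [uv1[0], uv1[1], 1.0]))
    return ((u - uv2[0]) ** 2 + (v - uv2[1]) ** 2) ** 0.5


# Illustrative setup: focal length 500 px, principal point (320, 240), y axis pointing down,
# camera 1.5 m above a flat road (plane y = 1.5), camera moves 1 m forward between frames.
f, cx, cy = 500.0, 320.0, 240.0
K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
K_inv = [[1 / f, 0.0, -cx / f], [0.0, 1 / f, -cy / f], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, -1.0]  # X2 = X1 - (0, 0, 1): static scene points get 1 m closer
H = plane_homography(K, K_inv, R, t, n=[0.0, 1.0, 0.0], d=1.5)


def error_for_point(X1):
    X2 = [X1[i] + t[i] for i in range(3)]  # R is the identity here
    return transfer_error(H, project(K, X1), project(K, X2))


on_plane = error_for_point([1.0, 1.5, 10.0])  # point on the assumed flat road
bump = error_for_point([1.0, 1.3, 10.0])      # road surface 0.2 m higher (non-flat)
```

The on-plane point is transferred exactly (up to floating-point error), while the point 0.2 m above the assumed plane misses by more than a pixel even in this mild setting. The residual grows with the height deviation and the camera motion, which is why homography cues become unreliable on non-flat roads.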
It is known how to perform geometric layout estimation from single images. The pixels in a given image can be classified into ground, buildings, and sky, and this classification has been used to obtain pop-up 3D models of buildings. A scene can also be modeled using two horizontal curves that partition an image into top, middle, and bottom regions. It has been shown that this segmentation problem can be solved with a globally optimal method.
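A minimal sketch of the two-curve model: if each pixel has a cost for being labelled top, middle, or bottom (e.g., sky, buildings, ground), then in a single column the two curves reduce to a pair of rows a <= b, and prefix sums let every pair be scored exactly. This simplified version ignores any coupling between adjacent columns, so each column is solved independently and the result is globally optimal only for that simplified energy. The per-pixel costs and the names `best_split` and `two_curves` are assumptions, not the cited method.

```python
def best_split(column_costs):
    """Optimal (a, b) for one column.

    column_costs[r] = (cost_top, cost_middle, cost_bottom) for row r, rows ordered top to bottom.
    Rows [0, a) are labelled top, [a, b) middle, [b, H) bottom, with 0 <= a <= b <= H.
    """
    H = len(column_costs)
    # prefix[k][i] = sum of label-k costs over rows 0 .. i-1
    prefix = [[0.0] * (H + 1) for _ in range(3)]
    for k in range(3):
        for i, row in enumerate(column_costs):
            prefix[k][i + 1] = prefix[k][i] + row[k]
    best = None
    # Exhaustive search over all O(H^2) curve positions; each is scored in O(1).
    for a in range(H + 1):
        for b in range(a, H + 1):
            total = prefix[0][a] + (prefix[1][b] - prefix[1][a]) + (prefix[2][H] - prefix[2][b])
            if best is None or total < best[0]:
                best = (total, a, b)
    return best[1], best[2]


def two_curves(columns):
    """Apply best_split to every column; returns the upper and lower curves as row indices."""
    splits = [best_split(col) for col in columns]
    return [a for a, _ in splits], [b for _, b in splits]
```

Adding a smoothness term between neighbouring columns would couple the columns, and the search would then become a dynamic program over column-wise states (a, b) rather than an independent per-column search.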
The general idea of using dynamic programming for column-wise matching has been used to estimate 3D models of buildings, and has been generalized to several layers of height maps for modeling urban scenes.
Monocular videos have been used by simultaneous localization and mapping (SLAM) methods. Most of those methods provide a sparse point cloud and do not explicitly estimate the free space, which is the most critical information for autonomous navigation.
To the best of our knowledge, no prior-art computer vision technique estimates free space for boats on water. Segmenting water in an image of a scene is particularly challenging because of its specular properties, such as reflections of nearby obstacles or the sky on the water surface. Features such as color and edges perform poorly in such cases.