From predicting the weather to planning the future of our cities to recovering from natural disasters, accurately monitoring widespread areas of the Earth's surface is essential to many scientific fields and to society in general. These observations have traditionally been collected through remote sensing from satellites, aerial imaging, and distributed observing stations and sensors.
These approaches can observe certain properties, such as land cover and land use, accurately and at high resolution, but unfortunately not everything can be seen from overhead imagery. For example, Wang et al., in “Torontocity: Seeing the world with a million eyes,” arXiv:1612.00423 (2016), evaluate approaches for urban zoning and building height estimation from overhead imagery, and conclude that urban zoning segmentation “is an extremely hard task from aerial views,” and that building height estimation is “either too hard, or more sophisticated methods are needed.”
More recently, the explosive popularity of geotagged social media has raised the possibility of using online user-generated content as a source of information about geographic locations. This approach is sometimes referred to as image-driven mapping or proximate sensing. Mathematically, the result of this process can be represented as a geospatial function that takes a geographic location as input and produces as output either a value of interest or a probability distribution over that value.
For example, online images from social network and photo sharing websites have been used to estimate land cover for large geographic regions, to observe the state of the natural world by recreating maps of snowfall, and to quantify perception of urban environments.
Despite their differing applications, the prior approaches to proximate sensing each estimate the geospatial function, and view each social media artifact (e.g., a geotagged ground-level image) as an observation of the value of this function at a particular geographic location.
These typical approaches to proximate sensing (1) collect a large number of samples, (2) use an automated approach to estimate the value of the geospatial function for each sample, and (3) use some form of locally weighted averaging to interpolate the sparse samples into a dense, coherent estimate of the underlying geospatial function. This estimation is complicated by two facts: the observations are noisy, because state-of-the-art recognition algorithms are imperfect and some images are inherently confusing or ambiguous, and the observations are distributed sparsely and non-uniformly. Accordingly, in order to estimate geospatial functions with reasonable accuracy, most techniques use a kernel with a large bandwidth to smooth out the noise. These approaches thus yield undesirably coarse, low-resolution outputs that are insufficient for many applications.
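The locally weighted averaging in step (3) can be illustrated with a minimal sketch. It assumes a Gaussian kernel, planar (x, y) coordinates, and scalar per-sample values; none of these choices is specified above, and the names gaussian_kernel and smooth_estimate are hypothetical. The sketch is a generic kernel-weighted average, not the method of any particular prior approach.

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Weight for a sample at the given distance (assumed Gaussian kernel)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def smooth_estimate(query, samples, bandwidth):
    """Kernel-weighted average of per-sample values at a query location.

    query     -- (x, y) location, in assumed planar coordinates
    samples   -- list of ((x, y), value) pairs, one per geotagged observation
    bandwidth -- kernel width; larger values average over a wider area
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in samples:
        distance = math.hypot(query[0] - x, query[1] - y)
        weight = gaussian_kernel(distance, bandwidth)
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        # Every sample is so far away that its weight underflowed to zero.
        raise ValueError("no samples within reach of the kernel")
    return weighted_sum / weight_total


# Two hypothetical observations: value 0.0 at x=0 and value 1.0 at x=10.
samples = [((0.0, 0.0), 0.0), ((10.0, 0.0), 1.0)]

fine = smooth_estimate((0.0, 0.0), samples, bandwidth=1.0)
coarse = smooth_estimate((0.0, 0.0), samples, bandwidth=100.0)
print(fine, coarse)
```

With the small bandwidth, the estimate at the first location stays close to the local observation. With the large bandwidth, it is pulled toward the average of both observations, which is the loss of spatial resolution described above.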
Many recent studies have explored analyzing large-scale image collections as a means of characterizing properties of the physical world.
Estimating properties of weather from geotagged and timestamped ground-level imagery has also been proposed. However, these proposals do not utilize novel techniques for proximate sensing, but rather utilize the prior approaches, in which standard recognition techniques are applied to individual images, and then spatial smoothing and other noise reduction techniques are used to create an estimate of the geospatial function across the world.
Meanwhile, remote sensing has used computer vision to estimate properties of the Earth from satellite images. Overhead imaging is, however, markedly different from ground-level imaging, and so remote sensing techniques have largely been developed independently and in task-specific ways. As such, a framework for estimating geospatial functions by combining visual evidence from both ground-level and overhead images has not been pursued.
Indeed, while it has been proposed to use visual evidence from ground-level or overhead images, or location context, in order to improve classification or give context for event recognition in ground-level or overhead images, these proposals do not combine visual evidence from both ground-level and overhead images to estimate geospatial functions.
In contrast, the present invention is directed to a system that can estimate any given geospatial function of the world by integrating data from both ground-level imagery (which often contains visual evidence that is not visible from the air) and overhead imagery (which is typically much more densely sampled), and that learns in an end-to-end way, avoiding the need for task-specific or hand-engineered features.