Large datasets are often licensed by dataset consumers. As an example, a geospatial dataset provider may place 900 terabytes (TB) of LiDAR data representing the United States (US) into a distributed database. The geospatial dataset provider may then sell access to the dataset to dataset consumers. Dataset consumers may, as an example, use the dataset to generate digital elevation models.
Because of the size of some datasets or out of a desire to retain control of some datasets, access to some datasets may include running applications against the datasets directly. Accordingly, the dataset may remain in a storage medium or storage service with an application accessing the storage medium directly, rather than the dataset being duplicated onto another storage medium and sent to a dataset consumer for processing locally by the dataset consumer.
Map-reduce is a programming model for processing large datasets using a parallel, distributed method on a cluster of processing devices. A map-reduce application includes a map procedure that performs filtering and sorting and a reduce procedure that performs a summary operation. A map-reduce application generally runs on a cluster in parallel fashion. One framework implementation for running map-reduce applications is APACHE™ HADOOP®. The map-reduce programming model may be applied to the large datasets described above in order to provide desired processing results from the large datasets.