The need to extract knowledge from data collected on a global scale continues to grow. In many cases the data may be dispersed across multiple geographic locations, owned by different entities, and stored in different formats. Although numerous distributed data processing frameworks exist today, these frameworks have significant drawbacks. For example, data-intensive computing tasks often use data processing frameworks such as MapReduce or Spark. However, these frameworks typically require deployment of a distributed file system shared by all of the processing nodes, and are therefore limited to data that is accessible via that shared distributed file system. Such a file system can be difficult to configure and maintain across multiple geographically dispersed local sites, particularly when those sites are also subject to the above-noted differences in ownership and data format. In the absence of a shared distributed file system, conventional arrangements may require that data collected from sources in different geographic locations be copied from the respective local sites to a single centralized site configured to perform data analytics. Such an arrangement is not only slow and inefficient, but can also raise serious privacy concerns regarding the copied data.