Analyzing large data sets, such as biological data, may use extensive computing resources and may involve large amounts of sensitive data. For example, gene sequences (e.g., genomics data coming from next generation sequencing (NGS) machines) form large, highly complex data sets that include sensitive personal data. Given the breadth of these data sets, many entities store their data on a variety of computing resources, such as local servers, cloud repositories, and the like. Currently, analyzing a large data set often requires copying the data and sending it to the machine or machines that will perform the computational analysis. This approach consumes computing resources unnecessarily to copy, transfer, and download data, makes sensitive data more prone to a security or regulatory breach, and incurs additional costs associated with downloading data (e.g., cloud egress charges). Additionally, many of the computing resources, such as a local database and a cloud repository, have to be accessed separately from one another, preventing streamlined analysis across multiple data sources and/or databases.
It is therefore desirable to provide a system for performing computational data analysis over data sets that are distributed across different storage locations. Input data for analysis resides on a secure remote storage location. Analysis servers select computational machines to perform analysis on the input data using a pipeline and to create a secure cluster on the secure remote storage location. The selected computational machines perform the analysis on the input data using the pipeline by streaming the input data to the secure cluster during analysis.
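The flow described above (select machines, then analyze the input data as it streams from storage instead of copying it first) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than part of the described system: `RemoteStore` simulates the secure remote storage location in memory, `Machine` and `select_machines` model machine selection by a simple largest-first rule, and `gc_fraction` is an arbitrary example pipeline step. Security, cluster creation, and networking are deliberately omitted.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Machine:
    """Hypothetical description of a computational machine."""
    name: str
    free_cores: int


class RemoteStore:
    """In-memory stand-in for a secure remote storage location (assumption)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def stream(self, key: str, chunk_size: int = 4) -> Iterator[bytes]:
        # Yield the object piece by piece so the consumer never
        # needs a full local copy of the data set.
        data = self._objects[key]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]


def select_machines(machines: List[Machine], cores_needed: int) -> List[Machine]:
    """Pick machines, largest first, until the requested cores are covered."""
    chosen: List[Machine] = []
    total = 0
    for m in sorted(machines, key=lambda m: m.free_cores, reverse=True):
        if total >= cores_needed:
            break
        chosen.append(m)
        total += m.free_cores
    if total < cores_needed:
        raise RuntimeError("not enough capacity for the requested pipeline")
    return chosen


def gc_fraction(chunks: Iterator[bytes]) -> float:
    """Example pipeline step: fraction of G/C bases, computed incrementally."""
    gc = 0
    total = 0
    for chunk in chunks:
        gc += sum(1 for base in chunk if base in b"GC")
        total += len(chunk)
    return gc / total if total else 0.0


# Usage: choose machines, then run the pipeline over streamed chunks.
store = RemoteStore({"sample1": b"ATGCGGCCAT"})
cluster = select_machines(
    [Machine("a", 2), Machine("b", 8), Machine("c", 4)], cores_needed=10
)
result = gc_fraction(store.stream("sample1"))
```

Because the pipeline step consumes an iterator, the memory needed on the analysis side is bounded by the chunk size rather than by the size of the data set, which is the property that avoids the copy-and-download step criticized above.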