The present invention relates generally to provisioning data to distributed computing systems, and more specifically to provisioning data to such systems via a network from remote data storage at a data-provider site.
Processing large amounts of data requires correspondingly large amounts of computing power and storage. Owners of large data sets commonly use distributed computing systems to process the data and extract useful information. Distributed computing systems typically employ a plurality of networked node computers for parallel processing of data. Such systems provide for distributed storage of the data to be processed and can support sophisticated parallel processing applications for efficient extraction of information from massive data sets. Distributed computing systems are commonly implemented in datacenters containing multiple high-performance computers which can be deployed as a scalable infrastructure to accommodate data-intensive and compute-intensive parallel processing applications according to data-provider requirements. Some distributed computing systems are private systems dedicated to particular data providers. Others allow resources to be shared by different data providers. Both types of system can be offered as a rental facility in a cloud computing environment, allowing data owners to obtain information from their data sets without prohibitive investment in the necessary computing resources.
The scalability of distributed computing systems allows the processing necessary to transform data sets into useful information to be sped up. Generally speaking, the more data that is available, the more information, and thus value, that can be extracted. The flexibility of distributed computing systems allows a data owner to scale out the infrastructure dedicated to processing data sets as they grow, reducing the time required to extract valuable information. This is a desirable property, in that the increasing value (size) of the data allows the owner to spend more on the infrastructure, thus keeping the time-to-value ratio constant. A remaining problem, however, is how to make the data available at the computing facility quickly enough that the data transfer does not negate the gains from scaling out the processing. The data set has to be transferred to the computing facility, which is typically off-site, via a network. As the amount of data grows and the processing itself becomes more scalable, this transfer step consumes more and more of the time required to obtain the desired results from the data, thus decreasing value. Rentals for cloud computing systems, for example, are commonly available on an hourly basis, so spending more and more time transferring a data set for processing is highly undesirable, delaying availability of results and reducing value to the data owner.
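By way of illustration only, the effect described above can be sketched numerically. The following is a minimal model, not part of any described system: the function name `transfer_fraction`, the 1 Gbit/s link speed, the per-node throughput of 0.5 TB per hour, and the assumption of ideal linear scaling are all hypothetical values chosen for the example.

```python
def transfer_fraction(data_tb, nodes, link_gbps=1.0, node_tb_per_hour=0.5):
    """Return (transfer_hours, processing_hours, share of total time spent on transfer).

    All parameter defaults are illustrative assumptions, not measured values.
    """
    # Transfer time depends only on data size and network bandwidth.
    data_bits = data_tb * 8e12                      # 1 TB = 8e12 bits (decimal units)
    transfer_hours = data_bits / (link_gbps * 1e9) / 3600

    # Processing time shrinks as nodes are added (ideal linear scaling assumed).
    processing_hours = data_tb / (node_tb_per_hour * nodes)

    total_hours = transfer_hours + processing_hours
    return transfer_hours, processing_hours, transfer_hours / total_hours


# Grow the data set and scale out the node count in proportion.
for data_tb, nodes in [(10, 10), (20, 20), (40, 40)]:
    t, p, share = transfer_fraction(data_tb, nodes)
    print(f"{data_tb} TB on {nodes} nodes: transfer {t:.1f} h, "
          f"processing {p:.1f} h, transfer share {share:.0%}")
```

Under these assumptions, scaling the node count with the data keeps the processing time constant, while the transfer time grows in proportion to the data size, so the transfer accounts for an ever larger share of the total time to results.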