In the digital age, organizations increasingly rely on digitally stored data. Furthermore, organizations are increasingly using very large data sets for various applications. Continuing improvements in storage technology mean that many of the previous barriers to managing large data sets are disappearing, allowing even relatively small organizations to store and process large databases. In some cases, scale-out, high-performance databases serving live applications may store petabytes of data across tens of thousands of nodes.
However, as distributed storage techniques facilitate the explosive growth of production data sets, traditional systems may leave the costs associated with very large data sets unaddressed. For example, various secondary uses of production data sets (e.g., backing up the data sets, using the data sets for developing new features for primary applications, etc.) may impose costs on production systems (potentially adversely affecting the performance of the primary applications that make use of the data sets) and/or on the computational infrastructure used to provide production data sets to secondary applications. In addition, secondary applications themselves may suffer performance issues as the time required to provide access to the data sets increases.
The instant disclosure, therefore, identifies and addresses a need for systems and methods for provisioning distributed datasets.