The present invention relates to distributed sampling, and more particularly, this invention relates to sampling using block-partitioned matrices.
Advanced analyses such as cross-validation (CV) and ensemble learning (EL) require random samples of the data to be generated. Supporting the most commonly used CV and EL approaches therefore calls for a general sampling framework that accommodates a wide variety of sampling techniques.
The most complicated form of sampling is sampling with replacement, in which each drawn observation is returned to the dataset before the next draw, so that the same observation may appear more than once in a sample. This is further complicated in the big data setting, where the dataset is commonly stored in a distributed, blocked format.
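As an illustration only, the difference between sampling without and with replacement may be sketched on a small in-memory list. The names `observations`, `without_replacement`, and `with_replacement` are hypothetical, and the sketch uses Python's standard `random` module rather than any distributed framework.

```python
import random

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
observations = ["a", "b", "c", "d", "e"]

# Without replacement: each observation can appear at most once,
# so the sample can never be larger than the dataset.
without_replacement = rng.sample(observations, k=5)

# With replacement: every draw sees the full dataset again,
# so repeats are possible and the sample may exceed the dataset size.
with_replacement = rng.choices(observations, k=8)

print(without_replacement)
print(with_replacement)
```

Because eight draws are taken from only five distinct observations, the sample drawn with replacement necessarily contains at least one repeated observation.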
Sampling with replacement in a big data setting is problematic for several reasons. First, a single assignment table must be materialized in a manner that maps observations to positions in the samples. Second, distributed join operations between data matrices and assignment tables may introduce inefficiencies. Third, re-blocking must be done differently depending on the join strategy, and re-blocking the join results consumes significant resources.
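The three steps named above (materializing an assignment table, joining it with the blocked data, and re-blocking the result) may be sketched on a single machine as follows. This is a minimal sketch under stated assumptions, not the method of the invention: the helpers `make_assignment_table`, `partition_rows`, and `join_and_reblock` are hypothetical, the "blocks" are plain Python lists of rows keyed by block index, and no actual distribution takes place.

```python
import random

def make_assignment_table(n_rows, sample_size, seed=0):
    """Materialize (source_row, sample_position) pairs, drawn with replacement."""
    rng = random.Random(seed)
    return [(rng.randrange(n_rows), pos) for pos in range(sample_size)]

def partition_rows(rows, block_size):
    """Split rows into consecutive row blocks keyed by block index."""
    return {start // block_size: rows[start:start + block_size]
            for start in range(0, len(rows), block_size)}

def join_and_reblock(blocks, table, block_size):
    """Join blocked data with the assignment table, then re-block the sample."""
    sample = [None] * len(table)
    for src, pos in table:
        # Locate the source row inside its block (assumed layout: row-wise blocks).
        block_id, offset = divmod(src, block_size)
        sample[pos] = blocks[block_id][offset]
    # The joined rows are in sample order, so they must be cut into blocks again.
    return partition_rows(sample, block_size)

data = [[float(i), float(i * i)] for i in range(10)]  # toy 10 x 2 data matrix
blocks = partition_rows(data, block_size=4)
table = make_assignment_table(len(data), sample_size=10, seed=42)
sample_blocks = join_and_reblock(blocks, table, block_size=4)
```

Even in this toy form, every sampled row requires a lookup into a block that may differ from the block it ends up in, which hints at why the join and the subsequent re-blocking become expensive once the blocks reside on different machines.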