The present invention relates generally to stratified sampling of large datasets, and more specifically to using adaptive parallel data processing techniques to perform stratified sampling of large datasets.
Enterprises are not only collecting increasing amounts of data but are also maintaining large historical archives on the order of petabytes. Processing such data to derive useful information and interesting patterns is a challenging task, especially under time and resource constraints. The sheer volume of data is a major contributing factor to this difficulty. Sampling has been established as an effective tool for reducing the size of the input data.
Many advanced analytical tasks have time and resource constraints that can be satisfied only by using sampling techniques. In particular, the overall population in a massive dataset often contains groups, called strata, with varying characteristics. It is often advantageous to sample each stratum independently. Doing so improves the representativeness of the sample, reduces the sampling error, and provides approximate aggregates with much less variability than a simple random sample of the whole population.
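By way of illustration only, independent per-stratum sampling may be sketched as follows. This is a minimal single-machine sketch, not the adaptive parallel technique of the invention. The function name `stratified_sample`, the `stratum_of` key function, the fixed sampling `fraction`, and the example data are all hypothetical and are introduced solely for this example.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw a simple random sample of the given fraction from each stratum.

    All names and parameters here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Group the records by stratum.
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    # Sample each stratum independently, without replacement.
    sample = {}
    for key, members in strata.items():
        # Keep at least one record so that small strata stay represented.
        k = max(1, round(fraction * len(members)))
        sample[key] = rng.sample(members, k)
    return sample


# Hypothetical population: a large "east" stratum and a small "west" stratum.
records = [("east", i) for i in range(1000)] + [("west", i) for i in range(50)]
sample = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.1)
sizes = {key: len(members) for key, members in sample.items()}
print(sizes)
```

In this sketch every stratum contributes in proportion to its size, so the small "west" stratum is guaranteed to be represented, whereas a simple random sample of the same total size drawn from the whole population could by chance include far fewer "west" records or none at all.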