Big data applications that use “cloud”-based techniques to process and store data may handle data volumes ranging from hundreds of gigabytes to terabytes or petabytes. For example, extract, transform, and load (“ETL”) applications may use Hadoop MapReduce (an Apache open source framework) to process big data sets across a cluster of computers using the Hadoop Distributed File System (“HDFS”). Software developers may write Apache Hive or Pig scripts for reporting and analytics purposes, and these scripts may be translated into MapReduce programs that run on top of Hadoop. Validating such big data applications in agile development processes may entail using test data sets derived from big data sources. These test data sets may be generated manually by, e.g., project managers, architects, developers, and testers, or by using random test data generation tools, but in either case the test data sets may not provide effective coverage or include quality test data.
Where considered appropriate, reference numerals may be repeated among the drawings to indicate corresponding or analogous elements. Moreover, some of the blocks depicted in the drawings may be combined into a single function.