Large-scale data processing may include extracting data of interest from raw data in one or more datasets and processing the extracted data into a data product. Implementing large-scale data processing in a parallel and distributed processing environment may include distributing the data and the computations across multiple disks and processors to make use of their aggregate storage space and computing power.
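As a minimal illustration of this pattern, the sketch below partitions a small set of raw records, extracts the data of interest from each partition concurrently, and merges the partial results into a single data product. Everything in it is an assumption made for illustration: the record format (`LEVEL CODE`), the function names `partition`, `extract`, and `build_product`, and the choice of error-code counts as the data product. A thread pool on one machine stands in for the multiple processors of a distributed environment; a real deployment would also spread the data across multiple disks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split the raw records into n_parts roughly equal chunks (hypothetical helper)."""
    return [records[i::n_parts] for i in range(n_parts)]


def extract(chunk):
    """Extract the data of interest from one chunk: here, counts of error codes."""
    counts = Counter()
    for line in chunk:
        # Assumed record format: "<LEVEL> <CODE>", e.g. "ERROR E42".
        level, _, code = line.partition(" ")
        if level == "ERROR":
            counts[code] += 1
    return counts


def build_product(records, n_workers=4):
    """Process the chunks concurrently and merge the partial results."""
    chunks = partition(records, n_workers)
    product = Counter()
    # Threads stand in for separate processors; each worker handles one chunk.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(extract, chunks):
            product.update(partial)
    return product


raw = ["ERROR E42", "INFO ok", "ERROR E7", "ERROR E42", "WARN slow"]
print(build_product(raw, n_workers=2))
```

Because each chunk is processed independently and the partial counts are merged only at the end, the same structure could in principle be spread over more workers without changing `extract`.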