Large-scale data processing involves extracting data of interest from raw data in one or more datasets and processing it into a useful data product. The implementation of large-scale data processing in a parallel and distributed processing environment typically includes the distribution of data and computations among multiple disks and processors to make efficient use of aggregate storage space and computing power.
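By way of illustration only, the distribution of data and computation described above can be sketched in a few lines of Python. The sketch rests on assumptions that do not come from the text: the function names (`partition`, `extract`, `process`), the "ERROR" filter, and the sample records are all hypothetical, and a local thread pool stands in for the multiple processors of a real distributed environment.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_workers):
    """Split the raw records into num_workers slices (round-robin)."""
    return [records[i::num_workers] for i in range(num_workers)]


def extract(chunk):
    """Keep only the data of interest.

    Hypothetical criterion: records containing the substring 'ERROR'.
    """
    return [record for record in chunk if "ERROR" in record]


def process(records, num_workers=4):
    """Extract in parallel over the partitions, then merge into one result."""
    chunks = partition(records, num_workers)
    # Threads are a stand-in for separate processors or machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(extract, chunks))
    # Merge step: the "data product" here is simply a count of matches.
    return sum(len(part) for part in partial_results)


# Hypothetical sample data.
raw = ["ok 1", "ERROR disk", "ok 2", "ERROR net", "ERROR cpu"]
error_count = process(raw, num_workers=2)
```

In this sketch the programmer writes the partitioning, scheduling, and merging by hand. Fault handling and I/O scheduling are omitted entirely.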
Various languages and systems provide application programmers with tools for querying and manipulating large datasets. These conventional languages and systems, however, do not automatically parallelize such operations across multiple processors in a distributed and parallel processing environment. They also do not automatically handle system faults (e.g., processor failures) or I/O scheduling, and they do not efficiently handle the analysis of data records.