The present invention relates generally to efficient techniques for the analysis of large data sets, and more particularly to a technique that enables programs written in the R-language to support efficient operations on large data sets and large computations.
The R-language (R) is a statistical computing environment. R includes a variety of statistical and graphical capabilities that may be applied to arrays of data, e.g., scalars, vectors, matrices, tables, and other data structures. R-language programs commonly perform statistical analysis on data sets that are regular in nature and that either fit into, or can be paged into, the main memory of the computer on which the data is operated. R is designed to run in the main memory of a single computer in a single thread of execution. R is easily extensible and enables code written in other programming languages to be packaged into an R program.
MapReduce is a distributed data processing model that provides for the partitioning and distribution of computation and data over large clusters of servers, and enables a computation expressed as a MapReduce job to be executed in parallel on a plurality of computers. One example of a system that performs MapReduce computations is Hadoop®. Hadoop® is an open-source system that runs MapReduce jobs on clusters of computers that access the Hadoop® distributed file system (HDFS), which runs on large clusters of commodity machines.
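By way of illustration only, the MapReduce model described above may be sketched in a single process as follows. The sketch is not Hadoop® code and uses no Hadoop® interfaces; the function names (partition, map_phase, shuffle, reduce_phase, run_job) and the word-count task are hypothetical and chosen solely for this example. The partitions are processed sequentially here, whereas on a cluster each partition would be handled by a separate computer.

```python
from collections import defaultdict
from itertools import chain


def partition(records, num_partitions):
    """Split the input into chunks, one per simulated worker (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one partition."""
    return [(word, 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all partitions."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_chunks):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records, num_partitions=2):
    """Run the whole job; the map calls are independent and could run in parallel."""
    chunks = partition(records, num_partitions)
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


lines = ["a b a", "b c", "a"]
counts = run_job(lines, num_partitions=2)
print(counts)
```

Because each call to map_phase depends only on its own partition, the result does not depend on how many partitions are used, which is the property that permits the computation to be distributed over a plurality of computers.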
SystemML compiles and automatically parallelizes machine learning (ML) algorithms that are written in the declarative machine learning language (DML), a high-level language oriented to ML tasks. SystemML produces sets of MapReduce jobs that can execute on Hadoop®.