Data analysts and statisticians are interested in performing statistical analysis on large-scale datasets such as crawled documents, web content, web log requests, search traffic, and advertisement impressions. These large-scale datasets, which may be obtained from the Internet, are often multiple terabytes in size.
Processing a large-scale dataset may include parallel processing, which generally involves performing some operation over each element of a dataset. The various operations may be chained together in a data-parallel pipeline to create an efficient mechanism for processing a dataset. Conventional statistical data analysis tools may not handle such massive amounts of data well.
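The idea of chaining per-element operations into a data-parallel pipeline can be illustrated with a minimal sketch in Python. It is an illustration only, not the mechanism of any particular system: the helper names `chain` and `parallel_map`, the stage functions, and the sample log lines are all hypothetical, and a thread pool on one machine stands in for whatever distributed workers a real large-scale system would use.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chain(*stages):
    """Compose per-element stages into one function, applied left to right."""
    def run(element):
        return reduce(lambda value, stage: stage(value), stages, element)
    return run


def parallel_map(func, dataset, workers=4):
    """Apply func to every element independently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, dataset))


# Hypothetical web log request lines (illustrative data only).
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /search 200",
]


def parse(line):
    """Stage 1: split a raw log line into fields."""
    return line.split()


def status(fields):
    """Stage 2: extract the numeric status code (last field)."""
    return int(fields[-1])


def is_ok(code):
    """Stage 3: flag successful requests."""
    return code == 200


# Chain the three stages, then run the chained function over every element.
pipeline = chain(parse, status, is_ok)
flags = parallel_map(pipeline, log_lines)
```

Because each element is processed without reference to any other, the same chained function could in principle be applied to separate partitions of a dataset on separate workers. This sketch does not attempt to show partitioning or distribution.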
Data analysts primarily use the R programming language for statistical data analysis because R provides more advanced statistical features than other programming languages. R is a dynamically typed, interactive, interpreted programming language used by analysts for statistical computing and graphics. Unfortunately, R lacks capabilities for working with datasets that are too large to fit into memory.
Although there are a number of R packages that emulate the normal R capabilities, these conventional packages perform large-scale computations so slowly that they are essentially unusable for any meaningful statistical data analysis. As recognized by the inventors, there should be an easy, natural, and powerful programming environment that allows analysts and statisticians to efficiently analyze large-scale datasets.