Big data is a reference to datasets so large that it is challenging to process the datasets with typical database management system (DBMS) tools, and data processing applications. MapReduce systems and massively parallel processing (MPP) systems are two approaches to deal with big data, processing large datasets.
MapReduce is a programming model for processing large datasets with a parallel, distributed processing on a cluster of compute nodes. The MapReduce infrastructure manages the distributed servers of the cluster to run various tasks in parallel; manages communications and data transfers; provides for redundancy and fault tolerance; and, so on. MapReduce-like computations share many computational characteristics of a MPP database, but MPP databases can outperform a similar MapReduce job by more than an order of magnitude. Unfortunately, MPP databases may not scale well, may not be fault-tolerant, and also may not have other MapReduce-like properties.