Map Reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm executing on a cluster of computers. Map Reduce utilizes distributed servers to run various tasks in parallel, while managing all communications and data transfers between the various parts of the system. This provides for redundancy and fault tolerance. The Apache Hadoop platform is an open-source software framework that implements Map Reduce. Data stored in a distributed file system are assumed to be Hadoop Distributed File System (HDFS) or its derivatives.
In-memory data processing is fast. Apache Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster's memory and query it repeatedly. Apache Spark is not tied to the Map Reduce paradigm and has far greater performance in certain circumstances.
It would be desirable to leverage the strengths of in-memory data processing and Map Reduce while executing complex analytics workflows.