Batch processing systems express jobs as a Directed Acyclic Graph (DAG) of tasks (or stages). Each task is partitioned (based on the input data) for parallel processing. The DAG includes producer stages and consumer stages, which rely on the output of the producer stages. The producer stages are executed on producer nodes and the consumer stages are executed on consumer nodes. Intermediary data flows through these tasks in many forms, such as one-to-one, one-to-many, many-to-one, and/or many-to-many. Many group-based operations (such as aggregation, join, and sort) usually require many-to-many data exchanges, referred to as “data shuffle.”
FIG. 1 illustrates an example of data shuffle in a data exchange. Batch systems use data shuffle across computers whenever a data transformation (which operates on a group of rows) uses a new set of key input columns (e.g., Sort, Join, and Aggregate). This data shuffle operation is critical to the performance, fault tolerance, and scalability characteristics of the overall batch system.
The data shuffle phase is network intensive. Most output of a producer stage is sent to the next consumer stage. In Big Data, fault tolerance and task orchestration requirements pose additional challenges. For example, intermediary data is usually saved on disk before being sent. Although disk drives can usually operate at 80 MB/sec of sequential concurrent reads and writes, this throughput decreases dramatically as the number of accessed files increases.
In theory, intermediary data is expected to be read from the Operating System (OS) buffer cache. In practice, this is not the case. Since the cache is shared across file systems, including the Hadoop Distributed File System (HDFS), the OS cannot know what data will be read again soon. (HDFS is a Java-based file system that provides scalable data storage and was designed to span large clusters of commodity servers.)
Additionally, the number of “spill files” for a particular set of tasks depends on multiple factors, such as the number of producers and consumers, the partition size, and the data exchange and orchestration logic. Spill files are files that are created on disk if there is not sufficient memory to execute a command (such as a query) in memory.
An inefficient data exchange impacts the overall runtime of both small and large jobs. For both types of jobs, data is always spilled to disk, and large and small jobs can be executed in parallel. For this reason, optimizers try to filter out as much intermediary data as possible or to eliminate data shuffle altogether (e.g., Map side Join).
FIG. 2 illustrates the map phase performed by a mapper in a Map-Reduce system. The Map-Reduce system will typically include many mappers on one or more producer nodes (producers) which can operate in parallel. At step 1 the input for the map step is read by the mapper. At step 2 the map step is performed. This step maps the input data to the corresponding output which will be input to the reduce step. For example, if the input was a specific data value in a column of a table, the map step could identify other data values occurring in the rows of that particular column of data. At step 3 the output data is sorted and at step 4 the output data is stored (using a hash function) on disk. At step 5 the data can be read and merged so that all of the output data which is designated for a particular reducer (which is on a consumer node) is consolidated into a single file. At step 6 each of these output files can be written to memory. Therefore, the map step for each mapper will produce a file for each reducer which is a consumer of that mapper. Each of these files will contain the corresponding output data for that mapper.
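The map phase described above can be sketched in a few lines. This is only an illustrative, in-memory sketch under stated assumptions: the names `partition_for`, `map_phase`, and `word_count_map` are hypothetical, the use of CRC32 as the hash function is an assumption, and Python lists stand in for the on-disk files.

```python
from collections import defaultdict
import zlib


def partition_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key with a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_phase(records, map_fn, num_reducers):
    """Apply map_fn to every record and return one sorted output list per reducer."""
    buckets = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            # Hash partitioning: every pair with the same key lands in the same bucket.
            buckets[partition_for(key, num_reducers)].append((key, value))
    # One sorted list per reducer stands in for the single consolidated file
    # that the mapper produces for each reducer.
    return {r: sorted(buckets.get(r, [])) for r in range(num_reducers)}


def word_count_map(line):
    """Hypothetical map function: emit (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]


outputs = map_phase(["a b a", "b c"], word_count_map, num_reducers=2)
```

In this sketch every mapper yields exactly one output per reducer, which is why the number of intermediary files grows with both the number of mappers and the number of reducers.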
FIG. 3 illustrates the reduce phase performed by a reducer on a consumer node in a Map-Reduce system. At step 1, the consumer node, the reducer, or the map-reduce infrastructure will query the mapper's local disk to determine where the corresponding input data for that reducer is located (this is the output data from the map step). Then, at step 2, this data (which will include a file for each mapper that maps to that reducer) is read from the network. This data is then written to disk at step 3 and then read and merged at step 4. This can be performed by merge-sorting the read data. In this step, all of the read files for that particular reducer are merged and sorted. At step 5, the merged and sorted data is written to disk. At step 6, this data is then reduced into one logical file. At step 7, this data is written to disk for the next cycle of map-reduce or for output.
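The merge and reduce steps of the reduce phase can be illustrated with a small sketch. It assumes that each mapper file arrives already sorted by key and uses in-memory lists in place of the disk and network reads; the name `reduce_phase` and the sample data are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_phase(mapper_files, reduce_fn):
    """Merge-sort the per-mapper files, then reduce the values of each key."""
    # heapq.merge lazily merges inputs that are already sorted by key.
    merged = heapq.merge(*mapper_files, key=itemgetter(0))
    return [
        (key, reduce_fn(value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical input: one sorted file per mapper, all destined for this reducer.
mapper_files = [
    [("a", 2), ("c", 1)],
    [("a", 1), ("b", 3)],
]
result = reduce_phase(mapper_files, sum)
# result == [("a", 3), ("b", 3), ("c", 1)]
```

Because the inputs are sorted, the merge only needs to hold one record per mapper file at a time, which is the property the merge-sort step relies on.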
As described above, batch systems such as MapReduce save mapper data on local disk and then save reducer output on HDFS. This guarantees fault tolerance and provides linear scalability. However, performance is degraded by the excessive use of disk IO and by the requirement to publish each MapReduce result to HDFS.
New batch systems like Spark and Tez address some of these deficiencies by eliminating the need to commit intermediary data to HDFS and by optimizing small data shuffle (in-memory).
Map-Reduce and Spark data shuffles use a “pull” model. In Spark, the HDFS-write-read (WR) barrier (from Map-Reduce) is removed, resulting in MRR (Map-Reduce-Reduce), and the Data Exchange logic is contained within each Spark Executor (an executor is an execution device that executes a particular task). Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data. Originally, the total number of files created was M×R, where M is the total number of producers (mappers) and R is the total number of consumers (reducers). Shuffle consolidation improvements were able to decrease this number to C×R, where C is the maximum number of concurrent producers. Even with this change, users often run into the “too many open files” limit when running jobs with non-trivial numbers of reducers. Additionally, Spark originally utilized only a “hash” based shuffle, unlike the “sort” based shuffle of Map-Reduce. This Data Shuffle suffers from costly Java Virtual Machine (JVM) garbage collection.
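The effect of shuffle consolidation on the file count can be checked with simple arithmetic. The sketch below only evaluates the two formulas given above; the function names and the sample values of M, R, and C are hypothetical and not taken from any measured workload.

```python
def unconsolidated_file_count(producers: int, reducers: int) -> int:
    """M x R: every producer writes one file per reducer."""
    return producers * reducers


def consolidated_file_count(concurrent_producers: int, reducers: int) -> int:
    """C x R: files are shared by the producers that run in the same slot."""
    return concurrent_producers * reducers


# Hypothetical job: M = 1000 producers, R = 500 reducers, C = 16 concurrent producers.
before = unconsolidated_file_count(1000, 500)  # 500000 files
after = consolidated_file_count(16, 500)       # 8000 files
```

Even after consolidation, the count in this example remains above commonly configured per-process open-file limits, which is consistent with the “too many open files” problem noted above.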
Tez is a pluggable distributed processing framework. Unlike with Spark, higher level applications have to plug in transformation logic. The Tez Data Shuffle is similar to that of Spark and previously offered an in-memory data shuffle, which was later removed. Similar to Spark, the Data Exchange logic is contained within each Tez Executor. In Tez, the application is responsible for driving the execution logic, including data exchanges.
However, as discussed above, the data exchange logic for both Spark and Tez is contained within each executor. This is not optimal, since a data shuffle framework that is embedded within batch processing engines complicates fault tolerance and prevents effective resource utilization (memory-based caching) and input-output (IO) optimization across multiple executors.
For example, in large jobs, it might be necessary to store shuffle data on disk to deal with potential faults. In this case, persisting (storing) data closer to a consumer executor (an executor executing a job which is a consumer job and receives data from a producer job) would optimize network usage, as data is sent through the network continuously as opposed to in small bursts (e.g., when new consumer tasks start execution). However, this pre-fetch optimization is not done because (for large jobs) the location of consumer task execution is not known to each producer executor a priori.
Additionally, since the data exchange logic is contained within executors, both Spark and Tez rely on static scheduling of tasks to particular executors. This can lead to underutilization of faster processors and inefficient processing of jobs.
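The underutilization caused by static scheduling can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the round-robin static assignment, the "next free executor" dynamic assignment used for comparison, and the task costs and executor speeds. It does not model the actual schedulers of Spark or Tez.

```python
import heapq


def static_makespan(task_costs, speeds):
    """Assign tasks round-robin up front; return when the last executor finishes."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(task_costs):
        executor = i % len(speeds)
        finish[executor] += cost / speeds[executor]
    return max(finish)


def dynamic_makespan(task_costs, speeds):
    """Give each task to whichever executor becomes free first."""
    free_at = [(0.0, executor) for executor in range(len(speeds))]
    heapq.heapify(free_at)
    end = 0.0
    for cost in task_costs:
        time, executor = heapq.heappop(free_at)
        time += cost / speeds[executor]
        end = max(end, time)
        heapq.heappush(free_at, (time, executor))
    return end


# Hypothetical workload: eight equal tasks, one slow and one 4x faster executor.
tasks = [4.0] * 8
speeds = [1.0, 4.0]
static_time = static_makespan(tasks, speeds)    # 16.0
dynamic_time = dynamic_makespan(tasks, speeds)  # 8.0
```

In the static case the fast executor finishes its four tasks at time 4.0 and then sits idle while the slow executor runs until 16.0, which is the kind of underutilization described above.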