Big data analytics frameworks like Hadoop are employed in environments where the data is constantly growing and changing. For example, Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop consists of the Hadoop Common package, which provides file system and OS-level abstractions, a MapReduce engine, and the Hadoop Distributed File System (HDFS). For effective scheduling of work, every Hadoop-compatible file system should provide location awareness. Hadoop applications can use this information to run work on the compute node where the data resides, reducing backbone traffic. HDFS uses this method when replicating data, attempting to keep the different copies of the data on different nodes. In multi-tenant environments, as the number of nodes and/or users in the cluster increases, it becomes increasingly difficult to achieve this data locality.
Broadly speaking, the MapReduce programming model is divided into three distinct phases: Map, Shuffle, and Reduce. A distributed file system such as HDFS is usually employed in conjunction with the MapReduce framework: data is read from HDFS during the Map phase, and results are written back to HDFS at the tail end of the Reduce phase. The data produced during the Shuffle phase is usually termed intermediate data and is typically housed in local file systems on the nodes of the Hadoop cluster. HDFS splits a file into pre-configured, fixed-size chunks (usually 64 MB or 128 MB), and these chunks are distributed across the nodes of the cluster in a uniform fashion. Usually three copies of each chunk are made to achieve high availability. In certain cases, more copies are made in order to achieve higher data locality while scheduling jobs.
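The fixed-size chunking described above can be sketched as follows. This is a minimal illustration, not HDFS's actual implementation; the chunk size and replication factor shown are the typical defaults mentioned above.

```python
# Illustrative sketch of HDFS-style fixed-size chunking: a file is split
# into chunks of a pre-configured size (128 MB here), and each chunk is
# stored with a replication factor (typically three) for availability.
CHUNK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3

def split_into_chunks(file_size_bytes):
    """Return the list of (offset, length) chunks for a file of the given size."""
    chunks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(CHUNK_SIZE, file_size_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

# A 300 MB file yields three chunks: 128 MB, 128 MB, and a 44 MB remainder.
chunks = split_into_chunks(300 * 1024 * 1024)
```

Note that only the last chunk may be smaller than the configured size; all others are exactly `CHUNK_SIZE` bytes.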
Several techniques have been suggested for improving data locality in big data analytics environments, ranging from "delay scheduling" to increasing the number of replicas (copies). "Delay scheduling" suggests waiting for a currently running task to finish on a node that holds the data, rather than scheduling the new task on a node that is immediately available but does not have the data. This wastes processing cycles. Increasing the number of replicas is yet another technique; however, it comes at the cost of increased storage.
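The delay-scheduling idea can be illustrated with the following sketch. This is a hypothetical simplification, not the actual Hadoop scheduler: when a node frees up, a task is placed there only if the node holds the task's data; a non-local task is skipped for up to `max_delay` scheduling rounds before locality is given up.

```python
# Hypothetical sketch of delay scheduling. Each pending task records the
# set of nodes holding its input data ("data_nodes") and how many times
# it has been passed over ("skips").
def pick_task(free_node, pending_tasks, max_delay):
    """Choose a task for free_node, preferring data-local placement."""
    # First pass: schedule a task whose data is on the free node.
    for task in pending_tasks:
        if free_node in task["data_nodes"]:
            return task
    # Second pass: a task that has waited too long runs non-locally.
    for task in pending_tasks:
        task["skips"] = task.get("skips", 0) + 1
        if task["skips"] > max_delay:
            return task
    # Otherwise keep waiting for a local slot (the wasted cycles noted above).
    return None
```

The `max_delay` threshold trades locality against idle time: a larger value yields more data-local tasks but leaves freed nodes unused for longer.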
Fixed, large-size chunking also means that even if more compute resources are available, they cannot be used to speed up jobs. As an extreme example, consider a file with a single chunk of size 128 MB. Since this file is replicated three times, it can reside on at most three data nodes. The three copies allow for flexibility in choosing among the three nodes available for scheduling. However, the maximum number of compute resources the file can use is a single compute node, even if the cluster comprises many more nodes.
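The parallelism limit in the example above reduces to simple arithmetic: the number of map tasks, and hence usable compute nodes, is bounded by the number of chunks rather than by the cluster size. A small worked example:

```python
import math

def max_parallel_nodes(file_size_mb, chunk_size_mb, cluster_nodes):
    """Upper bound on compute nodes a single file's job can occupy:
    one map task per chunk, capped by the cluster size."""
    num_chunks = math.ceil(file_size_mb / chunk_size_mb)
    return min(num_chunks, cluster_nodes)

# A single 128 MB file on a 100-node cluster can still use only 1 node.
assert max_parallel_nodes(128, 128, 100) == 1
# A 300 MB file splits into three chunks, so at most 3 nodes are usable.
assert max_parallel_nodes(300, 128, 100) == 3
```

Replication raises the number of candidate nodes for each task, but not the number of tasks; only smaller (or dynamic) chunks increase the achievable parallelism.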
FIG. 1 (prior art) illustrates a typical Hadoop cluster environment 100 with HDFS fixed-size chunking. Hadoop cluster environment 100 comprises a network switch/fabric 110, a first client terminal 101, a second client terminal 102, a master device 120, and a plurality of slave devices 121-123. The master device 120 comprises a control node 130 and a name node 140. Each slave device 121-123 comprises a compute node 131-133 and a data node 141-143, respectively. All the nodes are connected via an Ethernet network by the network switch/fabric 110. From a system architecture point of view, control node 130 and compute nodes 131-133 form a MapReduce layer, while name node 140 and data nodes 141-143 form an HDFS layer. On the control node, the job tracker is responsible for scheduling and monitoring jobs via the scheduler. The name node presents the interface to the client terminals for writing and reading data to/from the HDFS layer, as well as for submitting jobs. The compute nodes provide the computing resources for executing jobs, and the data nodes provide the storage space for storing files and data. As explained earlier, HDFS splits a file into pre-configured, fixed-size chunks (e.g., 128 MB), and these chunks are distributed across the three data nodes in a uniform fashion. For example, file F1 consists of three chunks {1, 2, 3}, and file F2 consists of two chunks {4, 5}. Note that in this particular example, the chunks do not have three copies, so as to keep the example simple.
FIG. 1 also illustrates the logical flow when jobs are submitted. Client terminal 101 has submitted JOB1 associated with an input file F1, whereas client terminal 102 has submitted JOB2 associated with an input file F2. The job tracker accepts the jobs and schedules them to be run as different tasks on the compute nodes. The tasks work in conjunction with the job tracker, reporting task status as well as starting new tasks. The scheduler, in conjunction with the job tracker, tries to schedule tasks on the compute nodes where the data lies. In the example of FIG. 1, data node 141 stores chunks 1, 3, and 4, data node 142 stores chunks 2, 5, and 1, and data node 143 stores chunks 3, 4, and 5. As a result, JOB1 has three tasks 1A, 1B, and 1C: task 1A and task 1B are scheduled on compute node 131, and task 1C is scheduled on compute node 132. Similarly, JOB2 has two tasks 2A and 2B: task 2A is scheduled on compute node 132, and task 2B is scheduled on compute node 133. However, if more compute nodes are available, they cannot be used to speed up the jobs. The number of nodes that can be used for a job/file is limited to the number of chunks the file has.
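The locality-first placement described for FIG. 1 can be sketched as follows. This is a hypothetical helper, not Hadoop's JobTracker API: each task processes one chunk and is assigned to a node that stores that chunk.

```python
def schedule_tasks(jobs, chunk_map):
    """jobs: {job_name: [chunk_id, ...]}; chunk_map: {chunk_id: [nodes holding it]}.
    Returns {task_name: node}, picking a data-local node for each task."""
    placement = {}
    for job, chunks in jobs.items():
        for i, chunk in enumerate(chunks):
            task = f"{job}-{i}"
            placement[task] = chunk_map[chunk][0]  # first node holding the chunk
    return placement

# Chunk layout from FIG. 1: data node 141 holds {1, 3, 4}, data node 142
# holds {2, 5, 1}, and data node 143 holds {3, 4, 5}. Each data node 14k
# is co-located with compute node 13k on the same slave device.
chunk_map = {1: [141, 142], 2: [142], 3: [141, 143], 4: [141, 143], 5: [142, 143]}
jobs = {"JOB1": [1, 2, 3], "JOB2": [4, 5]}
placement = schedule_tasks(jobs, chunk_map)
```

Because every task is pinned to a chunk holder, a job's task count (and thus its node count) can never exceed its chunk count, which is exactly the limitation noted above.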
Therefore, there exists a need for a method and apparatus to virtualize a file into dynamic chunks, instead of the fixed-size chunks used in distributed file systems today.