“Big Data” is often used to refer to a large amount of data. Big Data provides an enhanced approach for business insights. Business insights may be described as spotting problems and opportunities from big data. The Apache™ Hadoop® software framework may be described as a software framework that supports Big Data and that consists of a Hadoop Distributed File System (HDFS™), an Apache™ Hadoop® file system, and Map-Reduce processes (which is a programming model for data processing). (Apache, Hadoop, and Hadoop Distributed File System (HDFS) are trademarks or registered trademarks of the Apache Software Foundation in the United States and/or other countries.) Since the volume of data that the Apache™ Hadoop® software framework handles may be of the internet scale, data should be moved in and out of the HDFS™ efficiently. In many usage scenarios, an Extract, Transform, and Load (ETL) tool is used to bridge the Apache™ Hadoop® software framework and data sources (Relational Database Management Systems (RDBMSs), flat files, etc.), and transform and enrich the data as it flows through.
A parallel application (e.g., an ETL tool) might invoke multiple processes to carry out a task. The defined task may be referred to as a job. The processes of a job of the parallel application may run in parallel in a computer or in multiple computers in a cluster. The job may interact with HDFS™ through an HDFS™ connector. The HDFS™ connector may run in its own process. Multiple instances of the HDFS™ connector may be invoked for parallel execution. When the connector of the parallel application runs in the data source's nodes, it is said to be running in local mode, whereas, when the connector of the parallel application runs outside of the data source's nodes, it is said to be running in remote mode.
To understand the performance characteristics of the parallel application, assume that a cluster is arranged with 6 nodes, 1 dedicated for the parallel application, 1 dedicated for a name node, and 4 for data nodes. Each of the data nodes stores data. The name node maintains a directory tree of the files in the HDFS™ and tracks where across a cluster the file data is kept. The parallel application contacts the name node to access a file, and the name node returns a list of the one or more data nodes that store the file data. When the parallel application runs in remote mode, the parallel application only runs in the dedicated computer for this parallel application, whereas, when the parallel application runs in local mode, the HDFS™ connector of the parallel application will run in the data nodes.
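The name node lookup described above can be illustrated with a minimal sketch. This is a toy in-memory model, not the HDFS™ client API: the class `NameNode`, the helper `data_nodes_for`, and the file path, block identifiers, and node names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy stand-in for a name node (hypothetical, not the HDFS API)."""
    # file path -> ordered list of (block_id, [data nodes holding a replica])
    block_map: dict = field(default_factory=dict)

    def add_file(self, path, blocks):
        # Record which data nodes hold each block of the file.
        self.block_map[path] = list(blocks)

    def locate(self, path):
        # Return the per-block list of data nodes for the file.
        return self.block_map[path]


def data_nodes_for(name_node, path):
    """Collect every data node that stores at least one block of the file."""
    nodes = set()
    for _block_id, holders in name_node.locate(path):
        nodes.update(holders)
    return sorted(nodes)


# Illustrative file whose blocks are spread over three data nodes.
name_node = NameNode()
name_node.add_file("/data/orders", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3"])])
file_nodes = data_nodes_for(name_node, "/data/orders")
```

In this sketch the application only learns where the blocks are; the name node itself returns no file data, which matches the division of roles described above.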
When running in local mode (that is, running the HDFS™ connector of the parallel application in the data nodes), read operations may incur excessive network and Input/Output (I/O) activity because data locality is not guaranteed.
For example, for write operations, when running the parallel application in local mode, the total CPU utilization of the data nodes is 211%, compared to 56% when the parallel application is running in remote mode; whereas the CPU utilization of the dedicated node for the parallel application goes down from saturation (90%) to only 1%. This is an indication that the parallel application offloaded the work to the data nodes by running the HDFS™ connector of the parallel application in the data nodes. The network write throughput increased, indicating an increase of total workload throughput. Also, for read operations, network and disk Input/Output (I/O) activities increase. This is caused by the fact that data locality cannot be guaranteed when running the HDFS™ connector in the data nodes. The performance results were impacted by these system resource utilization patterns.
As for the performance comparison of the parallel application when running in remote mode versus local mode, running the HDFS™ connector in local mode helps the performance of the parallel application for write operations, but degrades performance for read operations. For example, for write operations running the HDFS™ connector in local mode, performance improved 64% and 44% for 20 and 40 concurrent writers respectively. However, for read operations running the HDFS™ connector in local mode, performance is 19% and 25% as running in remote mode for 20 and 40 concurrent readers respectively. This defeats the purpose of running the HDFS™ connector in the data source, which is to offload some work to the data source and improve performance.
Based on the aforementioned analysis, the poor read performance of running the HDFS™ connector in local mode is caused by excessive network and disk I/O, which occurs because an HDFS™ connector instance does not always find the data blocks that it reads in the data node where that instance runs.
To complete a task, an application may send a sequence of jobs to run in a cluster. A job of a parallel application or some parts of the job may run in parallel across multiple nodes with data partitioning. The job may produce data that is consumed later by one or more downstream jobs. The data is saved as a file in HDFS™.
Where the jobs are run in the cluster may be determined based on certain constraints and resource management policies of a workload management system. Examples of constraints that limit the nodes on which a job can run may be: 1) whether the job needs to access a remote database, and 2) whether the job must run in the nodes that are enabled for accessing a remote database. Since system resource utilization keeps changing dynamically, there is no guarantee that a downstream job that consumes the file runs in the same nodes as the job that produces the file. If the downstream job runs in different nodes, then the downstream job may not be able to retrieve the data blocks locally, which may incur excessive network and I/O operations and cause poor performance. This is sometimes referred to as a data locality issue.
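The data locality issue can be made concrete with a small sketch that classifies each block read as local or remote, given where the block replicas are stored and where the downstream job's readers were placed. The function `classify_reads` and all block and node names are hypothetical, and the placements are made-up illustrations rather than measured data.

```python
def classify_reads(block_locations, reader_assignment):
    """Split block reads into local and remote.

    block_locations: block_id -> set of nodes that hold a replica.
    reader_assignment: block_id -> node on which the reader of that block runs.
    """
    local, remote = [], []
    for block_id, reader_node in reader_assignment.items():
        if reader_node in block_locations[block_id]:
            local.append(block_id)   # replica available on the reader's node
        else:
            remote.append(block_id)  # block must be fetched over the network
    return local, remote


# Blocks written by an upstream job (illustrative placement).
block_locations = {"blk_1": {"dn1"}, "blk_2": {"dn2"}, "blk_3": {"dn1", "dn3"}}

# Downstream readers placed on the nodes that hold the blocks.
same_nodes = {"blk_1": "dn1", "blk_2": "dn2", "blk_3": "dn3"}
# Downstream readers placed elsewhere by the workload management system.
moved_nodes = {"blk_1": "dn3", "blk_2": "dn4", "blk_3": "dn3"}

local_same, remote_same = classify_reads(block_locations, same_nodes)
local_moved, remote_moved = classify_reads(block_locations, moved_nodes)
```

With the second placement, two of the three blocks have to be read from another node, which is the source of the extra network and I/O activity.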
Certain conventional systems guarantee that a data block is read locally by querying the name node to obtain the location of the block, then sending the task that reads the block to that known location. However, this may not work for parallel applications, because an operator handles the data of a whole partition, which may be stored in multiple data blocks across multiple nodes in the cluster. The operator remains in the same container (or logical resource) and the same data node for the duration of processing all of the data of a partition.
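A short sketch shows why per-block placement does not carry over to a partition that spans several blocks. It assumes a simple heuristic, not described in the text, of pinning the operator to the node that holds the most blocks of the partition; `best_node_for_partition` and the block and node names are hypothetical.

```python
from collections import Counter


def best_node_for_partition(partition_blocks, block_locations):
    """Pick the single node holding the most blocks of a partition.

    Returns (node, local_block_count, remote_block_count).
    Assumes the partition has at least one block with a known location.
    """
    counts = Counter()
    for block_id in partition_blocks:
        for node in block_locations[block_id]:
            counts[node] += 1
    # Sort first so that ties are broken deterministically by node name.
    node, local = max(sorted(counts.items()), key=lambda item: item[1])
    return node, local, len(partition_blocks) - local


# One partition stored as three blocks on different nodes (illustrative).
block_locations = {"b1": {"dn1"}, "b2": {"dn1", "dn2"}, "b3": {"dn3"}}
partition = ["b1", "b2", "b3"]

node, local_blocks, remote_blocks = best_node_for_partition(partition, block_locations)
```

Even with the best single node chosen, one block of this partition is still remote, because the operator stays on one node while the partition's blocks do not.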