The present invention relates to data processing and, more specifically, to boosting big data processing with a software-defined network such as an OpenFlow network.
The Hadoop MapReduce platform is widely used in today's data-intensive applications deployed in various organizations such as governments, enterprises, and research organizations. However, the native task scheduler in Hadoop MapReduce assigns tasks to servers mainly based on server availability, with limited consideration of network conditions. Moreover, MapReduce applications usually involve massive data movement across server racks and thus place high demands on the performance of the underlying network infrastructure. As a consequence, the native Hadoop scheduler suffers from bursty network traffic, network congestion, and degraded application performance.
Existing efforts to enhance the performance of Hadoop MapReduce mainly focus on modifying Hadoop's JobScheduler to take into account data locality and network congestion, or on improving its failure recovery and replication mechanisms. No existing work addresses this problem from the network side.
We approach the problem from a completely new perspective. Specifically, we develop an intelligent network middleware, called FlowComb, that can dynamically and proactively change the routing of network flows to avoid network congestion, based on real-time prediction of data movement between the Mappers, Reducers, and Hadoop Distributed File System (HDFS) nodes. One enabling technology for FlowComb is OpenFlow, which allows in-flight routing reconfiguration of individual flows. FlowComb sits as middleware between the OpenFlow controller and the Hadoop JobTracker. By retrieving data movement information from the Hadoop JobTracker, FlowComb is able to identify network hotspots, change the flow scheduling to resolve congestion, and then notify the OpenFlow controller to enforce the new flow scheduling.
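The control loop described above (predict data movement, identify hotspots, reschedule flows, hand the result to the controller) can be illustrated with a minimal sketch. This is not the FlowComb implementation: the `Flow` class, the `find_hotspots` and `reschedule` functions, the per-link capacity table, and the greedy rerouting policy are all assumptions introduced here for illustration, and no actual Hadoop or OpenFlow API is called.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # hypothetical: a directed link between two switches


@dataclass
class Flow:
    flow_id: str
    demand: float            # predicted rate (e.g. Mbps), assumed to come from the predictor
    paths: List[List[Link]]  # candidate paths, assumed to be known from the topology
    current: int = 0         # index of the path currently in use


def link_loads(flows: List[Flow]) -> Dict[Link, float]:
    """Sum the predicted demand of every flow over the links of its current path."""
    loads: Dict[Link, float] = defaultdict(float)
    for f in flows:
        for link in f.paths[f.current]:
            loads[link] += f.demand
    return loads


def find_hotspots(flows: List[Flow], capacity: Dict[Link, float]) -> List[Link]:
    """Return the links whose predicted load exceeds their capacity."""
    loads = link_loads(flows)
    return [link for link, load in loads.items() if load > capacity[link]]


def reschedule(flows: List[Flow], capacity: Dict[Link, float]) -> Dict[str, int]:
    """Greedy policy (an assumption, not the patented method): visit flows from
    largest to smallest demand and move any flow that crosses a hotspot to the
    candidate path with the lowest bottleneck utilization.
    Returns {flow_id: new path index}, which would be handed to the controller."""
    changes: Dict[str, int] = {}
    for f in sorted(flows, key=lambda fl: fl.demand, reverse=True):
        loads = link_loads(flows)
        on_path = f.paths[f.current]
        if not any(loads[link] > capacity[link] for link in on_path):
            continue  # this flow does not cross a hotspot

        def worst_util(idx: int) -> float:
            # Bottleneck utilization if the flow were placed on path idx.
            worst = 0.0
            for link in f.paths[idx]:
                extra = 0.0 if link in on_path else f.demand
                worst = max(worst, (loads[link] + extra) / capacity[link])
            return worst

        best = min(range(len(f.paths)), key=worst_util)
        if best != f.current and worst_util(best) < worst_util(f.current):
            f.current = best
            changes[f.flow_id] = best
    return changes
```

In this sketch the returned mapping stands in for the notification to the OpenFlow controller; how the new routes are actually installed is outside its scope.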
References:
[1] Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat, “Hedera: Dynamic flow scheduling for data center networks,” in Proc. USENIX NSDI, April 2010.
[2] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris, “Reining in the Outliers in Map-Reduce Clusters using Mantri,” in Proc. USENIX OSDI, 2010.
[3] Sameer Agarwal, Srikanth Kandula, Nicolas Bruno, Ming-Chuan Wu, Ion Stoica, and Jingren Zhou, “Re-optimizing Data Parallel Computing,” in Proc. USENIX NSDI, 2012.
[4] Paolo Costa, Austin Donnelly, Antony Rowstron, and Greg O'Shea, “Camdoop: Exploiting In-network Aggregation for Big Data,” in Proc. USENIX NSDI, 2012.