The present invention relates generally to large-scale data-parallel processing, and more particularly, to computer-guided holistic optimization of MapReduce applications.
MapReduce is a commonly used programming model for performing large-scale data-parallel computations on commodity server clusters. The MapReduce API allows developers to specify data operations as map or reduce functions for data transformation and aggregation, respectively. The actual mapping of data and code to the nodes in the distributed system is handled autonomously by the framework/runtime. Improving the runtime has, therefore, been an active area of research. HADOOP (HADOOP is a registered trademark of The Apache Software Foundation, Forest Hill, Md., USA) is the most popular open-source framework/runtime for MapReduce. It powers numerous web services including FACEBOOK, TWITTER, NETFLIX, AMAZON and YAHOO, among others. (FACEBOOK is a registered trademark of Facebook, Inc., located in Palo Alto, Calif., USA; TWITTER is a registered trademark of Twitter, Inc., located in San Francisco, Calif., USA; AMAZON is a registered trademark of Amazon, Inc., located in Las Vegas, Nev., USA; YAHOO is a registered trademark of Yahoo, Inc., located in Sunnyvale, Calif., USA.)
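By way of illustration only, the division of labor described above can be sketched in a few lines of Python. The sketch is a single-process simulation and does not use the HADOOP API; the names `map_words`, `reduce_counts` and `run_job` are hypothetical and chosen for this example. The developer supplies the map and reduce functions, while `run_job` stands in for the framework/runtime, which in a real deployment would distribute the data and code across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Developer-defined map function: transform one input record
    into zero or more (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Developer-defined reduce function: aggregate all values
    that share the same key."""
    return key, sum(values)


def run_job(
    records: Iterable[str],
    map_fn: Callable[[str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical stand-in for the framework/runtime. It applies the
    map function to every record, groups the intermediate values by key,
    and applies the reduce function to each group, all in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Example input: two text records.
counts = run_job(["a b a", "b c"], map_words, reduce_counts)
print(counts)
```

In this sketch the developer never specifies where a record is processed or how intermediate pairs are grouped; only the two functions are application-specific, which is why the quality of their definition bears directly on application performance.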
Despite advances in the underlying implementations of MapReduce (e.g., HADOOP), the opportunities for optimizing the applications themselves remain largely unexplored. As mentioned before, map and reduce functions are the main building blocks of a MapReduce application and are defined by the developer. A better definition of these functions can lead to better performance. Although new APIs for improving performance are proposed from time to time, it is up to the developer to make use of them. Because doing so requires a deep understanding of the APIs as well as considerable programming, debugging and testing effort, developers often miss opportunities for performance improvement. We call these missed opportunities “performance bugs”. In addition to the application code itself, the numerous parameters (more than 150 for HADOOP) that need to be tuned for a given cluster configuration are often left unoptimized, causing further performance degradation.
To the best of their knowledge, Applicants are not aware of any prior work on automatically fixing performance bugs in MapReduce/HADOOP applications. So far, the focus has been on improving runtime performance and on proposing new library extensions/APIs to be used by developers. Other efforts attempt to optimize iterative MapReduce applications through library extensions and define APIs for writing iterative algorithms. In contrast, the inventive technique herein identifies and formulates a compiler optimization that is independent of the implementation of the map and reduce functions and automatically transforms a legacy MapReduce application, yielding up to a 3× speedup without user involvement.
Accordingly, there is a need for a method for automatically fixing performance bugs in MapReduce/HADOOP applications.