The present application relates generally to improving the throughput of a multi-server processing system. It finds particular application in conjunction with task scheduling in distributed compute systems using a map-reduce framework, and will be described with particular reference thereto. However, it is to be appreciated that the present application is also amenable to other like applications.
Map-reduce frameworks are a key technology for implementing big data applications. In these frameworks, a computational job is broken down into map and reduce tasks. The tasks are then allocated to a set of nodes (e.g., servers) so the tasks can be done in parallel. A map task processes a data block and generates a result for this block. A reduce task takes all these intermediate mapping results and combines them into the final result of the job.
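By way of illustration only, the division of a job into map and reduce tasks can be sketched as follows. The word-count job, the function names `map_task` and `reduce_task`, and the sample blocks are hypothetical choices made for this sketch and are not taken from any particular framework.

```python
from collections import Counter
from functools import reduce


def map_task(block: str) -> Counter:
    """Process one data block and return the intermediate result for it."""
    return Counter(block.split())


def reduce_task(intermediate: list) -> Counter:
    """Combine all intermediate mapping results into the final job result."""
    return reduce(lambda left, right: left + right, intermediate, Counter())


# Hypothetical job: count words over three data blocks.
blocks = ["a b a", "b c", "a c c"]

# Each map task depends only on its own block, so a framework is free to
# run the map tasks in parallel on different nodes.
partials = [map_task(block) for block in blocks]

# The reduce task needs every intermediate result before it can finish.
word_counts = reduce_task(partials)
```

In this sketch the map tasks are run sequentially in one process; a framework would instead allocate them to separate nodes.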
A popular map-reduce framework is HADOOP® (a registered trademark of the Apache Software Foundation). HADOOP® comprises a storage solution known as HADOOP® Distributed File System (HDFS), which is an open source implementation of the Google File System (GFS). HDFS is able to store large files across several machines, and using MapReduce, such files can be processed in a distributed fashion, moving the computation to the data, rather than the data to the computation. An increasing number of so-called “big data” applications, including social network analysis, genome sequencing, and fraud detection in financial transaction data, require horizontally scalable solutions, and have demonstrated the limits of relational databases.
A HADOOP® cluster includes a NameNode (i.e., a node that keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept, but does not store the data itself) and many DataNodes (i.e., nodes that store data). When a file is copied into the cluster, it is divided into blocks of, for example, 64 megabytes (MB). Each block is stored on three or more DataNodes depending on the replication policy of the cluster, as shown in FIG. 1. Once the data is loaded, computational jobs can be executed over it. New jobs are submitted to the NameNode, where map and reduce tasks are scheduled onto the DataNodes, as shown in FIG. 2.
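As a rough illustration of the block division and replication described above, the following sketch computes the number of blocks for a file and assigns each block to three DataNodes. The round-robin placement, the node names, and the helper names `num_blocks` and `place_replicas` are assumptions made for this sketch; an actual cluster applies its own replication policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the example block size given above
REPLICATION = 3                # example replication factor


def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def place_replicas(n_blocks: int, data_nodes: list,
                   replication: int = REPLICATION) -> dict:
    """Map each block index to the DataNodes holding a copy of it.

    Assumption: simple round-robin placement, used here only so that every
    block lands on `replication` distinct nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)]
                for r in range(replication)]
        for block in range(n_blocks)
    }


# Hypothetical cluster of five DataNodes storing a 200 MB file.
nodes = ["dn0", "dn1", "dn2", "dn3", "dn4"]
n = num_blocks(200 * 1024 * 1024)
placement = place_replicas(n, nodes)
```

The resulting `placement` table plays the role of the block-location bookkeeping that the NameNode keeps, without holding any file data itself.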
Job scheduling is illustrated at a high level in FIG. 3. With reference thereto, NameNode 310 splits a job 330 into tasks 340. The tasks 340 are then assigned to individual DataNodes 320. There may be a multitude of DataNodes 320, and, in one embodiment, the number of DataNodes is in the range of tens to thousands.
A map task processes one block and generates a result for this block, which gets written back to the storage solution. The NameNode will schedule one map task for each block of the data, and it will do so by selecting one of the three DataNodes that are storing a copy of that block to avoid moving large amounts of data over the network. A reduce task takes all these intermediate mapping results and combines them into the final result of the job.
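The data-local assignment of map tasks described above can be sketched as follows. The text states only that one of the DataNodes storing a copy of the block is selected; the tie-breaking rule used here (choose the replica holder with the fewest tasks assigned so far) and the name `schedule_map_tasks` are assumptions made for this sketch.

```python
def schedule_map_tasks(block_replicas: dict) -> dict:
    """Assign one map task per block to a DataNode that already stores it.

    block_replicas maps a block id to the list of DataNodes holding a copy.
    Assumption: among the replica holders, the node with the fewest tasks
    assigned so far is chosen; ties go to the first holder listed.
    """
    load = {}        # tasks assigned so far, per DataNode
    assignment = {}  # block id -> chosen DataNode
    for block_id in sorted(block_replicas):
        holders = block_replicas[block_id]
        chosen = min(holders, key=lambda node: load.get(node, 0))
        assignment[block_id] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


# Hypothetical replica locations for three blocks.
replicas = {
    0: ["dn0", "dn1", "dn2"],
    1: ["dn0", "dn1", "dn3"],
    2: ["dn0", "dn2", "dn3"],
}
assignment = schedule_map_tasks(replicas)
```

Because every task is placed on a node that already holds its block, no block has to be moved over the network before its map task starts. Note that this sketch, like the frameworks discussed below, treats all DataNodes as equally capable.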
One challenge with map-reduce frameworks, such as HADOOP®, is that most frameworks assume a homogeneous cluster of nodes (i.e., that all compute nodes in the cluster have the same hardware and software configuration) and assign tasks to servers regardless of their capabilities. However, heterogeneous clusters are prevalent. As nodes fail, they are typically replaced with newer hardware. Further, research has shown benefits to heterogeneous clusters, as compared to homogeneous clusters (see, e.g., Saisanthosh Balakrishnan, Ravi Rajwar, Mike Upton, and Konrad Lai. 2005. The Impact of Performance Asymmetry in Emerging Multicore Architectures. In Proceedings of the 32nd annual international symposium on Computer Architecture (ISCA '05). IEEE Computer Society, Washington, D.C., USA, 506-517). Intuitively, more specialized hardware can better suit a variety of differing job resource profiles. By failing to account for heterogeneity, known map-reduce frameworks are not able to match jobs to the best compute nodes, consequently compromising global metrics, such as throughput or maximum delay.
Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI'08). USENIX Association, Berkeley, Calif., USA, 29-42, investigates scheduling issues in heterogeneous clusters. However, it does not characterize HADOOP® jobs, but rather proposes a scheduling strategy that speculatively launches redundant copies of those tasks that are projected to run longer than any other.
Further, while tasks belonging to the same job are very similar to each other in terms of their individual resource profiles, tasks belonging to different jobs can have very different resource requirements (e.g., differing in the degree to which they utilize a central processing unit (CPU), memory, disk input/output (I/O), network I/O, and so forth). Jobs may also have certain service level requirements. Known map-reduce frameworks do not efficiently schedule tasks to satisfy service level requirements while optimally utilizing available resources.
The present application provides a new and improved system and method which overcome the above-referenced problems and others.