Distributed computing platforms, such as Hadoop™, include software that allocates computing tasks across a group, or “cluster,” of distributed software components executed by a plurality of computing devices, enabling large data sets to be processed more quickly than is generally feasible with a single software instance or a single device. Such platforms typically utilize a distributed file system that allows input/output (I/O) intensive distributed software components running on a large number (e.g., thousands) of computing devices to access a large quantity (e.g., petabytes) of data. For example, the Hadoop Distributed File System (HDFS) is typically used in conjunction with Hadoop: a data set to be analyzed by Hadoop may be stored as a large file (e.g., petabytes in size) on HDFS, which enables various computing devices running Hadoop software to simultaneously process different portions of the file.
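By way of illustration only, the idea of processing different portions of a data set at the same time can be sketched on a single machine. The sketch below is not Hadoop or HDFS code; it uses Python's standard thread pool as a stand-in for a cluster, and the names `split_into_portions`, `count_words`, and `parallel_word_count` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_portions(records, portion_size):
    """Divide a list of records into consecutive portions.

    Each portion plays the role of one part of a large file that a
    separate worker could process independently (an analogy only).
    """
    return [records[i:i + portion_size]
            for i in range(0, len(records), portion_size)]


def count_words(portion):
    """Work performed independently on a single portion."""
    return sum(len(line.split()) for line in portion)


def parallel_word_count(lines, portion_size=2, workers=4):
    """Count words by processing portions concurrently, then combining."""
    portions = split_into_portions(lines, portion_size)
    # Threads stand in for the computing devices of a cluster here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, portions))
    # The partial results are combined into a single answer.
    return sum(partial_counts)


if __name__ == "__main__":
    sample = ["a b c", "d e", "f", "g h i j", "k"]
    print(parallel_word_count(sample))
```

In an actual distributed platform the portions would reside on different devices and the work would be scheduled by the platform rather than by a local thread pool; the sketch shows only the split, process, and combine pattern.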
Typically, distributed computing platforms such as Hadoop are configured and provisioned in a “native” environment, where each “node” of the cluster corresponds to a physical computing device. In such native environments, administrators typically need to manually configure the settings for the distributed computing platform by generating or editing configuration or metadata files that, for example, specify the names and network addresses of the nodes in the cluster as well as whether any such nodes perform specific functions for the distributed computing platform (e.g., the “JobTracker” or “NameNode” nodes in Hadoop). More recently, service providers that offer “cloud” based “Infrastructure-as-a-Service” (IaaS) offerings have begun to provide customers with Hadoop frameworks as a “Platform-as-a-Service” (PaaS). For example, the Amazon Elastic MapReduce web service, which runs on top of the Amazon Elastic Compute Cloud (Amazon EC2) IaaS service, provides customers with a user interface to (i) provide data for processing and code specifying how the data should be processed (e.g., “Mapper” and “Reducer” code in Hadoop), and (ii) specify a number of nodes in a Hadoop cluster used to process the data. Such information is then utilized by the Amazon Elastic MapReduce web service to start a Hadoop cluster running on Amazon EC2 to process the data.
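As a rough illustration of the manual configuration burden described above, the following sketch models a cluster description that an administrator might otherwise maintain by hand. The file layout (one line per node with name, address, and roles) is a made-up format for this example and is not an actual Hadoop configuration format. The `Node` class and the `validate_cluster` and `render_cluster_file` functions are likewise hypothetical, and the rule that a cluster has exactly one NameNode and one JobTracker is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One cluster node: a name, a network address, and optional roles."""
    name: str
    address: str
    roles: tuple = ()  # e.g. ("NameNode",); empty means a plain worker


def validate_cluster(nodes):
    """Check the kinds of mistakes that manual editing can introduce.

    Assumption of this sketch: node names are unique, and exactly one
    node holds each of the "NameNode" and "JobTracker" roles.
    """
    names = [n.name for n in nodes]
    if len(set(names)) != len(names):
        raise ValueError("duplicate node name")
    for role in ("NameNode", "JobTracker"):
        count = sum(1 for n in nodes if role in n.roles)
        if count != 1:
            raise ValueError(f"expected exactly one {role}, found {count}")
    return True


def render_cluster_file(nodes):
    """Render one 'name address roles' line per node (hypothetical format)."""
    lines = []
    for n in nodes:
        roles = ",".join(n.roles) if n.roles else "worker"
        lines.append(f"{n.name} {n.address} {roles}")
    return "\n".join(lines)


if __name__ == "__main__":
    cluster = [
        Node("master", "10.0.0.1", ("NameNode", "JobTracker")),
        Node("worker1", "10.0.0.2"),
        Node("worker2", "10.0.0.3"),
    ]
    validate_cluster(cluster)
    print(render_cluster_file(cluster))
```

Each node added to or removed from such a cluster requires a corresponding manual change to the description, which is the kind of overhead that a service needing only a requested node count avoids.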
Such PaaS-based Hadoop frameworks, however, are limited, for example, in their configuration flexibility, reliability and robustness, scalability, quality of service (QoS), and security. For example, such frameworks may not address single point of failure (SPoF) issues in the underlying distributed computing platform, such as the SPoF represented by the NameNode in Hadoop. As another example, such frameworks are not known to provide user-selectable templates, such that a preconfigured application environment with a known operating system and support software (e.g., a runtime environment) can be quickly selected and provisioned.