An increasing number of data-intensive distributed applications are being developed to serve various needs, such as processing very large data sets that generally cannot be handled by a single computer. Instead, clusters of computers are employed to distribute various tasks, such as organizing and accessing the data and performing related operations with respect to the data. Various large-scale processing applications and frameworks have been developed to interact with such large data sets, including Hive, HBase, Hadoop, and Spark, among others.
At the same time, virtualization techniques have gained popularity and are now commonplace in data centers and other computing environments in which it is useful to increase the efficiency with which computing resources are used. In a virtualized environment, one or more virtual nodes are instantiated on an underlying physical computer and share the resources of the underlying computer. Accordingly, rather than implementing a single node per host computing system, multiple nodes may be deployed on a host to more efficiently use the processing resources of the computing system. These virtual nodes may include full operating system virtual machines, Linux containers, such as Docker containers, jails, or other similar types of virtual containment nodes.
To deploy the large-scale processing frameworks in a computing environment, administrators and users are often required to manually configure the frameworks to operate on the physical and virtual nodes of a cluster. This manual configuration can be time-consuming and cumbersome, as each version of a processing framework may require different configuration actions, such as determining addressing and computing resource requirements. The difficulty is compounded by the use of edge services, such as Splunk, Graylog, Platfora, or other visualization and monitoring services, which communicate with the large-scale processing framework nodes within the cluster to provide control and feedback to administrators and users associated with the processing cluster. In particular, these edge services may require configuration information not only for the edge service itself, but also for the associated large-scale processing cluster.