An increasing number of data-intensive distributed applications are being developed to serve various needs, such as processing very large data sets that generally cannot be handled by a single computer. Instead, clusters of computers are employed to distribute various tasks, such as organizing and accessing the data and performing related operations with respect to the data. Various large-scale processing applications and frameworks have been developed to interact with such large data sets, including Hive, HBase, Hadoop, and Spark, among others.
At the same time, virtualization techniques have gained popularity and are now commonplace in data centers and other computing environments in which it is useful to increase the efficiency with which computing resources are used. In a virtualized environment, one or more virtual nodes are instantiated on an underlying physical computer and share the resources of the underlying computer. Accordingly, rather than implementing a single node per host computing system, multiple nodes may be deployed on a host to more efficiently use the processing resources of the computing system. These virtual nodes may include full operating system virtual machines, Linux containers (such as Docker containers), jails, or other similar types of virtual containment nodes.
To deploy the large-scale processing frameworks in a computing environment, administrators and users are often required to manually configure the frameworks to operate on the physical and virtual nodes of a cluster. This manual configuration can be time-consuming and cumbersome, as each version of each processing framework may require different configuration actions, such as determining addressing and computing resource requirements. This configuration difficulty is further compounded by the use of edge services, such as Splunk, Graylog, Platfora, or some other visualization and monitoring service, which communicate with the large-scale processing framework nodes within the cluster to provide control and feedback to administrators and users associated with the processing cluster. In particular, these edge services may require configuration information not only for the edge service itself, but also for the associated large-scale processing cluster.
Overview
The technology disclosed herein provides enhancements for generating large-scale processing framework (LSPF) images for deployment in processing environments. In one implementation, a method of preparing LSPF service images for large-scale data processing environments includes identifying a first LSPF service image for an LSPF service, and identifying metadata that defines runtime requirements for deploying the LSPF service in data processing environments. The method further includes generating scripts for deploying the LSPF service based on the metadata, and generating a second LSPF service image for the LSPF service, wherein the second LSPF service image includes the scripts.
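As a non-limiting illustration, the four operations of the method described above might be sketched as follows. The sketch is in Python and models an image as an in-memory mapping of file paths to contents. The names ServiceImage, generate_scripts, and build_second_image, the metadata keys, and the script layout are all hypothetical and chosen only for illustration; an actual implementation may represent images, metadata, and scripts quite differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceImage:
    """Hypothetical stand-in for an LSPF service image."""
    service: str
    files: dict = field(default_factory=dict)  # file path -> file contents


def generate_scripts(service, metadata):
    """Render one deployment script from runtime-requirement metadata.

    The metadata keys and the LSPF_ variable prefix are assumptions
    made for this sketch only.
    """
    lines = ["#!/bin/sh", f"# deploy {service}"]
    for key in sorted(metadata):
        lines.append(f"export LSPF_{key.upper()}={metadata[key]}")
    return {f"deploy/{service}.sh": "\n".join(lines) + "\n"}


def build_second_image(first, metadata):
    """Return a new image holding the first image's files plus the scripts."""
    scripts = generate_scripts(first.service, metadata)
    return ServiceImage(first.service, {**first.files, **scripts})


# Identify a first image and its metadata, then derive the second image.
first_image = ServiceImage("spark", {"bin/spark": "<binary>"})
metadata = {"memory_mb": 4096, "ports": "7077,8080"}
second_image = build_second_image(first_image, metadata)
```

In this sketch the first image is left unmodified and the second image is a separate object, which mirrors the distinction the method draws between the first and second LSPF service images.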
This Overview is provided to introduce a selection of concepts in a simplified form that are further described below in the Technical Disclosure. It should be understood that this Overview is not intended to identify key features or essential features of the claimed subject matter, nor should it be used to limit the scope of the claimed subject matter.