An increasing number of data-intensive distributed applications are being developed to serve various needs, such as processing very large data sets that generally cannot be handled by a single computer. Instead, clusters of computers are employed to distribute various tasks, such as organizing and accessing the data and performing related operations with respect to the data. Various applications and frameworks have been developed to interact with such large data sets, including Hive, HBase, Hadoop, Amazon S3, and CloudStore, among others.
At the same time, virtualization techniques have gained popularity and are now commonplace in data centers and other computing environments in which it is useful to increase the efficiency with which computing resources are used. In a virtualized environment, one or more virtual machines are instantiated on an underlying host computer (or another virtual machine) and share the resources of the underlying computer. However, deploying data-intensive distributed applications across clusters of virtual machines has generally proven impractical due to the latency associated with feeding large data sets to the applications.
In some examples, virtual data processing nodes on host computing systems may operate independent of the required data storage repositories. Accordingly, any of the processing nodes within the environment may be used to process data from any of the storage repositories within the system. However, as the environments become more complex with more computing systems and data storage locations, inefficiencies may arise in the allocation of virtual nodes and job processes to the host computing systems.