An increasing number of data-intensive distributed applications are being developed to serve various needs, such as processing very large data sets that generally cannot be handled by a single computer. Instead, clusters of computers are employed to distribute various tasks, such as organizing and accessing the data and performing related operations with respect to the data. Various applications and frameworks have been developed to interact with such large data sets, including Hive, HBase, Hadoop™, Amazon S3, and CloudStore™, among others.
At the same time, virtualization techniques have gained popularity and are now commonplace in data centers and other environments in which it is useful to increase the efficiency with which computing resources are used. In a virtualized environment, one or more virtual machines are instantiated on an underlying computer (or another virtual machine) and share the resources of the underlying computer.
However, when a single processing job runs on multiple host systems of a virtual data-processing environment, configuring the applicable host systems and virtual machines is complex. The configuration becomes even more complex when the job runs on multiple virtual machines across those host systems. Consequently, small variations or errors in the configuration may result in delayed and/or improper processing of the job. Further, the job may need to obtain data from a plurality of storage systems, which may differ in capacity, in the object access protocols they use, and in the format and/or content of the access credentials they require.
Overview
A method of operating a cache service to interface between a virtual machine cluster and job data associated with a job executed by the virtual machine cluster includes identifying a request initiated by the virtual machine cluster to access at least a portion of the job data in accordance with a first distributed object access protocol. The method further includes, in response to the request, accessing at least the portion of the job data in accordance with a second distributed object access protocol, and presenting at least the portion of the job data to the virtual machine cluster in accordance with the first distributed object access protocol.
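The three steps of the method (identify the request, access the data under a second protocol, present it under the first) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the description above: the first protocol is modeled as a file-style read (path, offset, length), the second as a bucket/key object fetch, and `CacheService`, `FileReadRequest`, and `InMemoryObjectStore` are hypothetical names. A real cache service would speak actual wire protocols and handle credentials, eviction, and errors.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class InMemoryObjectStore:
    """Hypothetical stand-in for a store reached via the second protocol.

    Objects are addressed by (bucket, key) and always returned whole.
    """

    def __init__(self, objects: Dict[Tuple[str, str], bytes]):
        self._objects = objects
        self.calls = 0  # number of backend fetches, for illustration only

    def get_object(self, bucket: str, key: str) -> bytes:
        self.calls += 1
        return self._objects[(bucket, key)]


@dataclass
class FileReadRequest:
    """Hypothetical first-protocol request: a ranged read on a path."""
    path: str
    offset: int
    length: int


class CacheService:
    """Sketch of a cache service sitting between the cluster and the store."""

    def __init__(self, backend: InMemoryObjectStore, bucket: str):
        self._backend = backend
        self._bucket = bucket
        self._cache: Dict[str, bytes] = {}

    def read(self, request: FileReadRequest) -> bytes:
        # Step 1: identify the request issued under the first protocol
        # and map its path onto a second-protocol object key (assumed mapping).
        key = request.path.lstrip("/")

        # Step 2: access the data under the second protocol,
        # but only if it is not already held in the cache.
        if key not in self._cache:
            self._cache[key] = self._backend.get_object(self._bucket, key)
        data = self._cache[key]

        # Step 3: present the requested portion in first-protocol form,
        # here the byte range the caller asked for.
        return data[request.offset:request.offset + request.length]


store = InMemoryObjectStore({("jobs", "input/part-0000"): b"hello world"})
service = CacheService(store, bucket="jobs")
first = service.read(FileReadRequest("/input/part-0000", 6, 5))
second = service.read(FileReadRequest("/input/part-0000", 0, 5))
```

In this sketch the caller only ever uses the file-style interface, so the virtual machine cluster would need no knowledge of the store's addressing scheme; the second read is served from the cache without another backend fetch.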