Hadoop is typically used to process large data sets across clusters of independent machines, and it has become one of the leading “big data” platforms for storing large quantities of unstructured data and supporting a range of tools and analytic functions.
Existing approaches to deploying virtualized Hadoop systems have certain limitations. For instance, Amazon Elastic MapReduce (EMR) is a web service that uses Hadoop to distribute data and processing across a resizable cluster of Amazon Elastic Compute Cloud (EC2) instances. However, Amazon EMR does not let users customize the deployment of Hadoop clusters or control the allocation of underlying resources, leading to undesirable inefficiencies.

In addition, because of the characteristics of the underlying physical storage systems in a virtualized computing system and the way virtual disks are generally allocated, data I/O access (e.g., write efficiency) in such a system, especially for big data workloads, has not been fully optimized. Consider, for example, a physical storage system that uses rotational disk drives. Tracks in different areas of such a drive's surface provide different I/O throughputs: an outer track has a longer circumference and therefore holds more sectors, and since the platter rotates at a constant angular speed, more data passes under the head per revolution. As a result, an outer track (corresponding to low logical block addresses (LBAs)) delivers higher I/O throughput than an inner track (corresponding to high LBAs). Existing approaches nevertheless fail to effectively exploit the tracks with higher I/O throughput, because they focus instead on preventing contention among hosts in a virtualized computing system.
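The LBA-to-throughput relationship can be observed empirically by timing sequential reads at a low and a high byte offset of a device or large file. The following is a minimal measurement sketch; the helper name is hypothetical, the target path must be supplied by the user, and it deliberately omits cache-bypass handling (a rigorous benchmark would open the raw device with O_DIRECT and drop the page cache between runs):

```python
import os
import time


def sequential_read_throughput(path, offset, size, block=1 << 20):
    """Read `size` bytes starting at byte `offset` and return MB/s.

    On a raw rotational disk, low offsets (outer tracks, low LBAs)
    typically yield higher throughput than high offsets (inner tracks).
    NOTE: results on a regular file are distorted by the page cache;
    this is an illustrative sketch, not a rigorous benchmark.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        remaining = size
        pos = offset
        while remaining > 0:
            data = os.pread(fd, min(block, remaining), pos)
            if not data:  # hit end of file/device
                break
            pos += len(data)
            remaining -= len(data)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    read_bytes = size - remaining
    return (read_bytes / (1 << 20)) / elapsed if elapsed > 0 else 0.0
```

For instance, comparing `sequential_read_throughput("/dev/sdX", 0, 256 << 20)` against the same read near the end of the device (device path here is a placeholder and requires root access) would typically show the outer-track advantage described above.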