Apache Hadoop is a popular framework for big data storage and processing. However, the complexity of deploying and managing it in a physical environment deters many enterprises from using Hadoop. Virtualizing Hadoop overcomes such difficulties by providing rapid deployment and easy management.
Cloning may be used to deploy virtual machines implementing Hadoop clusters. A clone is a copy of an existing virtual machine (VM). There are two types of clone: a full clone and a linked clone. A full clone is an independent copy of a parent VM that shares nothing with the parent after the cloning operation. A linked clone is a copy of a parent VM that stores differences between the cloned VM and the parent VM in a delta virtual machine disk. Any data not in the delta disk is looked up from the virtual machine disk of the parent VM.
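The difference between the two clone types can be illustrated with a minimal sketch. It assumes a toy model in which a virtual disk is a mapping from block numbers to block contents; the names `VirtualDisk`, `LinkedClone`, and `full_clone` are hypothetical and do not correspond to any hypervisor API.

```python
class VirtualDisk:
    """Toy virtual disk (assumption): maps block number -> block contents."""

    def __init__(self, blocks=None):
        # Copy the mapping so that this disk owns its own block table.
        self.blocks = dict(blocks or {})


def full_clone(parent_disk):
    """Return an independent copy that shares nothing with the parent."""
    return VirtualDisk(parent_disk.blocks)


class LinkedClone:
    """Clone that keeps only its differences from the parent in a delta disk."""

    def __init__(self, parent_disk):
        self.parent_disk = parent_disk  # shared with the parent, never written here
        self.delta = VirtualDisk()      # holds only blocks changed by the clone

    def write(self, block_no, data):
        # Writes always land in the delta disk, leaving the parent untouched.
        self.delta.blocks[block_no] = data

    def read(self, block_no):
        # Check the delta disk first; fall back to the parent's disk otherwise.
        if block_no in self.delta.blocks:
            return self.delta.blocks[block_no]
        return self.parent_disk.blocks[block_no]
```

In this model a linked clone is cheap to create because no blocks are copied, but every read of an unmodified block depends on the parent's disk, whereas a full clone pays the copy cost up front and is independent afterwards.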
For good performance and reliability, Hadoop clusters use full clones distributed across different hosts. Unfortunately, traditional cloning is a one-to-one process, so a Hadoop cluster of hundreds or thousands of VMs takes a very long time to create. For example, cloning 150 VMs would take more than 24 hours.
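The arithmetic behind that figure can be sketched as follows. The sketch assumes the clones are created strictly one after another and that each takes the same amount of time; neither assumption is stated in the text, and the function names are hypothetical.

```python
def implied_minutes_per_clone(num_vms, total_hours):
    """Average minutes per clone if num_vms clones take total_hours in sequence."""
    return total_hours * 60 / num_vms


def sequential_clone_hours(num_vms, minutes_per_clone):
    """Total hours to create num_vms clones one at a time (assumed sequential)."""
    return num_vms * minutes_per_clone / 60


# 150 VMs in 24 hours works out to an average of about 9.6 minutes per clone.
# Because the text says "more than 24 hours", this is a lower bound.
per_clone = implied_minutes_per_clone(150, 24)

# Under the same assumptions, total time grows linearly with cluster size.
hours_for_1000 = sequential_clone_hours(1000, per_clone)
```

Under these assumptions the total time scales linearly with the number of VMs, so a 1,000-VM cluster at the same per-clone rate would need roughly 160 hours.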