The Apache Hadoop project (hereinafter “Hadoop”) is an open-source software framework for reliable, scalable, distributed processing of large data sets across clusters of commodity machines. A Hadoop cluster typically comprises a name node and multiple data nodes. Hadoop implements a distributed file system known as the Hadoop Distributed File System (HDFS). HDFS provides a unified file system for the cluster, in which the name node manages the name space of the unified file system by linking together the file systems on the data nodes. Hadoop also includes a MapReduce framework that provides a programming model for parallel processing of large data sets, together with job scheduling and cluster resource management.
Hadoop is supplemented by other Apache projects, including Apache ZooKeeper (hereinafter “ZooKeeper”) and Apache HBase (hereinafter “HBase”). ZooKeeper is a centralized service for maintaining configuration information and naming. ZooKeeper also provides distributed synchronization and group services. HBase is a scalable, distributed NoSQL database or data store that supports structured storage of large tables, where NoSQL is expanded either as Not-only Structured Query Language (Not-only SQL) or as No Structured Query Language. Generally, an HBase installation includes a region server associated with each of the data nodes and depends on ZooKeeper to coordinate the region servers. Each of the region servers works with data files called HFiles that underlie the large tables, write-ahead logs called HLogs, and other metadata in a data directory on the data node.
HBase supports several approaches to batch backup. One approach requires three MapReduce jobs: a first job to dump each table on a source cluster into a sequence file (export), a second job to copy a directory of files from the source cluster to a target cluster (distcp), and a third job to load a sequence file into a table on the target cluster (import). Another approach requires one MapReduce job to read data from one table on a source cluster and write the data to another table on a target cluster (copy table). Unfortunately, these approaches all involve the execution of table manipulation commands that incur high latency and substantially impact existing workloads. Accordingly, it would be beneficial to have a more efficient approach for backup and related purposes, such as cloning and restoration.
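For concreteness, the two batch approaches described above can be sketched as the command lines an operator would typically issue. The following Python sketch only assembles those command lines and does not execute them. The class names (Export, Import, CopyTable) and the distcp tool are those shipped with HBase and Hadoop; the helper function names, the table name, the directory URIs and the ZooKeeper quorum string are hypothetical values chosen for illustration, and option details may vary between releases.

```python
from typing import List


def export_distcp_import(table: str, src_dir: str, dst_dir: str) -> List[List[str]]:
    """Build the three MapReduce-backed steps of the export/distcp/import approach.

    src_dir and dst_dir are assumed to be HDFS URIs on the source and
    target clusters respectively (hypothetical values in the demo below).
    """
    return [
        # Step 1 (run against the source cluster): dump the table to sequence files.
        ["hbase", "org.apache.hadoop.hbase.mapreduce.Export", table, src_dir],
        # Step 2: copy the exported directory from source to target cluster.
        ["hadoop", "distcp", src_dir, dst_dir],
        # Step 3 (run against the target cluster): load the sequence files into the table.
        ["hbase", "org.apache.hadoop.hbase.mapreduce.Import", table, dst_dir],
    ]


def copy_table(table: str, peer_zk: str) -> List[str]:
    """Build the single-job copy-table approach.

    peer_zk identifies the target cluster in the form
    'quorum:clientPort:znodeParent', as expected by --peer.adr.
    """
    return [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--peer.adr={peer_zk}",
        table,
    ]


if __name__ == "__main__":
    # Hypothetical table name, paths and quorum, for illustration only.
    steps = export_distcp_import(
        "orders",
        "hdfs://source-nn:8020/backup/orders",
        "hdfs://target-nn:8020/backup/orders",
    )
    for argv in steps:
        print(" ".join(argv))
    print(" ".join(copy_table("orders", "target-zk:2181:/hbase")))
```

In both cases every row of the table is read and rewritten through the ordinary table interfaces, which is the source of the latency and workload impact noted above.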