In recent years, enterprises have seen substantial growth in data volume. Because enterprises continuously collect large datasets that record information such as customer interactions, product sales, and results from Internet advertising campaigns, the storage and analysis of large volumes of data have emerged as a challenge for enterprises both big and small, across all industries.
Recently, a data analytics framework called Hadoop has been widely adopted due to its capability of handling large sets of structured as well as unstructured data. Hadoop is an open source programming framework for distributed computing over massive data sets using a cluster of multiple nodes. Hadoop includes the Hadoop Distributed File System (HDFS) as a data storage layer and the Hadoop MapReduce framework as a data processing layer. Large data sets written to HDFS are chunked into blocks; each block is replicated and saved on different nodes. The Hadoop MapReduce framework is used for processing the large data sets. In MapReduce, a map phase transforms a large data set into intermediate key-value pairs, which are grouped by key and aggregated by a reduce phase into output key-value pairs; the results are then stored as one or more output files, or in a distributed database such as HBase.
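The map, shuffle, and reduce phases described above can be illustrated with a minimal sketch. The example below is not a real Hadoop job (which would run as Java classes on a cluster); it is a single-process Python model of the same dataflow, using word counting, the canonical MapReduce example. All function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit one intermediate (key, value) pair per word.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values observed for one key.
    return (key, sum(values))

def run_job(records):
    # Chain the three phases over a list of input records.
    intermediate = [pair for rec in records for pair in map_phase(rec)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = run_job(["big data", "data sets", "big data sets"])
# counts == {"big": 2, "data": 3, "sets": 2}
```

In a real Hadoop deployment, the map and reduce functions run in parallel on many nodes, the shuffle moves intermediate pairs across the network, and the final key-value pairs are written to HDFS or a store such as HBase rather than returned in memory.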