Developments in computer and networking technology have given rise to applications that require massive amounts of data storage. For example, millions of users can create web pages and upload images and text to a social media website. Consequently, such a website can accumulate massive amounts of data each day and therefore needs a highly scalable system for storing and processing data. Various tools exist to facilitate such mass data storage.
HBase is a popular open source database that supports large-scale, data-intensive distributed applications by enabling them to interact with a cluster of thousands of computers (also referred to as nodes) and petabytes of data. HBase is designed to manage large-scale structured datasets. In HBase, the data-intensive distributed applications store their data in structured datasets called HBase tables, which are made of rows and columns. HBase itself is implemented in Java and is modeled after Google's Bigtable. Data tables in HBase are partitioned into data segments called regions, where each region is formed by a contiguous range of rows of an HBase data table and is therefore data congruent. In other words, the regions in an HBase database are the equivalent of range partitions in a relational database management system (RDBMS).
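The range-partitioning idea described above can be illustrated with a minimal sketch. It is not HBase code: the start keys and the function name `region_index_for` are hypothetical, and the sketch assumes only that each region owns a contiguous, sorted range of row keys beginning at its start key.

```python
from bisect import bisect_right

# Hypothetical sorted region start keys; the first region starts at the
# empty key, so every possible row key falls into some region.
REGION_START_KEYS = ["", "g", "n", "t"]


def region_index_for(row_key, start_keys=REGION_START_KEYS):
    """Return the index of the region whose key range contains row_key."""
    # bisect_right returns the position of the first start key that is
    # strictly greater than row_key; the region just before it owns the row.
    return bisect_right(start_keys, row_key) - 1
```

Because rows are assigned by sorted key range, rows with adjacent keys land in the same region, which is the contiguity property the paragraph refers to.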
Further, in at least some HBase implementations, HBase runs on top of a Hadoop framework, providing Bigtable-like capabilities for Hadoop. Hadoop uses a distributed, scalable, portable file system, called Hadoop Distributed File System (HDFS), to distribute a massive amount of data among data nodes (also referred to as slave nodes) in a Hadoop data node cluster. Hadoop includes a metadata node to host a file system index. In order to reduce the adverse impact of a power outage or network failure (including switch failure) affecting a data node, Hadoop typically replicates stored data on different data nodes.
More specifically, one or more replicas of a region of data are created. In Hadoop, the replicas may be effectively stored across different data nodes by splitting the region into smaller segments of data called data blocks. The data blocks of a particular region are stored across the different data nodes. Such scattered data blocks lack the data congruity provided by the regions, where each region includes a contiguous range of rows of a data table.
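The scattering effect described above can be sketched as follows. This is an illustration under stated assumptions, not the HDFS implementation: the function names, the node names, and the round-robin placement are all hypothetical, chosen only to show how the blocks of one contiguous region end up on different data nodes.

```python
def split_into_blocks(rows, block_size):
    """Split a region's contiguous rows into fixed-size data blocks."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct data nodes.

    Round-robin placement is an assumption made for this sketch; a real
    cluster applies its own placement policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough data nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive offsets modulo the node count yield distinct nodes
        # because replication <= len(nodes).
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement
```

For example, a region of ten rows split with `block_size=4` yields three blocks, and placing them on four hypothetical nodes with `replication=2` leaves no single node holding every block of the region, so the region's contiguous row range is no longer available in one place.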