Hadoop is a software framework for storing and processing a large amount of data in a distributed fashion on large clusters of commodity hardware. It provides massive data storage and parallel data processing.
Hadoop data stores (such as Hive and Hbase) are built on top of Hadoop distributed file system (HDFS) where HDFS provides scalable and reliable data storage. HDFS is designed to span large clusters of commodity servers and automatically replicate data for availability. On one hand, HDFS provides access and location transparency for data. On the other hand, the control of data placement is lost from the data store, and the efficiency and performance of queries in those data stores suffers. HDFS by default writes a copy of data locally where a client connects to the data store when writing a file and chooses the location of a replica of the data based on some system heuristics.
Some measures and technique have been taken by those data stores to remedy the situation; but, they still fall short when compared to traditional massively parallel processing (MPP) databases. For example, Hive supports bucketed table where a table is divided into buckets through clustering of keys, but the placement of buckets is still somewhat random based on HDFS heuristics. HBase stores rows of data in tables. Tables are split into chunks of rows called “regions” which are continuous ranges within the key space. Ranges can be user specified or auto generated. Even when tables are split into continuous ranges, joins are still performed by inefficiently shuffling data among all nodes with data.