Hadoop is a distributed processing platform that allows data-intensive operations to be processed in a distributed fashion. A Hadoop cluster commonly includes a master node and a group of worker nodes, such that a job request may be divided into a plurality of tasks, and the tasks may be distributed to the worker nodes within the Hadoop platform to be processed in parallel. Hadoop Distributed File System (HDFS) is a widely used distributed file system that is designed to be highly fault tolerant, be deployed on inexpensive hardware, store large-scale data sets, and stream those data sets at high bandwidth to user applications. An HDFS cluster includes a NameNode and a configurable number of DataNodes. The NameNode is a central server responsible for managing the NameSpace, which is a hierarchy of files and directories that clients access. A DataNode is a server responsible for managing the storage attached to the node on which it resides.
Traditionally, using a solution for processing massive volumes of data such as the combination of Hadoop and HDFS, data may be stored as replicas including one original part and two copies, which may require up to 200% overhead. One thus needs 3 TB of storage space to store 1 TB of data, that is, 1 TB for the original data and 2 TB for the two replicas. Each replica may be stored on its own node or a separate server. When such data need to be processed, Hadoop may locate the fastest-responding node or server that holds a replica and perform the necessary data processing on that node or server, thereby avoiding data transfers among different entities via a network; that is, the processing is moved to the data. In handling big data workloads, however, Hadoop and HDFS may present themselves as a very high-overhead application, because this solution is designed to utilize excess CPU capacity to conserve disk and network bandwidth.
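The storage arithmetic above can be illustrated with a minimal sketch; the function name and parameters below are hypothetical and are chosen only to mirror HDFS's default replication factor of 3 (one original plus two copies).

```python
def replicated_size(original_tb: float, replication_factor: int = 3) -> float:
    """Total storage consumed when a distributed file system keeps
    `replication_factor` copies of each block (the original counts
    as one of the copies), as HDFS does by default."""
    return original_tb * replication_factor

# Storing 1 TB of data with the default replication factor of 3:
total_tb = replicated_size(1.0)                 # 3.0 TB on disk
overhead = (total_tb - 1.0) / 1.0               # 2.0, i.e., 200% overhead
```

This makes explicit why the overhead is quoted as "up to 200%": the two extra replicas each duplicate the full original data set.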
Moreover, a common term one may hear in the context of Hadoop is “Schema-on-Read,” which refers to the fact that raw, unprocessed data may be loaded into Hadoop, with structure imposed at processing time based on the requirements of the processing application. Despite the power of this feature in handling raw data, there are a number of file formats and compression formats supported on Hadoop, and each may have particular strengths that make it better suited to specific applications. Additionally, although Hadoop provides HDFS for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality. Data format compatibility with these systems would also affect system performance. As such, there are still important considerations to take into account with respect to the structure of data stored in distributed storage systems.
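The Schema-on-Read notion can be sketched as follows: the stored records are plain, unstructured text, and each consuming application parses them according to its own needs at read time. The record layout and field names here are hypothetical, chosen only for illustration.

```python
# Raw, unstructured lines as they might sit in a distributed file system;
# no schema was applied when the data were loaded.
raw_records = [
    "2024-01-05,alice,42",
    "2024-01-06,bob,17",
]

def apply_schema(line: str) -> dict:
    """Impose a (hypothetical) schema at processing time: split the raw
    line and coerce fields to the types this application requires."""
    date, user, count = line.split(",")
    return {"date": date, "user": user, "count": int(count)}

# A different application could parse the same raw lines differently;
# the structure lives in the reader, not in the stored data.
parsed = [apply_schema(r) for r in raw_records]
```

The design choice is that the storage layer stays format-agnostic, at the cost of every reader paying the parsing work on each access, which is one reason the choice of file and compression format still matters for performance.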
Accordingly, there is a need for a system and method for fast parallel data processing in distributed storage systems where data have a regular structure.