1. Technical Field
The invention relates to efficiently loading heterogeneous sources of data into a data warehouse with constantly evolving schemas. More particularly, the invention relates to a meta-data driven data ingestion using a MapReduce framework.
2. Description of the Background Art
In the big data field, a data warehouse is usually built on top of a scalable cluster system, such as Hadoop. Hadoop is an open source distributed computing environment using MapReduce. MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).
“Map” step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, i.e. the answer to the problem it was originally trying to solve.
MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, although in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of reducers can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than commodity servers can handle. Thus, large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
The Hadoop File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single data node; a cluster of data nodes form the HDFS cluster. The situation is typical because each node does not require a data node to be present. Each data node serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication; clients use RPC to communicate between each other. HDFS stores large files (an ideal file size is a multiple of 64 MB), across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, e.g. 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. Such data warehouse is of the size of hundreds of terabytes or petabytes, and the schemas of the data warehouse are constantly evolving. One practical problem in using such a system is how to load heterogeneous sources of data efficiently into a data warehouse with constantly evolving schemas.