1. Technical Field
This invention relates generally to a data warehouse framework. More specifically, this invention relates to a data warehouse built on top of MapReduce.
2. Description of the Related Art
MapReduce is a programming model for parallel analysis of large data sets, introduced by J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of OSDI, page 10, 2004. The input to an analysis job is a list of key-value pairs. Each job contains two phases, namely, the map phase and the reduce phase.
The map phase executes a user-defined map function, which transforms the input key-value pairs into a list of intermediate key-value pairs.
map(k1, v1) → list(k2, v2)
The MapReduce framework then partitions these intermediate results by key and sends them to the nodes that perform the reduce function. The user-defined reduce function is called once for each distinct key, with the list of values for that key, to produce the final results.
reduce(k2, list(v2)) → (k3, v3)
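The two primitives above can be illustrated with the canonical word-count job, sketched here as a minimal single-process driver. The function names and the driver itself are hypothetical stand-ins for the MapReduce framework, not part of the related art being described:

```python
from collections import defaultdict

# map(k1, v1) -> list(k2, v2): emit (word, 1) for each word in a line of text
def map_fn(key, line):
    return [(word, 1) for word in line.split()]

# reduce(k2, list(v2)) -> (k3, v3): sum the counts collected for one word
def reduce_fn(word, counts):
    return (word, sum(counts))

# Stand-in for the framework: group intermediate pairs by key, then call
# the reduce function once per distinct key.
def run_job(inputs):
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    return dict(reduce_fn(k2, vs) for k2, vs in intermediate.items())

result = run_job([(0, "a b a"), (1, "b c")])
# result == {"a": 2, "b": 2, "c": 1}
```

In a real deployment the grouping step is performed by the framework's distributed shuffle rather than an in-memory dictionary, but the map and reduce signatures are the same.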
An optional combiner function, similar to the reduce function, pre-aggregates the map output so as to reduce the amount of data transferred across the network. Many real-world data processing jobs, such as search engine indexing and machine learning, can be expressed as MapReduce programs using these two simple primitives.
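The combiner's pre-aggregation effect can be sketched as follows; the `combine` function below is a hypothetical illustration that sums counts on a map node's local output before the shuffle:

```python
from collections import defaultdict

# Runs on each map node's local output before data is sent over the
# network; repeated keys are collapsed into a single partial aggregate.
def combine(map_output):
    local = defaultdict(int)
    for word, count in map_output:
        local[word] += count
    return list(local.items())

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(pairs)
# Four pairs shrink to two partial sums: ("a", 3) and ("b", 1)
```

Because the combiner emits the same key-value shape the reduce function expects, for associative and commutative aggregations (such as summation) the reduce function itself can often serve as the combiner.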
While a MapReduce system is fairly flexible and scalable, users have to spend substantial effort writing a MapReduce program due to the lack of a declarative query interface. Also, since MapReduce is just an execution model, the underlying data storage and access methods are left entirely to users to implement. While this certainly provides flexibility, it also misses optimization opportunities when the data has structure. Relational databases have long addressed these issues: they provide a declarative query language, i.e., SQL, and their storage and data access methods are highly optimized as well.