The advent of Web 2.0 has revolutionized the Internet and Internet-based applications. It has given users the freedom to interact and collaborate over the Internet in ways that were not possible earlier. This has resulted in an explosive increase in user-generated content from blogs, social media sites and various other Web 2.0 technologies. Contributing to this information explosion is the already accumulating business intelligence data that is generated every day across various companies and industries. Such huge volumes of data have necessitated the creation of storage-aware paradigms that can load and retrieve data efficiently. One such paradigm is the MapReduce model.
The MapReduce model is a programming paradigm for performing parallel computations over distributed (typically very large) data sets (see Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004). The MapReduce framework includes one or more mappers and one or more reducers. Each mapper performs a map job that divides the input and generates intermediate results, and each reducer performs a reduce job that aggregates the intermediate results to generate the final output. The MapReduce model uses data parallelism to perform distributed computations on massive amounts of data using a cluster of servers.
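The map and reduce roles described above can be illustrated with a minimal single-process sketch. It is not a distributed implementation: the function names (run_map_reduce, word_mapper, count_reducer) and the word-count task are assumptions chosen only for illustration, and the shuffle step that a real framework performs across servers is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run one map/shuffle/reduce pass in a single process (illustrative only)."""
    intermediate = defaultdict(list)
    # Map phase: every input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: group the intermediate values by key.
            intermediate[key].append(value)
    # Reduce phase: aggregate the values collected for each key.
    return {key: reducer(key, values) for key, values in intermediate.items()}


def word_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every word in a line of text.
    for word in line.split():
        yield word, 1


def count_reducer(key, values):
    # Hypothetical reducer: sum the counts emitted for one word.
    return sum(values)


counts = run_map_reduce(["a b a", "b c"], word_mapper, count_reducer)
```

In a real cluster the records would be partitioned across mappers and the grouped intermediate results would be sent over the network to the reducers; the sketch only shows the data flow.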
The MapReduce model is often used to perform SQL-like operations on huge volumes of data. In a typical analytics application, it is used to perform a join operation that stitches a factual data table (typically very large) to a metadata table (typically much smaller than the factual data table), and then to perform further operations such as projection, filtering and aggregation. Unfortunately, the join operation is fairly expensive, as each join step involving a different join criterion requires its own complete MapReduce phase.
There have been several approaches to optimizing the join operation. One approach is called Map-Side join. In a Map-Side join, the smaller table is loaded into the memory of the servers performing the map job. During the map phase, each tuple from the larger table is taken in turn and the value of its join key is looked up in the smaller table held in memory. On finding a corresponding tuple in the smaller table, a join operation is performed between the two tuples. However, this approach fails when the smaller table is too large to be loaded into memory, which is a particular concern for metadata tables that are likely to grow steadily. Therefore, this approach is often infeasible in practice.
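The Map-Side join described above can be sketched as follows. This is a simplified single-process illustration under stated assumptions: tables are lists of dictionaries, the join is an equi-join on one column, and the names (map_side_join, facts, metadata, user_id) are hypothetical. It also assumes that the smaller table fits in memory, which is exactly the condition under which the approach breaks down.

```python
def map_side_join(large_rows, small_rows, key):
    """Join each tuple of the larger table against an in-memory smaller table."""
    # Load the smaller table into memory, indexed by the join key.
    # Assumption: the whole smaller table fits in memory.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Map phase: stream the larger table one tuple at a time and probe the
    # in-memory table with the tuple's join key.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data: a factual table and a smaller metadata table.
facts = [
    {"user_id": 1, "clicks": 5},
    {"user_id": 2, "clicks": 3},
    {"user_id": 9, "clicks": 1},  # no matching metadata row
]
metadata = [
    {"user_id": 1, "country": "IN"},
    {"user_id": 2, "country": "US"},
]

joined = list(map_side_join(facts, metadata, "user_id"))
```

No reduce phase is needed in this scheme, because every mapper holds a full copy of the smaller table.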
Another such approach is called Semi Join. Semi Join refers to a type of Map-Side join in which only those rows of the smaller table that will actually be required during the join operation are transmitted during the map phase. The current implementation of Semi Join uses three MapReduce jobs to perform the join operation (see Blanas, Patel et al., “A Comparison of Join Algorithms for Log Processing in MapReduce”, SIGMOD 2010). The first MapReduce job identifies all the distinct join keys present in the larger table and generates a look-up table from these keys. The second MapReduce job identifies the rows in the smaller table whose join keys correspond to values in the look-up table. The third MapReduce job loads the identified rows into memory and performs a Map-Side join with the larger table. However, using three MapReduce jobs involves shuffling a large volume of data across the network and is therefore time-consuming. Moreover, using three MapReduce jobs is expensive in terms of memory consumed and processor time used. In addition, when the number of identified rows is large, it may be infeasible to load all of them into memory.
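The three jobs of the Semi Join described above can be sketched in a single process as follows. This is an illustration of the data flow only, not the cited implementation: each job is reduced to a plain Python step, tables are assumed to be lists of dictionaries, and the names (semi_join, facts, metadata, user_id) are hypothetical.

```python
def semi_join(large_rows, small_rows, key):
    """Return (joined rows, rows of the smaller table that were actually needed)."""
    # Job 1: collect the distinct join keys present in the larger table.
    distinct_keys = {row[key] for row in large_rows}
    # Job 2: keep only the rows of the smaller table whose key is in that set.
    needed = [row for row in small_rows if row[key] in distinct_keys]
    # Job 3: load the identified rows into memory and do a Map-Side join.
    lookup = {}
    for row in needed:
        lookup.setdefault(row[key], []).append(row)
    joined = [
        {**match, **row}
        for row in large_rows
        for match in lookup.get(row[key], [])
    ]
    return joined, needed


# Hypothetical sample data: only user 1 appears in the factual table.
facts = [{"user_id": 1, "clicks": 5}, {"user_id": 1, "clicks": 2}]
metadata = [
    {"user_id": 1, "country": "IN"},
    {"user_id": 2, "country": "US"},
    {"user_id": 3, "country": "DE"},
]

joined, needed = semi_join(facts, metadata, "user_id")
```

The sketch makes the cost visible: the larger table is read in both the first and the third job, and the third job still depends on the identified rows fitting in memory.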
Another such approach is called Per Split Semi Join. Large tables are often stored by splitting them across various servers; each resulting segment is called a split. Per Split Semi Join refers to a type of Semi Join in which the Semi Join operation is performed on each split of the larger table rather than on the entire table. However, this approach also uses three MapReduce jobs and therefore suffers from the disadvantages mentioned above. Moreover, because there are multiple splits, a tuple of the smaller table whose join key appears in more than one split is repeated in the look-up table of each such split. This results in redundancy and often causes an explosion in the size of the resulting look-up tables.
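The redundancy described above can be made concrete with a small sketch that builds one look-up table per split. The function name (per_split_lookup_tables), the table layout and the sample data are assumptions made for illustration; only the filtering step of the Semi Join is shown.

```python
def per_split_lookup_tables(splits, small_rows, key):
    """Build one look-up table per split of the larger table."""
    tables = []
    for split in splits:
        # Distinct join keys seen in this split only.
        split_keys = {row[key] for row in split}
        # Rows of the smaller table needed by this split.
        tables.append([row for row in small_rows if row[key] in split_keys])
    return tables


# Hypothetical sample data: user 1 appears in both splits of the larger table.
splits = [
    [{"user_id": 1, "clicks": 5}, {"user_id": 2, "clicks": 3}],
    [{"user_id": 1, "clicks": 7}, {"user_id": 3, "clicks": 1}],
]
metadata = [
    {"user_id": 1, "country": "IN"},
    {"user_id": 2, "country": "US"},
    {"user_id": 3, "country": "DE"},
]

tables = per_split_lookup_tables(splits, metadata, "user_id")
total_rows = sum(len(table) for table in tables)
```

Here the metadata table has three rows, but the per-split look-up tables hold four rows in total, because the row for user 1 is copied once per split that references it. With many splits and frequently occurring keys, this duplication grows accordingly.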
In light of the above discussion, there is a need for a method and system for performing join operations in the MapReduce model that overcomes the problems stated above.