Map-reduce is a programming framework that supports distributed computing on large data sets across clusters of computers. The map-reduce framework includes one or more map-reduce jobs. Each map-reduce job has a map phase, which divides the input and generates intermediate results, and a reduce phase, which aggregates the intermediate results to generate a final output.
A map-reduce model is used to implement efficient database query mechanisms. Further, a map-reduce model is used to perform SQL-like operations on huge volumes of data. Furthermore, a map-reduce model is used for processing crawled documents, web request logs, etc. For example, consider a database that holds employees' names and their respective identity numbers as a first data structure, and employees' identity numbers and their respective work departments as a second data structure. If the name, identity number and department of each employee have to be obtained, a map-reduce operation can be used. The map job partitions the first data structure according to the employees' identity numbers. Further, the map job partitions the second data structure in the same way. The partitioned data structures are then given to a reducer. The reduce job joins the output of the map job using the identity number as a reduce key, and thereby reduces the partitioned data structures into a single output. The output of the reduce job is the name of each employee together with the employee's identity number and department.
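The employee example above can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle and reduce steps, not a distributed implementation; the sample records and the function names (`map_names`, `map_departments`, `shuffle`, `reduce_join`, `run_join`) are hypothetical and chosen for this sketch.

```python
from collections import defaultdict

# Hypothetical sample data for the two data structures.
names = [("Alice", 1), ("Bob", 2), ("Carol", 3)]                   # (name, identity number)
departments = [(1, "Engineering"), (2, "Sales"), (3, "Finance")]  # (identity number, department)


def map_names(record):
    """Map step for the first data structure: key each record by identity number."""
    name, emp_id = record
    return emp_id, ("name", name)


def map_departments(record):
    """Map step for the second data structure: key each record by identity number."""
    emp_id, dept = record
    return emp_id, ("dept", dept)


def shuffle(pairs):
    """Group mapped values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(emp_id, values):
    """Reduce step: combine the tagged values that share one identity number."""
    found_names = [v for tag, v in values if tag == "name"]
    found_depts = [v for tag, v in values if tag == "dept"]
    return [(n, emp_id, d) for n in found_names for d in found_depts]


def run_join(name_records, dept_records):
    mapped = [map_names(r) for r in name_records]
    mapped += [map_departments(r) for r in dept_records]
    grouped = shuffle(mapped)
    output = []
    for emp_id in sorted(grouped):
        output.extend(reduce_join(emp_id, grouped[emp_id]))
    return output


result = run_join(names, departments)
```

In this sketch the identity number plays the role of the reduce key: every record from either data structure that carries the same identity number reaches the same call to `reduce_join`.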
The output of the map-reduce operation is obtained after performing a join operation. The join operation joins the partitioned data in the required form. There are several approaches to optimize the join operation. One such approach is Map-Side Join. In Map-Side Join, the smaller data structure is loaded into the memory of the servers performing the map job. During the map phase, a single record from the larger data structure is taken and the corresponding value of its join key is queried against the smaller data structure held in memory. However, this approach fails when the smaller data structure is too large to be loaded into memory. Therefore, this approach is often infeasible in an environment where memory is limited.
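A minimal sketch of the Map-Side Join idea follows, assuming for illustration that the name table is the larger data structure and the department table is the smaller one. The function name `map_side_join` and the sample records are hypothetical. The in-memory dictionary stands in for the data loaded on each map server, which is exactly the part that stops working when the smaller data structure does not fit in memory.

```python
def map_side_join(large_records, small_table):
    """Join each record of the larger data structure against an in-memory lookup.

    large_records: iterable of (name, identity number) pairs, streamed one at a time.
    small_table:   iterable of (identity number, department) pairs, loaded fully.
    """
    # The whole smaller data structure must fit in memory for this to work.
    lookup = dict(small_table)
    for name, emp_id in large_records:
        dept = lookup.get(emp_id)
        if dept is not None:  # skip records whose join key has no match
            yield (name, emp_id, dept)


joined = list(
    map_side_join(
        [("Alice", 1), ("Bob", 2), ("Dan", 9)],
        [(1, "Engineering"), (2, "Sales")],
    )
)
```

No reduce phase is needed in this sketch, because each record of the larger data structure is completed during the map phase itself.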
Another such approach is Semi Join. Current implementations of Semi Join use three map-reduce jobs to perform the join operation. However, using three map-reduce jobs involves shuffling a large volume of data across the network and is therefore time consuming. Moreover, using three map-reduce jobs is expensive in terms of memory consumed and processor time used. In addition, when the number of identified rows is large, it is infeasible to load all the identified rows into memory.
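The text does not spell out what the three jobs do. The sketch below assumes one common formulation: the first job collects the distinct join keys of the larger table, the second job identifies the rows of the smaller table that carry those keys, and the third job joins the identified rows against the larger table. The function names and sample rows are hypothetical, and each function is a single-process stand-in for a full map-reduce job.

```python
def job1_unique_keys(large):
    """Job 1 (assumed): collect the distinct join keys present in the larger table."""
    return {key for key, _ in large}


def job2_identify_rows(small, keys):
    """Job 2 (assumed): keep only the rows of the smaller table whose key is in use."""
    return [(key, value) for key, value in small if key in keys]


def job3_join(large, identified_rows):
    """Job 3 (assumed): join the larger table against the identified rows.

    The identified rows are held in memory here, which is the step that
    becomes infeasible when the number of identified rows is large.
    """
    lookup = dict(identified_rows)
    return [(key, payload, lookup[key]) for key, payload in large if key in lookup]


# Hypothetical sample tables: (join key, payload) and (join key, attribute).
large_table = [(1, "login"), (1, "logout"), (3, "login")]
small_table = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dan")]

keys = job1_unique_keys(large_table)
identified = job2_identify_rows(small_table, keys)
semi_joined = job3_join(large_table, identified)
```

Each of the three steps reads its input again and writes an intermediate result, which is where the extra shuffling, memory and processor cost described above comes from.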
Another such approach is called Per Split Semi Join. Per Split Semi Join refers to a type of Semi Join in which the Semi Join operation is performed for a segment of the larger table and not for the entire table. Large tables are often stored by splitting them across various servers; each resulting segment is called a split. However, this approach also uses three map-reduce jobs and therefore suffers from the disadvantages mentioned above. Moreover, due to the existence of multiple splits, a record of the smaller table whose join key appears in more than one split will be repeated for each such split. This results in redundancies and often causes an explosion in the size of the resulting lookup tables.
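The redundancy described above can be illustrated with a small sketch. It assumes that one lookup table is built per split by filtering the smaller table on the join keys found in that split; the function name `per_split_lookups` and the sample data are hypothetical.

```python
def per_split_lookups(splits, small):
    """Build one lookup table per split of the larger table (assumed scheme)."""
    lookups = []
    for split in splits:
        keys = {key for key, _ in split}
        lookups.append([(key, value) for key, value in small if key in keys])
    return lookups


# Hypothetical data: join key 1 occurs in both splits of the larger table.
splits = [
    [(1, "login"), (2, "login")],
    [(1, "logout"), (3, "login")],
]
small_table = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dan")]

lookups = per_split_lookups(splits, small_table)

# Rows stored across all per-split lookup tables.
rows_per_split = sum(len(table) for table in lookups)

# Rows that a single lookup table over the whole larger table would hold.
all_keys = {key for split in splits for key, _ in split}
rows_global = len([row for row in small_table if row[0] in all_keys])
```

With this data the row for key 1 is stored twice, once per split, so `rows_per_split` exceeds `rows_global`. The gap grows with the number of splits that share join keys.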
In light of the above discussion, there is a need for a method and system that overcome the above-stated problems.