1. Field of the Invention
The present invention relates to a distributed parallel computing system. More particularly, the present invention relates to an apparatus and method for combinatorial computing in a parallel computing system.
2. Related Art
The development of information technology provides increasingly rich and powerful applications and services, and at the same time places increasing demands on the computing capacity of processing equipment. Although processors keep getting faster, the amount of information and data to be processed is immense, and distributed parallel computing has become a practical solution.
Distributed parallel computing is a solution in which a processing task is dispersed to a plurality of processors to be executed concurrently. Many implementations of massively parallel computing exist today, of which the most important and frequently used is the MapReduce model.
MapReduce is a concise parallel computing model whose name comes from its two core operations, Map and Reduce. Both concepts come from functional programming languages. Briefly, Map maps a set of data one-to-one to another set of data according to a mapping rule specified by a user-defined function. Reduce combines and reduces a set of data according to a user-defined function.
During the process of the Map, data are processed in parallel separately and independently, while during the process of the Reduce, the separated data are combined together. Therefore, by using the separation and combination of the Map and Reduce operations, we can divide a complex and huge task into many jobs to be computed in parallel, and then synthesize the results of parallel computing, thus obtaining the desired results.
FIG. 1 schematically shows the basic structure of the MapReduce model in the prior art. In the MapReduce model shown in FIG. 1, each mapping unit reads input data from the corresponding data source in the format of key value pairs (k, v), and maps the input key value pairs (k, v) to new key value pairs (referred to as intermediate key value pairs) according to a user-defined function. Then, in the Reduce process, the intermediate key value pairs with the same key are sent to the same reducing unit, where the results are synthesized.
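The flow described for FIG. 1 can be illustrated with a minimal single-process sketch. It is not taken from any particular MapReduce implementation: the names `map_reduce`, `square_by_parity`, and `total` are hypothetical, and the grouping step stands in for the distribution of intermediate key value pairs to reducing units.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Imitate the Map, grouping, and Reduce steps in a single process."""
    groups = defaultdict(list)
    for k, v in records:
        # Map: each input pair (k, v) is mapped independently to one
        # intermediate pair (k2, v2) by the user-defined function.
        k2, v2 = map_fn(k, v)
        # Grouping: intermediate pairs with the same key end up together,
        # as if they were sent to the same reducing unit.
        groups[k2].append(v2)
    # Reduce: each group is combined by the user-defined function.
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

def square_by_parity(k, v):
    """Hypothetical mapping rule: key by parity, value is the square."""
    return ("even" if v % 2 == 0 else "odd"), v * v

def total(key, values):
    """Hypothetical reducing rule: sum all values that share a key."""
    return sum(values)

records = [(0, 1), (1, 2), (2, 3), (3, 4)]
result = map_reduce(records, square_by_parity, total)
```

Note that `map_fn` receives exactly one pair (k, v) per call, which is the single key value pair restriction discussed in this section.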
As described above, in the existing MapReduce model, the input data must be in the format of a single key value pair. That is, the mapping unit can only operate on a single key value pair (k, v), and therefore only supports input from a single data source. For many applications, however, the requirement that input take the form of a single key value pair is too strict for parallel computing design. In fact, many applications use a plurality of data sources as input data, and it is desired to perform combinatorial computing on the plurality of data sources. In this case, the existing MapReduce model is severely limited. Next, combinatorial computing on several sets of input data will be illustrated with two examples.
In one example, the parallel computing system is used to configure an array antenna. Since all the information relating to the array antenna is stored in the form of matrices, the computing system needs to perform various kinds of calculations on large matrices. For an m*s Matrix A, if we multiply A by a constant λ, then according to the existing MapReduce model we can set the input key value pair (k1, v1) as k1=the row number of the matrix, v1=the matrix elements of the corresponding row, and set the mapping function as f (k1, v1)=(k1, λv1), thereby obtaining the mapped key value pair (k2, v2)=f (k1, v1), which stands for the result of multiplication by a constant.
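As a small illustration of the mapping function f (k1, v1)=(k1, λv1), the following sketch applies it row by row to a toy matrix. The name `scale_row` and the sample values are assumptions made for the example only.

```python
def scale_row(k1, v1, lam):
    """Mapping function f(k1, v1) = (k1, lam * v1) for one matrix row."""
    return k1, [lam * x for x in v1]

# Toy 2*3 Matrix A and constant lambda (illustrative values only).
A = [[1, 2, 3], [4, 5, 6]]
lam = 2

# Input key value pairs: k1 = row number, v1 = elements of that row.
records = list(enumerate(A))

# Each row is mapped independently, so the rows could be processed in parallel.
scaled = dict(scale_row(k1, v1, lam) for k1, v1 in records)
```

Each call needs only its own row, which is why this computation fits the single key value pair input without any further design effort.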
However, if we multiply Matrix A by another s*n Matrix B, then according to the definition of matrix multiplication it is necessary to perform combinatorial operations on the elements of Matrices A and B simultaneously; that is, it is desired to take the elements of the two matrices simultaneously as input data. In the prior MapReduce computing system, as the mapping unit can only receive a single key value pair as input, programmers usually have to split and distribute the elements of Matrix B via complex algorithms, and design complex input key value pairs to achieve the multiplication of two matrices.
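To give a sense of the kind of key design this requires, the sketch below shows one commonly used single-input encoding, in which every element is tagged with the name of its matrix and replicated to each output cell that needs it. This is only one possible workaround and is not necessarily the one the programmers referred to above would use; `mm_map`, `mm_reduce`, and `multiply` are hypothetical names.

```python
from collections import defaultdict

def mm_map(key, value, m, n):
    """key = (matrix name, row, col); value = the element at that position."""
    name, r, c = key
    if name == "A":
        # A[r][c] contributes to every output cell C[r][j].
        return [((r, j), ("A", c, value)) for j in range(n)]
    # B[r][c] contributes to every output cell C[i][c].
    return [((i, c), ("B", r, value)) for i in range(m)]

def mm_reduce(values):
    """Pair up A and B elements by their shared index and sum the products."""
    a = {k: v for name, k, v in values if name == "A"}
    b = {k: v for name, k, v in values if name == "B"}
    return sum(a[k] * b[k] for k in a if k in b)

def multiply(A, B):
    m, s, n = len(A), len(B), len(B[0])
    # Both matrices are flattened into one stream of single key value pairs.
    records = [(("A", i, k), A[i][k]) for i in range(m) for k in range(s)]
    records += [(("B", k, j), B[k][j]) for k in range(s) for j in range(n)]
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mm_map(key, value, m, n):
            groups[k2].append(v2)
    return [[mm_reduce(groups[(i, j)]) for j in range(n)] for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = multiply(A, B)
```

In this encoding every element of A is copied n times and every element of B is copied m times, so the number of intermediate key value pairs grows as 2*m*s*n.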
In another example, the parallel computing system is used to implement a recommendation function, which is widely used by online shops. In particular, after a user m purchases a commodity n, the system can record the rating Rm-n given to the commodity n by the user m, in order to analyze the similarity S between commodities.
Once the similarities among all commodities have been obtained, when a user is purchasing a commodity i, the system can select by calculation the commodity having the greatest similarity S with the commodity i, and recommend it to the user.
Generally, the rating data used to calculate similarities are recorded in shared files, for example in HDFS (Hadoop Distributed File System), in the form of matrices, tables, etc.
For calculating similarities among commodities, in one algorithm, the similarity between Commodity i and Commodity j is defined as:
$$S(\mathrm{Com}_i, \mathrm{Com}_j) = \frac{\sum_{\mathrm{User}\,m} R_{m-i} \times R_{m-j}}{\sqrt{\sum_{\mathrm{User}\,m} R_{m-i}^{2}} \times \sqrt{\sum_{\mathrm{User}\,m} R_{m-j}^{2}}} \tag{1}$$
Obviously, the calculation of the above similarity involves the ratings given to the two Commodities i and j by users. However, as described above, the prior MapReduce system can only read a single key value pair as input. Therefore, we usually take the ratings given to a Commodity i by various users as the input data, that is, we set the input key value pair as (Commodity i, (User 1, R1-i) (User 2, R2-i) (User 3, R3-i) . . . ).
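When the rating lists of both commodities are available at the same place, formula (1) is straightforward to evaluate, as the following sketch suggests. It assumes that formula (1) is the cosine form, with a square root over each sum in the denominator, and that a user who has not rated a commodity contributes nothing to the numerator. The function name `similarity` and the sample ratings are illustrative only.

```python
import math

def similarity(ratings_i, ratings_j):
    """Evaluate formula (1) from two lists of (user, rating) pairs.

    Assumption: cosine form, i.e. each sum of squares is under a square root.
    """
    ri = dict(ratings_i)
    rj = dict(ratings_j)
    # Numerator: sum over users who rated both commodities.
    numerator = sum(ri[m] * rj[m] for m in ri.keys() & rj.keys())
    # Denominator: product of the two rating-vector norms.
    norm_i = math.sqrt(sum(r * r for r in ri.values()))
    norm_j = math.sqrt(sum(r * r for r in rj.values()))
    denominator = norm_i * norm_j
    # Guard against commodities with no non-zero ratings.
    return numerator / denominator if denominator else 0.0

# Values of the input key value pairs for Commodity i and Commodity j.
ratings_i = [(1, 3.0), (2, 4.0)]
ratings_j = [(1, 4.0), (2, 3.0)]
s_ij = similarity(ratings_i, ratings_j)
```

The function needs the values of two key value pairs at once, which is exactly what a mapping unit limited to a single key value pair cannot receive.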
In order to calculate the similarity between Commodities i and j, in one solution, the parallel computing unit in the MapReduce system reads Commodity j-related data from the HDFS shared files, and combines the data with the received rating information on Commodity i to calculate the similarity.
However, as all the computing units need to access the HDFS system via the network, such a solution can result in enormous network IO, thereby reducing the computing capacity. In another solution, the MapReduce system first converts the received key value pair data and, by indexing based on users, reorganizes them as (User m, (Commodity 1, Rm-1) (Commodity 2, Rm-2) (Commodity 3, Rm-3) . . . ), thereby obtaining Rm-i and Rm-j, and then performs the summation over m according to formula (1).
Nevertheless, the above process of conversion and calculation can produce a large number of intermediate key value pairs. These intermediate key value pairs need to be distributed among the various computing units of the MapReduce system, which creates a risk of IO congestion in the system and reduces the computing capacity and the execution efficiency.
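The user-indexed conversion described above can be sketched as follows, again in a single process and assuming the cosine form of formula (1). The names `invert_by_user` and `pairwise_similarities` are hypothetical, and the counter only tallies the per-user commodity pairs, as a rough indication of how many intermediate key value pairs a distributed version would have to move.

```python
import math
from collections import defaultdict
from itertools import combinations

def invert_by_user(commodity_records):
    """(commodity, [(user, rating), ...]) -> {user: [(commodity, rating), ...]}."""
    by_user = defaultdict(list)
    for commodity, user_ratings in commodity_records:
        for user, rating in user_ratings:
            by_user[user].append((commodity, rating))
    return by_user

def pairwise_similarities(commodity_records):
    by_user = invert_by_user(commodity_records)
    dot = defaultdict(float)  # numerator of formula (1) per commodity pair
    sq = defaultdict(float)   # sum of squared ratings per commodity
    pair_count = 0            # per-user commodity pairs generated
    for items in by_user.values():
        for commodity, rating in items:
            sq[commodity] += rating * rating
        # Every pair of commodities rated by the same user yields one product.
        for (ci, ri), (cj, rj) in combinations(sorted(items), 2):
            dot[(ci, cj)] += ri * rj
            pair_count += 1
    sims = {}
    for (ci, cj), d in dot.items():
        denominator = math.sqrt(sq[ci]) * math.sqrt(sq[cj])
        sims[(ci, cj)] = d / denominator if denominator else 0.0
    return sims, pair_count

records = [
    ("i", [(1, 3.0), (2, 4.0)]),
    ("j", [(1, 3.0), (2, 4.0)]),
    ("k", [(1, 1.0)]),
]
sims, pair_count = pairwise_similarities(records)
```

A user who has rated c commodities contributes c*(c-1)/2 pairs, so the volume of intermediate data grows quadratically with the number of commodities rated per user.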
Besides the two examples mentioned above, there are still many applications depending on multiple data sources. Because of the limitations of the prior MapReduce parallel computing system, the execution of these applications can face problems similar to the above examples. Therefore, it is desired to provide a solution which can improve the prior parallel computing to further enhance the computing capacity.