Generally, processing hundreds of gigabytes (GB) of data, e.g. log files, on a common computer may take several hours or more, while processing dozens of terabytes (TB) of data may take hundreds of hours, which is too long. Data distribution is adopted to solve this problem. In data distribution, the set of all data is called a table, and a set containing a part of all data is called a sub-table; a table may include a number of sub-tables, and these sub-tables are distributed to a number of computing nodes, e.g. 100 computing nodes, so that each computing node computes only a part of all data. These computing nodes can thus compute simultaneously, which greatly reduces the processing time.
A data distributing system usually includes a management node and a number of computing nodes; as shown in FIG. 5, the management node and the computing nodes are interconnected through a network. The management node is responsible for distribution of data, scheduling of computing tasks and so on. Physically, the management node always coincides with a data storage management node and a cluster management node, wherein the data storage management node records the number of sub-tables of a table, the locations of the sub-tables and other information, and the cluster management node records the addresses of the nodes, their memory size, CPU capability and other information. Each computing node is responsible for storing sub-tables in a local hard disk or memory, executing tasks assigned by the management node, and saving execution results in the local hard disk or memory. The management node and the computing nodes can each be either a computer or a processor.
For example, {names of all Chinese people} is a table which includes 1.6 billion names. Since the data amount of the table is too large, the table needs to be divided into sub-tables, each containing no more than 10 thousand names. In this way, sub-tables {{people surnamed Zhang 1}, {people surnamed Zhang 2}, . . . {people surnamed Qian 1}, . . . } are formed, and the table includes 160 thousand sub-tables. These sub-tables are distributed to 1000 computing nodes; the first 160 sub-tables may be stored at the first computing node, and so on. This process is data distribution. Placing a sub-table on a computing node requires network transmission; if a computing node stores 160 sub-tables, the data of 160 sub-tables must be transmitted to that computing node. The larger the data amount is, the heavier the burden of network transmission is.
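The arithmetic of this example can be checked with a minimal sketch. The function names `split_into_subtables` and `distribute` are hypothetical and chosen only for illustration, and the contiguous block assignment is an assumption that follows the "first 160 sub-tables at the first computing node" wording rather than any particular system's placement policy.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def split_into_subtables(table, max_size):
    """Split a table (a list) into consecutive sub-tables of at most max_size elements."""
    return [table[i:i + max_size] for i in range(0, len(table), max_size)]

def distribute(subtables, num_nodes):
    """Assign consecutive blocks of sub-tables to nodes 0, 1, 2, ... (illustrative policy)."""
    per_node = ceil_div(len(subtables), num_nodes)
    return {node: subtables[node * per_node:(node + 1) * per_node]
            for node in range(num_nodes)}

# Figures taken from the example in the text.
total_names = 1_600_000_000
max_names_per_subtable = 10_000
num_nodes = 1000

num_subtables = ceil_div(total_names, max_names_per_subtable)  # 160 thousand sub-tables
subtables_per_node = ceil_div(num_subtables, num_nodes)        # 160 sub-tables per node

# A toy table small enough to inspect by hand.
toy_subtables = split_into_subtables(list(range(10)), 3)
toy_placement = distribute(toy_subtables, 2)
```

Each sub-table placed on a node in `toy_placement` corresponds to one network transfer, so the number of sub-tables per node is a direct proxy for the transmission burden described above.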
Cross join is a standard operation between two tables. Its function is similar to multiplication in daily arithmetic, except that the cross join is a product (Cartesian product) of two tables: every element of one table is combined with every element of the other. For example, table 1 is (5, 4) and table 2 is (3, 2, 6), that is, table 1 includes two elements, namely 5 and 4, and table 2 includes three elements, namely 3, 2 and 6. In daily arithmetic, 5×5=25, but the result of (5, 4) cross join (3, 2, 6) may be (5×3, 4×3, 5×2, 4×2, 5×6, 4×6)=(15, 12, 10, 8, 30, 24), or (5+3, 4+3, 5+2, 4+2, 5+6, 4+6)=(8, 7, 7, 6, 11, 10), depending on the computation applied to each pair. The results obtained in each of these examples form a table, and each of the obtained tables includes 2×3=6 elements.
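The two results above can be reproduced with a short sketch. The helper name `cross_join` and its `combine` parameter are hypothetical; the iteration order (table 2 in the outer loop, table 1 in the inner loop) is chosen only to match the order in which the results are listed in the text.

```python
from itertools import product
import operator

def cross_join(table1, table2, combine):
    """Apply `combine` to every pairing of an element of table1 with an element of table2."""
    # product(table2, table1) varies table1 fastest, matching the order used in the text.
    return [combine(a, b) for b, a in product(table2, table1)]

table1 = (5, 4)
table2 = (3, 2, 6)

products = cross_join(table1, table2, operator.mul)  # pairwise multiplication
sums = cross_join(table1, table2, operator.add)      # pairwise addition
```

Whatever `combine` does, the result always has `len(table1) * len(table2)` elements, here 2×3=6.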
Suppose that there are two computing nodes, namely computing node N1 and computing node N2. If the elements in table 1 and table 2 are taken as sub-tables, then the sub-tables {5}, {3}, {2} are distributed on N1, and the sub-tables {4}, {6} are distributed on N2. When cross join computation needs to be performed between the two tables, namely ({5}, {4}) cross join ({3}, {2}, {6}), the following problem occurs: completing the cross join requires computation between {5} and {6}, between {4} and {3}, and between {4} and {2}, but in each of these pairs the two sub-tables are at different computing nodes, so the computation cannot be performed. At present, to perform the cross join computation normally, each computing node usually stores all the sub-tables of one table, and the sub-tables of the other table are evenly distributed over the computing nodes. For example, both N1 and N2 store {3}, {2}, {6}, N1 stores {5}, and N2 stores {4}; or, both N1 and N2 store {5} and {4}, N1 stores {3}, {2}, and N2 stores {6}; there are other data distributing modes as well. However, such existing data distributing modes require a very large amount of data to be transmitted to the computing nodes, so the burden of network transmission is very heavy. The problem becomes more pronounced as the number of computing nodes to which data are distributed increases and as the numbers of sub-tables of the two tables become closer. Currently, there is no technology which can decrease the amount of data transmitted.
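The transmission cost of the existing distributing modes can be illustrated with a minimal sketch. All function names here are hypothetical, each sub-table is represented by its single element, and the even distribution is assumed to be a contiguous block split, which reproduces the two placements given in the text. Under these assumptions, replicating a table of n sub-tables on k nodes and splitting a table of m sub-tables transmits k×n+m sub-tables in total.

```python
from itertools import product

def replicate_and_split(replicated, split, num_nodes):
    """Every node stores all of `replicated`; `split` is divided into contiguous blocks."""
    per_node = -(-len(split) // num_nodes)  # ceiling division
    return [(list(replicated), list(split[n * per_node:(n + 1) * per_node]))
            for n in range(num_nodes)]

def transmitted(placement):
    """Total number of sub-tables sent over the network to all nodes."""
    return sum(len(rep) + len(part) for rep, part in placement)

def covers_all_pairs(placement, replicated, split):
    """True if every (replicated, split) pair of sub-tables is co-located on some node."""
    local = {pair for rep, part in placement for pair in product(rep, part)}
    return local == set(product(replicated, split))

table1 = [5, 4]     # sub-tables {5}, {4}
table2 = [3, 2, 6]  # sub-tables {3}, {2}, {6}

mode_a = replicate_and_split(table2, table1, 2)  # both nodes store {3}, {2}, {6}
mode_b = replicate_and_split(table1, table2, 2)  # both nodes store {5}, {4}

cost_a = transmitted(mode_a)  # 2*3 + 2 sub-tables
cost_b = transmitted(mode_b)  # 2*2 + 3 sub-tables
```

Both modes make every required pair available locally, but they transmit 8 and 7 sub-tables respectively for tables that contain only 5 sub-tables in total; the replicated term k×n grows with the number of nodes, which is the burden described above.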