Big data, also referred to as mega data, refers to data sets so large that they cannot be acquired, managed, processed, and organized within a reasonable time by using conventional software tools. With the advent of the cloud era, big data attracts increasing attention, and how to obtain useful information and knowledge from big data has become a focus of the industry. Data mining is a technology for searching for hidden information in a large volume of data by using an algorithm. Data mining generally achieves this objective by using methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (relying on past experience and rules), and pattern recognition.
In a data mining process, modeling and analysis generally need to be performed on massive data. Common modeling methods include iterative machine learning algorithms, such as linear regression, logistic regression, a neural network, or a decision tree. A learning process is executed on the data repeatedly, to continuously update particular parameters of the data mining task. Each time a round of iterative computation is complete, the effect of the temporary model generated in that round is evaluated. When a particular condition is met, the iterative process ends. Otherwise, the iterative process is executed again.
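The iterative process described above can be sketched as follows. This is a minimal illustration only, assuming linear regression fitted by gradient steps; the function name `fit_line`, the learning rate, and the stop condition (improvement in mean squared error below `tol`) are assumptions made for this example and are not specified in the text.

```python
def fit_line(xs, ys, lr=0.05, tol=1e-9, max_rounds=10_000):
    """Fit y = w*x + b by repeated gradient steps; return (w, b, rounds)."""
    w, b = 0.0, 0.0
    n = len(xs)
    prev_loss = float("inf")
    for round_no in range(1, max_rounds + 1):
        # One round of iterative computation: update the parameters.
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        w -= lr * 2 * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * 2 * sum(errs) / n
        # Evaluate the temporary model produced by this round.
        loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
        # Assumed stop condition: the loss improved by less than tol.
        if prev_loss - loss < tol:
            return w, b, round_no
        prev_loss = loss
    # The condition was never met; stop at the round limit.
    return w, b, max_rounds


if __name__ == "__main__":
    w, b, rounds = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
    print(f"w={w:.3f} b={b:.3f} after {rounds} rounds")
```

Other stop conditions, such as a fixed number of rounds or a target accuracy on held-out data, would fit the same loop.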
However, big data is characterized by a large data volume, which generally reaches the TB (1 TB = 10^12 B) or PB (1 PB = 1000 TB) level and is beyond the computing capability of an ordinary computer. Therefore, a high performance computer and a distributed cluster are generally used to perform batch processing. That is, a mining task for big data is executed in a distributed cluster computing environment by using the foregoing iterative algorithm, and the computing task of each round of iteration is allocated to computing subnodes. When the computing subnodes complete their respective computing tasks, the temporary results of all the subnodes are gathered, and the effect of the resulting combined model is evaluated. When a particular condition is met, the iterative process ends. Otherwise, new computing tasks are allocated to the computing subnodes, and the iterative process is repeated.
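The allocate, gather, and evaluate cycle described above can be sketched as follows. This is a simplified illustration: worker threads stand in for computing subnodes, the model is a single parameter `w` in y = w*x, and the names `subnode_task` and `run_mining`, the learning rate, and the loss threshold are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def subnode_task(shard, w):
    """Partial gradient and squared-error sums for y = w*x on one shard."""
    grad = sum(2 * (w * x - y) * x for x, y in shard)
    sq_err = sum((w * x - y) ** 2 for x, y in shard)
    return grad, sq_err, len(shard)


def run_mining(shards, lr=0.01, tol=1e-12, max_rounds=1000):
    """Coordinate iterative rounds over the shards; return (w, rounds)."""
    w = 0.0
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for round_no in range(1, max_rounds + 1):
            # Allocate this round's task to every subnode and gather
            # the temporary results once all of them have finished.
            results = list(pool.map(subnode_task, shards, [w] * len(shards)))
            n = sum(r[2] for r in results)
            # Evaluate the current combined model over all shards.
            loss = sum(r[1] for r in results) / n
            if loss < tol:
                return w, round_no
            # Condition not met: update the model and start a new round.
            w -= lr * sum(r[0] for r in results) / n
    return w, max_rounds


if __name__ == "__main__":
    shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)], [(5, 15), (6, 18)]]
    print(run_mining(shards))
```

Because the coordinator waits for every subnode before evaluating the combined model, each round takes as long as the slowest subnode.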
Because computing subnodes in a distributed cluster may have different computing capabilities, computing resources cannot be fully used and computing efficiency is reduced. Therefore, to improve the performance of the entire mining system, a load balancing technology is used in the prior art. When each round of iterative task is executed, the quantity of tasks of each computing subnode is dynamically adjusted according to the load status of that computing subnode. For example, in a process of executing an iterative task, when it is found that some computing subnodes have completed the iterative task and others have not, the nodes that have completed the task are considered idle nodes, and the nodes that have not completed the task are considered overloaded nodes. In this case, some task data on the overloaded nodes is transferred to the idle nodes. However, in the prior art, the volume of input task data of each computing subnode is unchanged from one round of iteration to the next, and load balancing in each round of iterative task is independent of the next round of iteration. That is, when the next round of iterative task is executed, load balancing needs to be performed again. Because data needs to be transferred between nodes during load balancing, unnecessary network consumption is incurred, and the data mining performance of the system is reduced.
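The repeated transfer cost described above can be illustrated with a small model. It is a deliberately idealized sketch, not the prior-art mechanism itself: it assumes each node's speed is known and that balancing moves data until every node's share is proportional to its speed, whereas the text describes transfers triggered when some nodes finish early. The names `rebalance` and `simulate` are hypothetical.

```python
def rebalance(assigned, speeds):
    """Move work so each node's share is proportional to its speed.

    Returns (balanced shares, volume of data transferred between nodes).
    """
    total = sum(assigned)
    total_speed = sum(speeds)
    balanced = [total * s / total_speed for s in speeds]
    # Data leaves every node that holds more than its balanced share.
    moved = sum(max(0.0, a - b) for a, b in zip(assigned, balanced))
    return balanced, moved


def simulate(rounds, initial, speeds):
    """Return the volume transferred in each round of iteration."""
    transferred = []
    for _ in range(rounds):
        # Each round starts again from the same unchanged input
        # volumes, so the balancing result of the previous round
        # is not reused.
        _, moved = rebalance(initial, speeds)
        transferred.append(moved)
    return transferred


if __name__ == "__main__":
    # Three nodes with equal input volumes but different speeds.
    print(simulate(3, [100, 100, 100], [1, 2, 3]))
```

In this model the same volume is moved in every round, which corresponds to the recurring network consumption noted above.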