Internet companies usually have a large amount of user behavior data. Machine learning methods are usually used to extract useful information such as user preferences from such data. The extracted information can be used to improve user experience and increase revenues of internet companies.
A core approach of machine learning is to find the minimum of a loss function. A loss function is a function for measuring loss or error. For example, with respect to website search, if the loss function is smaller, it means the probability that a user clicks on a found webpage is higher. Gradient descent (a gradient is a vector, a derivative of a loss function with respect to the weights) is a widely used method for finding the minimum of a loss function in machine learning. For example, it is used in various optimization techniques, given its relatively simple implementation and fast calculation. A learning rate (also called Eta) is an important parameter in weight updates. A weight is a vector and may be understood as an independent variable of a loss function. Learning rates can affect convergence in the training process. If Eta is too large, each round of iteration may go too far and miss an optimal solution. If Eta is too small, each round of iteration may be too slow and affect the speed of convergence.
At present, when large-scale machine learning is used to solve problems, training is usually performed using clusters. A cluster can include a plurality of machines. Machines may have different loads at different time points. Some machines have high computing speeds. Some machines have light communication loads and accordingly high communication efficiency. However, some machines may have very high loads and consequently very low computing speeds. Some machines may also have very low communication speeds because of low-level configurations. As a result, the entire training process can be very slow. A large amount of machine resources are required in the training process, and enormous financial costs are incurred. For example, assuming 800 machines are needed to train one user preference, and the cost of one machine per hour is C, and the training takes a total of T hours, the cost would be 800×C×T. If C is greater than 1000 and T is greater than 100, the cost of successful training is at least 80,000,000. If a failure occurs during the training process, the training process needs to be started again, and the costs can be even higher.
In view of the above, there is a need for effective solutions to the foregoing problems.