In general-purpose computing, graphics processing units (GPUs) follow a design concept entirely different from that of a central processing unit (CPU). As shown in FIG. 1, in contrast to the “multi-core” concept of the CPU, GPU design has moved in a “many-core” direction: a many-core processor consists of a large number of smaller cores, and the number of cores has kept doubling as computers have developed.
With the rapid development of GPU hardware, massively parallel processor resources have been aggregated on GPUs, which makes it practical to map the parallel portion of a general-purpose computation onto the GPU platform; using GPU technology to accelerate parallel applications has therefore become increasingly popular. However, an implementation of a deep neural network (DNN) system on a single GPU is still essentially serial, because the parallelism in current solutions lies mainly in the parallelization of matrix operations: matrices with tens of thousands of dimensions are mapped onto the GPU to speed up a single processing pass, while the parallelism available in the data being processed, between the computations of successive batches of data, and in the DNN itself is not taken into account. For deep networks with massive training data and complicated training, performance is seriously insufficient when GPUs are used for training, and model convergence often takes a week or even several weeks, which cannot meet the need to run more experiments when training large-scale networks. At present, it is very common to install a plurality of GPU cards in one server, and using multi-GPU parallel acceleration to extend the parallelism of compute-intensive applications and improve program performance is an increasingly popular direction in general-purpose computing.
A peer-to-peer data exchange model has serious performance deficiencies: when there are more than two parallel units, more data exchange cycles are required, and waiting occurs within each data exchange cycle, so idle bus bandwidth is not fully used. A new parameter exchange mechanism is therefore needed in multi-GPU data-parallel technology to overcome these deficiencies.
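The growth in exchange cycles can be pictured with a small sketch. It assumes, purely for illustration, that every pair of parallel units must exchange once and that a unit can take part in only one exchange per cycle; the function name `round_robin_cycles` and the round-robin schedule itself are hypothetical and are not taken from this description. The sketch counts cycles only and does not model bus bandwidth.

```python
def round_robin_cycles(num_units):
    """Schedule pairwise exchanges so that each unit is in at most one pair per cycle.

    Illustrative model only: uses the classic round-robin ("circle") method.
    Returns a list of cycles; each cycle is a list of (unit_a, unit_b) pairs.
    """
    units = list(range(num_units))
    if num_units % 2:
        units.append(None)  # placeholder: the unit paired with None waits this cycle
    n = len(units)
    cycles = []
    for _ in range(n - 1):
        # Pair the i-th unit from the front with the i-th unit from the back.
        pairs = [(units[i], units[n - 1 - i]) for i in range(n // 2)]
        cycles.append([p for p in pairs if None not in p])
        # Keep the first unit fixed and rotate the others by one position.
        units = [units[0]] + [units[-1]] + units[1:-1]
    return cycles


if __name__ == "__main__":
    for count in (2, 3, 4, 8):
        schedule = round_robin_cycles(count)
        waiting = [count - 2 * len(cycle) for cycle in schedule]
        print(count, "units:", len(schedule), "cycles; waiting units per cycle:", waiting)
```

Under these assumptions two units need a single cycle, while four units need three, and with an odd number of units one unit waits in every cycle.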
Once the performance problem of model training has been solved, the convergence speed of the training process must also be addressed, so that training performance is further improved at the level of the training algorithm. In the existing technology, parameters are updated with a fixed learning rate, and the training experiments are interspersed with a large amount of manual work to adjust the learning rate and judge convergence, which is complicated, tedious, and inefficient.
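Why a fixed learning rate calls for manual tuning can be seen in a minimal sketch. It assumes a toy one-parameter objective f(w) = 0.5 * curvature * w^2 rather than a real DNN; the function `steps_to_converge`, its tolerance, and the divergence threshold are hypothetical choices made only for this illustration.

```python
def steps_to_converge(learning_rate, curvature=1.0, w0=1.0, tol=1e-6, max_steps=10_000):
    """Run gradient descent with a fixed learning rate on f(w) = 0.5 * curvature * w**2.

    Returns the number of updates until |w| < tol, or None if the run
    diverges or does not converge within max_steps.
    """
    w = w0
    for step in range(1, max_steps + 1):
        grad = curvature * w          # gradient of the toy objective
        w -= learning_rate * grad     # fixed-rate parameter update
        if abs(w) < tol:
            return step
        if abs(w) > 1e12:             # treat runaway growth as divergence
            return None
    return None


if __name__ == "__main__":
    for lr in (0.01, 0.1, 0.5, 2.5):
        print("learning rate", lr, "->", steps_to_converge(lr))
```

In this toy setting a rate that is too small converges slowly and a rate that is too large diverges, so the value has to be found by trial, which mirrors the manual regulation described above.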