With the development of computer technologies, computers are applied in an increasingly wide range of fields. In addition to common office applications in everyday life, computers are also applied in some very complex fields, such as large-scale scientific computing and massive data processing, which usually place higher requirements on the processing capability of the computers. However, the processing capability of a single computer is limited and is likely to become a bottleneck that restricts system performance in such large-scale computing scenarios. The emergence of the cluster system effectively addresses this problem. A cluster system is a high-performance system formed of multiple autonomous computers and relevant resources connected through a high-speed network, in which each autonomous computer is called a compute node. In a cluster, the CPU (central processing unit) of each compute node is designed as a general-purpose computing device, and therefore its processing efficiency is usually not high in some specific application fields, such as image processing and audio processing. For this reason, many coprocessors have emerged, such as a network coprocessor, a GPU (graphics processing unit), and a compression coprocessor. These coprocessors may aid the compute node in task processing, that is, co-processing. A task in which a coprocessor aids the compute node in processing is called a co-processing task. In a massive-computation scenario of a large-scale computer system, how the coprocessor is used to aid the compute node in co-processing directly affects the work efficiency of the computer system.
In the prior art, a coprocessor is mostly added into a computer system in the form of a PCIE (peripheral component interconnect express) co-processor card. A compute node of the computer system controls the coprocessor to process a co-processing task, and the memory of the compute node is used as the data transmission channel between the co-processor card and the compute node, so that both the to-be-processed data and the data that has been completely processed by the co-processor card are transferred through that memory.
With such an architecture in the prior art, all to-be-processed data has to be transferred through the memory of the compute node, which increases memory overheads, and because of limiting factors such as memory bandwidth and delay, the co-processing speed is not high.
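The data path described above may be illustrated by a minimal sketch. It is only a toy model under stated assumptions: `HostMemory`, `coprocess`, and `run_task_via_host_memory` are hypothetical names introduced for illustration, and byte-wise inversion merely stands in for a real co-processing task. The sketch shows that, when the memory of the compute node is the only transmission channel, each payload crosses that memory twice, once on the way to the co-processor card and once on the way back.

```python
from dataclasses import dataclass


@dataclass
class HostMemory:
    """Hypothetical stand-in for the memory of the compute node."""
    bytes_staged: int = 0

    def stage(self, data: bytes) -> bytes:
        # Each buffer that passes through host memory is copied once,
        # so its length is added to the running total.
        self.bytes_staged += len(data)
        return bytes(data)


def coprocess(data: bytes) -> bytes:
    """Placeholder co-processing task (assumption: byte-wise inversion)."""
    return bytes(b ^ 0xFF for b in data)


def run_task_via_host_memory(source: bytes, mem: HostMemory) -> bytes:
    to_card = mem.stage(source)   # source -> host memory -> co-processor card
    result = coprocess(to_card)   # processing on the co-processor card
    return mem.stage(result)      # co-processor card -> host memory -> destination


mem = HostMemory()
out = run_task_via_host_memory(b"\x00\x01\x02\x03", mem)
print(mem.bytes_staged)  # total bytes that crossed host memory
```

In this model a 4-byte input causes 8 bytes of traffic through the host memory, which is the overhead that the above architecture incurs for every co-processing task.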