The present invention relates generally to communication among accelerators, and more particularly, but not by way of limitation, to a system, a method, and a recording medium for topology-aware parallel reduction in an accelerator.
A problem in the art is how to communicate efficiently among multiple accelerators, which are used in several industries.
Conventional techniques merely map a unit of work to each accelerator, which works independently, and then reduce or summarize the results from the map phase. However, because accelerators are generally very fast, they are largely idle in the reduction phase while waiting for data to arrive, which wastes expensive and powerful computing capacity.
That is, conventional techniques for synchronizing accelerators are not parallelized: the reduce task on one accelerator has to wait for tasks on the other accelerators to complete and transfer their data. The accelerators also do not efficiently leverage the full-duplex PCIe bandwidth, even though multiple accelerators on a machine are usually connected through a PCIe bus. A PCIe (Peripheral Component Interconnect Express) bus is a communication bus that connects devices such as I/O devices and accelerators, including GPUs. The PCIe channels are full-duplex, but the conventional techniques transfer data in only one direction, which further increases the wait time. An accelerator is a hardware device designed to improve the performance of certain computational operations. Examples include a graphics processing unit (GPU), which performs graphics processing faster than a CPU, and a field-programmable gate array (FPGA), which speeds up certain computation-intensive tasks. Reduce, or reduction, is a type of operation that summarizes the results from a map phase in which operations are performed in parallel by multiple workers on computing nodes. Examples of reduce operations include summation, grouping, and sorting.
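As an illustration only, the map phase and the conventional single-reducer reduction described above can be sketched as follows. This is a minimal sketch, not the claimed method: threads stand in for accelerators, and the names `map_phase`, `reduce_phase`, and `chunks` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def map_phase(chunks, worker):
    """Apply `worker` to every chunk in parallel.

    Assumption: one thread stands in for one accelerator.
    """
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map returns results in the same order as `chunks`.
        return list(pool.map(worker, chunks))


def reduce_phase(partials, op=operator.add):
    """Combine the partial results one after another on a single reducer.

    The reducer cannot start until the partial results have arrived,
    which is the waiting behavior attributed to conventional techniques.
    """
    return reduce(op, partials)


# Hypothetical input: three units of work, one per simulated accelerator.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = map_phase(chunks, sum)   # per-accelerator partial sums
total = reduce_phase(partials)      # summation as the reduce operation
```

Here summation is the reduce operation; a grouping or sorting reduction would replace `op` and the per-chunk `worker` accordingly.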
Accelerators can be connected to a computing node via PCIe devices. If multiple accelerators are connected to a single PCIe bus, the term “intra-root” is used for communication among them. The term “intra-node” is used for communication between accelerators that are not on the same PCIe bus but are on the same machine or computing node. The term “inter-node” is used for communication between accelerators on different machines.
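The three terms defined above can be expressed as a small classification routine. The sketch below is illustrative only; the `Accelerator` record and its `node` and `pcie_root` fields are hypothetical names introduced for this example and are not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    node: str       # hypothetical identifier of the machine (computing node)
    pcie_root: int  # hypothetical index of the PCIe bus on that machine


def communication_scope(a: Accelerator, b: Accelerator) -> str:
    """Classify communication between two accelerators by topology."""
    if a.node != b.node:
        return "inter-node"   # accelerators on different machines
    if a.pcie_root != b.pcie_root:
        return "intra-node"   # same machine, different PCIe bus
    return "intra-root"       # same machine, same PCIe bus


# Hypothetical topology: two machines, the first with two PCIe buses.
gpu0 = Accelerator(node="machine-a", pcie_root=0)
gpu1 = Accelerator(node="machine-a", pcie_root=0)
gpu2 = Accelerator(node="machine-a", pcie_root=1)
gpu3 = Accelerator(node="machine-b", pcie_root=0)
```

Note that the machine is compared first, so two accelerators with the same bus index on different machines are classified as inter-node.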
Thus, the present inventors have recognized that the above conventional systems, and other conventional accelerator systems, are limited in their applications in that they utilize only one direction of the full-duplex PCIe bus and do not parallelize the accelerators in a manner that reduces the waste of computing resources caused by faster components sitting idle.