The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
In various applications, such as training deep neural networks and graph analysis, a large number of parallel operations must be performed on a large amount of data. Software techniques may be used to implement such parallel applications. For example, multiple threads may be used to perform the parallel computations, with a butterfly pattern of communication between the threads. However, such software techniques may require multiple reads from and writes to memory, and because memory bandwidth is limited, the resulting performance may be less than desirable.
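By way of illustration only, the butterfly pattern of communication mentioned above may be sketched as a software reduction. The sketch below is an assumption made for clarity and is not taken from the disclosure: the function name `butterfly_allreduce`, the use of summation as the combining operation, and the power-of-two worker count are all hypothetical. Threads are simulated sequentially as list indices rather than run concurrently.

```python
def butterfly_allreduce(values):
    """Simulate a butterfly (recursive-doubling) sum across n = 2**k workers.

    Hypothetical illustration: each list index stands in for one thread,
    and the list stands in for a buffer held in shared memory.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of workers must be a power of two")

    current = list(values)
    stride = 1
    while stride < n:
        # In each stage, worker i combines its value with that of its
        # partner, worker i XOR stride. The whole buffer is read and then
        # rewritten once per stage, so log2(n) stages cost log2(n) full
        # passes over memory.
        current = [current[i] + current[i ^ stride] for i in range(n)]
        stride *= 2
    return current


if __name__ == "__main__":
    print(butterfly_allreduce([1, 2, 3, 4]))
```

In this sketch, four workers need two stages and eight workers need three, and every stage reads and rewrites every element of the buffer. That repeated traffic is the kind of memory access that may limit performance when memory bandwidth is constrained.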