The present disclosure relates generally to memory access devices and, more specifically, to data shuffling in non-uniform memory access devices.
Non-uniform memory access (NUMA) architectures have emerged as a way to improve processor performance, such as in multi-core processors. In a NUMA architecture, each socket or processing node has its own local memory, such as dynamic random access memory (DRAM), and the sockets are interconnected so that each socket can access the memory of every other socket. Thus, in NUMA architectures, access latency and bandwidth vary depending on whether a socket is accessing its own local memory or the remote memory of another socket or processing node.
At some point in the execution of an application, threads executing on the processing nodes have to exchange intermediate results, including one or both of instructions and non-instruction data, with threads executing on other processing nodes. To exchange the results, the data is copied to the local memory associated with the destination thread. The copying is performed during a shuffle operation in which each thread exchanges data with some other thread. The shuffle operation acts as a global barrier for all participating threads: it starts only after all threads have reached the barrier, and the threads resume processing only after shuffling among all of the threads is complete.
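The barrier behavior described above can be illustrated with a minimal sketch. It is not an implementation from the disclosure: the names `run_shuffle`, `outbox`, and `inbox` are hypothetical, and ordinary Python threads are used only to show the ordering of the compute, shuffle, and resume phases. The sketch does not model NUMA memory placement, latency, or bandwidth.

```python
import threading


def run_shuffle(num_threads, items_per_pair=2):
    """Illustrative barrier-synchronized shuffle (all names are assumptions)."""
    # outbox[src][dst] holds the intermediate results that thread `src`
    # produced for thread `dst`.
    outbox = [[None] * num_threads for _ in range(num_threads)]
    # inbox[dst] stands in for the local memory of destination thread `dst`.
    inbox = [[] for _ in range(num_threads)]
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Compute phase: produce intermediate results for every destination.
        for dst in range(num_threads):
            outbox[tid][dst] = [(tid, dst, k) for k in range(items_per_pair)]

        # The shuffle starts only after all threads have reached the barrier.
        barrier.wait()

        # Shuffle phase: copy the data addressed to this thread into its
        # own local buffer.
        for src in range(num_threads):
            inbox[tid].extend(outbox[src][tid])

        # Threads resume processing only after every thread has finished
        # its part of the shuffle.
        barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return inbox


if __name__ == "__main__":
    for tid, received in enumerate(run_shuffle(4)):
        print(tid, received)
```

In this sketch every thread waits twice: once before any copying begins and once after all copying ends, which mirrors the global barrier described above. Any cost difference between local and remote copies, which is the concern in a NUMA device, is outside what the sketch shows.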