A massively parallel processor (MPP) is one type of supercomputer. An MPP consists of a large number of independent computing nodes (each having processors and memory) interconnected by a specialized high-speed network. The number of nodes in an MPP can be in the thousands. An application or task running on an MPP is divided into many subtasks, each of which executes on its own node. The subtasks execute in parallel, each subtask computing a portion of the final result. In general, these individually computed results need to be combined multiple times during the execution of the overall application, and each combined intermediate result is sent back to every node running a subtask of the application.
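As a rough illustration of the division of work described above, the following minimal sketch simulates the nodes as entries of a list in a single process. The function name `run_subtasks`, the use of summation as the subtask, and the contiguous chunking scheme are assumptions made for this example only; they are not taken from the text.

```python
def run_subtasks(data, num_nodes):
    """Simulate one round of parallel subtasks followed by a combine step.

    Hypothetical helper: each "node" is just a slice of the input list.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Ceiling division so that every element lands in some chunk.
    chunk = (len(data) + num_nodes - 1) // num_nodes
    # Each simulated node computes its portion of the result (here, a sum).
    partials = [sum(data[i * chunk:(i + 1) * chunk]) for i in range(num_nodes)]
    # The individually computed results are combined ...
    combined = sum(partials)
    # ... and the combined result is sent back to every node.
    return partials, [combined] * num_nodes


partials, per_node = run_subtasks(list(range(1, 101)), num_nodes=4)
print(partials)     # partial sums, one per simulated node
print(per_node[0])  # combined value as seen by node 0
```

In a real MPP the combine step would run over the interconnect rather than in a single call to `sum`, and it would typically be repeated many times during the application.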
Frequently, the nodes participating in an application running on an MPP are organized into one or more logical-tree structures. In what may be conceived of as an inverted-tree structure, the subtasks of the application run on the leaves of the tree (the lowest level of the tree). Reduction operations are performed when partial or final results need to be combined: data are sent up the tree from the leaves to intermediate tree nodes, where the data from several individual leaf nodes are combined. Each of the intermediate results is sent up to the next level of the tree, where several pieces of data are again combined. This process continues until the root of the tree is reached and a single reduction result is computed. The reduction result can then be sent back down the tree (a scatter operation) to all of the participating nodes. The combining operations performed at each level of the tree may be arithmetic (sum, min/max) or logical (AND, OR, XOR); together, these are referred to as the specified arithmetic-logical reduction operations. The scatter operation can also be used as a broadcast to send data from a root node to all of the leaves. Together, reduction operations and scatter operations are known as “collective operations.”
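The level-by-level combining described above can be sketched in software as follows. This is a minimal single-process simulation, not the hardware engine itself. The names `OPS`, `tree_reduce`, and `scatter`, and the `fan_in` parameter (the number of children combined at each intermediate node), are hypothetical and chosen only for illustration. The function `scatter` follows the text's use of the term, delivering the same result to every leaf.

```python
import operator
from functools import reduce

# Arithmetic and logical reduction operations named in the text.
OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def tree_reduce(leaf_values, op_name, fan_in=2):
    """Combine leaf values level by level until one root value remains.

    Returns (root_value, number_of_levels_traversed).
    """
    if not leaf_values:
        raise ValueError("at least one leaf value is required")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    op = OPS[op_name]
    level = list(leaf_values)
    levels = 0
    while len(level) > 1:
        # Each intermediate node combines up to fan_in children.
        level = [reduce(op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


def scatter(result, num_leaves):
    """Send the reduction result back down to every participating leaf."""
    return [result] * num_leaves


leaves = [3, 1, 4, 1, 5, 9, 2, 6]
total, depth = tree_reduce(leaves, "sum")
print(total, depth)
print(scatter(total, len(leaves)))
```

With eight leaves and a fan-in of two, the data pass through three levels before reaching the root; a larger fan-in shortens the tree at the cost of more work per intermediate node.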
In some applications, overall performance can be limited by the time needed to move data over the network between nodes and by the time needed to perform the reduction operations. Therefore, in some embodiments, application performance can be improved by providing a network designed to move data between the nodes of a tree more efficiently. Further performance improvements can result from providing hardware that performs the collective operations as the data move up and down the tree, rather than performing the collective operations in software.
There remains a need in the art for an improved engine and method for performing collective operations in a multiprocessor.