A complex, high-performance processor may have a core and coprocessors. The core offloads tasks to its coprocessors, which perform specialized functions for which they are optimized. The type and number of coprocessors depend on system considerations such as performance requirements, the mix of tasks to be offloaded from the core, and power and size limits.
A challenge in developing scalable, distributed applications is efficient data exchange between participating computers of a distributed system. In large-scale systems, network performance and scalability are as critical as the performance of the computers. Some communication patterns such as all-to-all exchange are common within various application domains including high performance computing (HPC) and data analytics. However, these applications are traditionally difficult to optimize for network performance.
In distributed query processing, operations such as joins, aggregations, and sorts often need to repartition or otherwise redistribute data across the computers in a system, which may involve all-to-all communication. The cost of data redistribution over a network may substantially increase query execution latency. As a result, efficient data redistribution is crucial for achieving high performance and scalability in distributed query processing. System throughput is further challenged by data skew, which occurs when computers have different amounts of data to send to each other during redistribution. Traditional techniques, such as shift-pattern communication, might work well for uniform data distribution but perform poorly with data skew.
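The effect of skew on a shift-pattern exchange can be illustrated with a minimal sketch. It assumes the common formulation in which, in round r, computer i sends to computer (i + r) mod n, and it assumes rounds are barrier-synchronized so that each round lasts as long as its largest transfer. The function names, the byte matrix, and the bandwidth figure are hypothetical and chosen only for illustration.

```python
def shift_schedule(n):
    """Build n - 1 rounds; in round r, sender i is paired with (i + r) % n.

    Assumed formulation of shift-pattern communication. Each round is a
    permutation, so no receiver has more than one sender at a time.
    """
    return [[(i, (i + r) % n) for i in range(n)] for r in range(1, n)]


def round_times(schedule, bytes_to_send, bandwidth):
    """Duration of each round, assuming a barrier between rounds.

    bytes_to_send[s][d] is the amount sender s holds for receiver d.
    A round cannot finish before its largest transfer completes.
    """
    return [
        max(bytes_to_send[s][d] for s, d in pairs) / bandwidth
        for pairs in schedule
    ]


n = 4
bandwidth = 100.0  # hypothetical bytes per time unit

# Uniform case: every computer sends 100 bytes to every other computer.
uniform = [[0 if s == d else 100 for d in range(n)] for s in range(n)]

# Skewed case: computer 0 holds four times as much data for computer 1.
skewed = [row[:] for row in uniform]
skewed[0][1] = 400

schedule = shift_schedule(n)
uniform_total = sum(round_times(schedule, uniform, bandwidth))
skewed_total = sum(round_times(schedule, skewed, bandwidth))
print(uniform_total, skewed_total)
```

Under these assumptions, a single oversized transfer stretches its whole round, so the other computers in that round sit idle once their own transfers finish.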
Some systems have non-blocking, high-bandwidth networks such as InfiniBand for reducing an application's data communication time. However, when nodes perform all-to-all communication over an InfiniBand network without any scheduling or ordering, the bandwidth of the interconnect network is inefficiently utilized. As the number of communicating nodes in the system increases, the achievable network bandwidth may fall significantly below its peak.
The degradation in performance can be largely attributed to two causes: 1) contention or congestion at a receiving endpoint, and 2) contention for a common inter-switch link in a switch fabric organized as a two-level fat tree. Contention at a receiving computer occurs when multiple senders simultaneously attempt to send data to a common receiver. As a result, all computers that attempt to send data to this common receiver may experience backpressure and degraded bandwidth.
Contention or oversubscription on inter-switch links occurs when communication between two or more independent pairs of computers uses the same inter-switch link, that is, one of the links that connect the leaf and spine switches in a two-level fat tree. Because each inter-switch link can only support the peak data rate between a single pair of computers, sharing a link among more than one pair leads to a proportional reduction of bandwidth for each of the independent traffic flows that share it.
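The proportional reduction described above can be worked through with a small sketch. It assumes a simplified model in which every flow crosses exactly one inter-switch link and the flows on a link split its peak rate evenly. The flow names, link names, and peak rate are hypothetical.

```python
from collections import Counter


def per_flow_bandwidth(flow_to_link, peak_rate):
    """Bandwidth each flow obtains when flows on a link share it evenly.

    flow_to_link maps a (sender, receiver) pair to the inter-switch link
    that carries it. Even sharing is a modeling assumption.
    """
    flows_per_link = Counter(flow_to_link.values())
    return {
        flow: peak_rate / flows_per_link[link]
        for flow, link in flow_to_link.items()
    }


# Hypothetical routing: two independent pairs collide on uplink0,
# while a third pair has uplink1 to itself.
routing = {
    ("A", "C"): "uplink0",
    ("B", "D"): "uplink0",
    ("E", "F"): "uplink1",
}

rates = per_flow_bandwidth(routing, peak_rate=100.0)
for flow, rate in rates.items():
    print(flow, rate)
```

In this model, the two flows that share uplink0 each obtain half the peak rate, while the flow alone on uplink1 keeps the full rate, even though no two flows share a sender or a receiver.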
Furthermore, the above observations also apply, to varying degrees, to bulk synchronous parallel (BSP) processing, such as the shuffle phase of MapReduce. During a shuffle, each reducer pulls data from every mapper. Shuffle is another example of all-to-all data redistribution and may saturate switches. Data skew occurs when one reducer receives more data from mappers than another reducer receives during the same shuffle.
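Shuffle skew can be quantified with a short sketch. It assumes the shuffle is described by a matrix of byte counts with one row per mapper and one column per reducer, and it uses the ratio of the most loaded reducer to the mean as one possible skew measure. That measure, the function names, and the sample figures are illustrative choices and do not come from any particular MapReduce implementation.

```python
def reducer_inbound(shuffle_bytes):
    """Total bytes each reducer pulls during the shuffle.

    shuffle_bytes[m][r] is the amount mapper m holds for reducer r.
    """
    n_reducers = len(shuffle_bytes[0])
    return [sum(row[r] for row in shuffle_bytes) for r in range(n_reducers)]


def skew_ratio(inbound):
    """Heaviest reducer load divided by the mean load (1.0 means no skew)."""
    mean = sum(inbound) / len(inbound)
    return max(inbound) / mean


# Hypothetical shuffle: two mappers, two reducers.
shuffle_bytes = [
    [10, 30],  # mapper 0
    [10, 50],  # mapper 1
]

inbound = reducer_inbound(shuffle_bytes)
print(inbound, skew_ratio(inbound))
```

Here reducer 1 pulls four times as much data as reducer 0, so its inbound link is the one most likely to become the bottleneck for the shuffle.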