1. Field of the Invention
This invention relates to electrical digital data processing. More particularly, this invention relates to protocols for transmission and synchronization of digital data across a network.
2. Description of the Related Art
The meanings of certain acronyms and abbreviations used herein are given in Table 1.
TABLE 1
Acronyms and Abbreviations
ALU     Arithmetic Logical Unit
ASIC    Application Specific Integrated Circuit
CPU     Central Processing Unit
EDR     Enhanced Data Rate
GDC     Group Database Cache
HCA     Host Channel Adapter
HPC     High Performance Computing
MPI     Message Passing Interface
OOC     Outstanding Operation Context
OOT     Outstanding Operation Table
RCQP    Reliable Connected Queue Pair
RDMA    Remote Direct Memory Access
RoCE    RDMA over Converged Ethernet
SHArP   Scalable Hierarchical Aggregation Protocol
UD      Unreliable Datagram
Modern computing and storage infrastructures use distributed systems to increase scalability and performance. Common uses for such distributed systems include datacenter applications, distributed storage systems, and HPC clusters running parallel applications. While HPC and datacenter applications use different methods to implement distributed systems, both perform parallel computation on a large number of networked compute nodes, with aggregation of partial results from the nodes into a global result.
Many datacenter applications, such as search and query processing, deep learning, and graph and stream processing, typically follow a partition-aggregation pattern. An example is the well-known MapReduce programming model for processing problems in parallel across huge datasets using a large number of computers arranged in a grid or cluster. In the partition phase, tasks and data sets are partitioned across compute nodes that process data locally (potentially taking advantage of locality of data) to generate partial results. The partition phase is followed by the aggregation phase, in which the partial results are collected and aggregated to obtain a final result. The data aggregation phase in many cases creates a bottleneck on the network due to many-to-one or many-to-few types of traffic, i.e., many nodes communicating with one node or with a few nodes or controllers.
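By way of illustration only, the partition-aggregation pattern described above can be sketched as a word count in Python. The function names (partition, local_count, aggregate), the round-robin split, and the sample records are assumptions made for this sketch and are not taken from any particular MapReduce implementation. The compute nodes are simulated as list entries within a single process.

```python
from collections import Counter

def partition(records, num_nodes):
    """Split records round-robin across num_nodes simulated compute nodes."""
    return [records[i::num_nodes] for i in range(num_nodes)]

def local_count(chunk):
    """Partition phase: a node counts words using only its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def aggregate(partials):
    """Aggregation phase: one node merges every partial result (many-to-one)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical input: four records spread over three simulated nodes.
records = ["a b a", "b c", "a c c", "b"]
partials = [local_count(chunk) for chunk in partition(records, 3)]
result = aggregate(partials)
```

In this sketch every partial result must reach the single aggregating node, which corresponds to the many-to-one traffic identified above as the source of the network bottleneck.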
For example, in large public datacenters, analysis traces show that up to 46% of the datacenter traffic is generated during the aggregation phase, and network time can account for more than 30% of transaction execution time. In some cases network time accounts for more than 70% of the execution time.
Collective communication is a term used to describe communication patterns in which all members of a group of communication end-points participate. For example, in the case of the Message Passing Interface (MPI), the communication end-points are MPI processes, and the groups associated with the collective operation are described by the local and remote groups associated with the MPI communicator.
Many types of collective operations occur in HPC communication protocols, and more specifically in MPI and SHMEM (OpenSHMEM). The MPI standard defines blocking and non-blocking forms of barrier synchronization, broadcast, gather, scatter, gather-to-all, all-to-all gather/scatter, reduction, reduce-scatter, and scan. A single operation type, such as scatter, may have several different variants, such as scatter and scatterv, which differ in such things as the relative amount of data each end-point receives or the MPI data-type associated with data of each MPI rank, i.e., the sequential number of a process within a job or group.
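By way of illustration only, the difference between the scatter and scatterv variants can be sketched in Python as follows. The sketch does not call an MPI library. The functions scatter and scatterv below are hypothetical single-process stand-ins that mimic how a root buffer would be divided among ranks, with the counts and displs parameters modeled on the per-rank send counts and displacements of MPI_Scatterv.

```python
def scatter(sendbuf, num_ranks):
    """Equal split: every rank receives the same number of elements."""
    if len(sendbuf) % num_ranks != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    n = len(sendbuf) // num_ranks
    return [sendbuf[r * n:(r + 1) * n] for r in range(num_ranks)]

def scatterv(sendbuf, counts, displs):
    """Variable split: rank r receives counts[r] elements starting at displs[r]."""
    return [sendbuf[d:d + c] for c, d in zip(counts, displs)]

# Hypothetical root buffer of six elements distributed to three ranks.
data = list(range(6))
equal = scatter(data, 3)
uneven = scatterv(data, counts=[1, 2, 3], displs=[0, 1, 3])
```

In this sketch the index into each returned list plays the role of the MPI rank. With scatter every rank holds two elements, whereas with scatterv ranks 0, 1, and 2 hold one, two, and three elements respectively.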
The OpenSHMEM specification (available on the Internet from the OpenSHMEM website) contains a communications library that uses one-sided communication and utilizes a partitioned global address space. The library includes such operations as blocking barrier synchronization, broadcast, collect, and reduction forms of collective operations.
The performance of collective operations for applications that use such functions is often critical to the overall performance of these applications, as they limit performance and scalability. This comes about because all communication end-points implicitly interact with each other, with serialized data exchange taking place between end-points. The specific communication and computation details of such operations depend on the type of collective operation, as does the scaling of these algorithms. Additionally, the explicit coupling between communication end-points tends to magnify the effects of system noise on the parallel applications using these operations, by delaying one or more data exchanges, resulting in further challenges to application scalability.
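By way of illustration only, the effect of serialized data exchange and system noise on a collective operation can be sketched with a simple timing model in Python. The pairwise (binary-tree) reduction, the function name tree_finish_time, the unit step cost, and the sample delay are all assumptions chosen for this sketch. Actual collective algorithms and their costs vary by implementation.

```python
def tree_finish_time(ready_times, step_cost=1.0):
    """Model a pairwise (binary-tree) reduction over a non-empty list of ranks.

    ready_times[r] is when rank r is ready to start. Each exchange can begin
    only when both partners are ready, and it then takes step_cost time units.
    """
    level = list(ready_times)
    while len(level) > 1:
        # Each pair finishes one step after its slower member becomes ready.
        nxt = [max(level[i], level[i + 1]) + step_cost
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired rank advances unchanged
        level = nxt
    return level[0]

# Eight ranks, all ready at time 0: three sequential rounds are needed.
quiet = tree_finish_time([0] * 8)
# Same ranks, but rank 3 is delayed by 5 time units (hypothetical noise).
noisy = tree_finish_time([0, 0, 0, 5, 0, 0, 0, 0])
```

In this model the delay of a single rank is carried through every later round, so the whole group finishes late by the full delay even though seven of the eight ranks were ready on time.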
Previous attempts to mitigate the traffic bottleneck include installing faster networks and implementing congestion control mechanisms. Other optimizations have focused on changes at the nodes or endpoints, e.g., HCA enhancements and host-based software changes. While these schemes enable more efficient and faster execution, they do not reduce the amount of data transferred and thus are limited.