An Artificial Neural Network (ANN)—also referred to simply as a neural network—is a computing system made up of a number of simple, highly interconnected processing elements (nodes), which process information by their dynamic state response to external inputs. ANNs are processing devices (algorithms and/or hardware) that are loosely modeled after the neuronal structure of the mammalian cerebral cortex but on much smaller scales. A large ANN might have hundreds or thousands of processor units, whereas a mammalian brain has billions of neurons with a corresponding increase in magnitude of their overall interaction and emergent behavior. A feedforward neural network is an artificial neural network where connections between the units do not form a cycle.
In machine learning, a convolutional neural network (CNN) is a type of feed-forward artificial neural network in which the connectivity pattern between its nodes (neurons) is inspired by the organization of the animal visual cortex, whose individual neurons are arranged to respond to overlapping regions tiling a visual field. Convolutional networks mimic biological processes and are configured as variations of multilayer perceptrons designed to use minimal amounts of preprocessing while processing data, such as digital images.
Convolutional neural networks (CNN) are networks with overlapping “reception fields” performing convolution tasks. A CNN is particularly efficient in recognizing image features, such as by differentiating pixels or pixel regions in a digital image from other pixels or pixel regions in the digital image. Generally, a CNN is designed to recognize images or parts of an image, such as detecting the edges of an object recognized on the image. Computer vision is a field of endeavor where CNNs are commonly used.
A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers of units between the input and output layers. Similar to shallow ANNs, DNNs can model complex non-linear relationships. DNN architectures, e.g., for object detection and parsing, generate compositional models where the object is expressed as a layered composition of image primitives. The extra layers enable composition of features from lower layers, giving the potential of modeling complex data with fewer units than a similarly performing shallow network. DNNs are typically designed as feedforward networks.
Many large scale data-intensive applications rely on both input data and a large number of model parameters to conduct computations. Deep learning algorithms are typical examples of this category. Machine learning algorithms generate models to fit training data and then use the generated models to generate predictions for input data. Models are generally mathematical equations and/or logic having model parameters. Model training is used to find appropriate values of the model parameters, e.g., weights of neural nodes in a neural network, so that the models can provide accurate predictions. In a typical example of training of a model, a batch of image data is input to a model and computations are performed on the image data using the model to provide an output used to train the model.
As the network is trained, the neurons in the intermediate layers organize themselves in such a way that the different neurons learn to recognize different characteristics of a total input space. After training, when an arbitrary input is input to the neural network, neurons in the hidden layer of the network respond with an active output if the new input contains a pattern that resembles a feature that the individual neurons have learned to recognize during their training.
Gradients generated for different items within the same batch are accumulated during batch processing, and normalized at the end of the batch resulting in an iteration for each batch processing. Current deep learning frameworks utilize multiple local graphics processing units (GPUs) to accelerate training. Local GPUs are GPUs that are located within a single node of a machine. Distributed GPUs are GPUs that are located in different machines in communication with one another over a network.
A typical machine may include multiple GPUs located within a node of the machine (which is distinct from a neural node of a neural network), such as a non-uniform memory access (NUMA) node. A NUMA node often includes a physical CPU, memory banks, a network interface controller (NIC), and multiple GPU devices. The network devices and GPUs are typically attached to the CPU through a Peripheral Component Interconnect (PCI) root complex device. A root complex device connects the CPU and memory subsystem to each of the GPUs and the NIC. In addition, multiple machines, each having multiple GPUs, are often networked together to implement a deep learning neural network. During training of the neural network, input data and workloads are distributed over GPUs on a cluster of machines such that each GPU computes parameters for the neural network that must be aggregated and synchronized between the GPUs. Often a parameter server is used to receive parameters from each GPU, aggregate the parameters, and provide updated parameters to each of the GPUs. In other implementations, the GPUs may use peer-to-peer communication to aggregate parameters. Iterative training algorithms such as a stochastic gradient descent algorithm often require the training status or parameters (e.g., a gradient) received from different GPUs to be aggregated and synchronized every few iterations.
Conventionally, each GPU receives the parameter updates from the network independently from the other GPUs such that a network communication must be performed for every GPU implementing the neural network regardless if it is located in a machine having multiple GPUs. As a result, network traffic is greatly amplified as the number of GPUs within a single machine increases which may cause network congestion.
Accordingly, a more efficient method of providing parameter updates within a host machine having multiple GPUs is needed. Various embodiments described herein provide for the use of local multicast to distribute parameters between GPUs in a single host machine to improve network efficiency of multi-GPU based deep learning networks.