Neural networks, especially deep neural networks have been very successful in modeling high-level abstractions in data. Neural networks are computational models used in machine learning made up of nodes organized in layers. The nodes are also referred to as artificial neurons, or just neurons, and perform a function on provided input to produce some output value. A neural network requires a training period to learn the parameters, i.e., weights, used to map the input to a desired output. The mapping occurs via the function. Thus the weights are weights for the mapping function of the neural network. Each neural network is trained for a specific task, e.g., prediction, classification, encoding, etc. The task performed by the neural network is determined by the inputs provided, the mapping function, and the desired output. Training can either be supervised or unsupervised. In supervised training, training examples are provided to the neural network. A training example includes the inputs and a desired output. Training examples are also referred to as labeled data because the input is labeled with the desired output. The network learns the values for the weights used in the mapping function that most often result in the desired output when given the inputs. In unsupervised training, the network learns to identify a structure or pattern in the provided input. In other words, the network identifies implicit relationships in the data. Unsupervised training is used in deep neural networks as well as other neural networks and typically requires a large set of unlabeled data and a longer training period. Once the training period completes, the neural network can be used to perform the task it was trained for.
In a neural network, the neurons are organized into layers. A neuron in an input layer receives the input from an external source. A neuron in a hidden layer receives input from one or more neurons in a previous layer and provides output to one or more neurons in a subsequent layer. A neuron in an output layer provides the output value. What the output value represents depends on what task the network is trained to perform. Some neural networks predict a value given in the input. Some neural networks provide a classification given the input. When the nodes of a neural network provide their output to every node in the next layer, the neural network is said to be fully connected. When the neurons of a neural network provide their output to only some of the neurons in the next layer, the network is said to be convolutional. In general, the number of hidden layers in a neural network varies between one and the number of inputs.
To provide the output given the input, the neural network must be trained, which involves learning the proper value for a large number (e.g., millions) of parameters for the mapping function. The parameters are also commonly referred to as weights as they are used to weight terms in the mapping function. This training is an iterative process, with the values of the weights being tweaked over thousands of rounds of training until arriving at the optimal, or most accurate, values. In the context of neural networks, the parameters are initialized, often with random values, and a training optimizer iteratively updates the parameters, also referred to as weights, of the network to minimize error in the mapping function. In other words, during each round, or step, of iterative training the network updates the values of the parameters so that the values of the parameters eventually converge on the optimal values. Training is an iterative process that involves thousands of rounds, and sometimes hundreds of thousands of rounds, of updating the millions of parameters until the optimal parameter values are achieved. Training periods for neural networks can be long due to the number of weights to be learned and the size of the neural network. Training can take days and training of deep networks can even take weeks due to the size of the deep networks, the large number of parameters, and the size of the input datasets. To help speed training time, some neural networks use a training optimizer. The most widely used training optimizer is Stochastic Gradient Descent (SGD), although other optimizers, like Adagrad, Adadelta, RMSProp, and Adam, may also be used. Even with training optimizers, it can still take days to reach convergence, i.e., to train a neural network.