An artificial neural network, or simply “neural network,” is a computer model, resembling a biological network of neurons, which is trained by machine learning. A traditional neural network has an input layer, one or more middle or hidden layers, and an output layer. Each layer has a plurality (e.g., 100s to 1000s) of artificial “neurons.” Each neuron in a layer (N) may be connected by an artificial “synapse” to some or all neurons in a prior (N−1) layer and subsequent (N+1) layer to form a “partially-connected” or “fully-connected” neural network. The strength of each synapse connection is represented by a weight. Thus, a neural network may be represented by the set of all weights in the network.
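As a non-limiting illustration of representing a network by its set of weights, the following sketch builds a small fully-connected network as one weight matrix per pair of adjacent layers and counts the resulting synapses. The function names `init_network` and `count_weights`, the layer sizes, and the uniform random initialization range are assumptions chosen for the example only.

```python
import random


def init_network(layer_sizes, seed=0):
    """Return one weight matrix per pair of adjacent layers.

    weights[k][i][j] is the strength of the synapse from neuron j in
    layer k to neuron i in layer k + 1 (fully-connected case).
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def count_weights(weights):
    """Total number of synapse weights across all layers."""
    return sum(len(row) for matrix in weights for row in matrix)


# Hypothetical sizes: 4 input neurons, 3 hidden neurons, 2 output neurons.
layer_sizes = [4, 3, 2]
weights = init_network(layer_sizes)
total = count_weights(weights)  # 4*3 + 3*2 synapses
```

A partially-connected network could be sketched the same way by storing only the synapses that exist, for example as a mapping from neuron pairs to weights.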
A neural network (NN) is trained based on a learning dataset to solve for, or learn, the weight of each synapse, which indicates the strength of that connection. The weights of the synapses are generally initialized, e.g., randomly. Training is performed by iteratively inputting a sample dataset into the neural network, outputting a result of the neural network applied to the dataset, calculating errors between the expected (e.g., target) and actual outputs, and adjusting neural network weights using an error correction algorithm (e.g., backpropagation) to minimize errors. Training may be repeated until the error is minimized or converges. Typically, multiple passes (e.g., tens or hundreds) through the training set are performed (e.g., each sample is input into the neural network multiple times). Each complete pass over the entire training set is referred to as one “epoch.”
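The training procedure described above may be sketched, by way of example only, as the following loop. For brevity the sketch trains a single linear neuron with a per-sample gradient step on squared error (the delta rule); this is a simplified stand-in for backpropagation, which propagates the same kind of error correction through multiple layers. The function name `train`, the learning rate, the epoch count, and the toy dataset are all assumptions made for illustration.

```python
import random


def train(samples, epochs=200, lr=0.1, seed=0):
    """Train one linear neuron; return (weights, bias, per-epoch mean squared error)."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Weights are initialized randomly.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    history = []
    for _ in range(epochs):  # one epoch = one full pass over the training set
        total_error = 0.0
        for inputs, target in samples:
            # Apply the network to the sample.
            output = sum(w * x for w, x in zip(weights, inputs)) + bias
            # Error between actual and expected (target) output.
            error = output - target
            total_error += error * error
            # Adjust each weight against the gradient of the squared error.
            for i, x in enumerate(inputs):
                weights[i] -= lr * error * x
            bias -= lr * error
        history.append(total_error / len(samples))
    return weights, bias, history


# Hypothetical dataset generated from target = 2*x1 - 1*x2 + 0.5.
samples = [
    ((0.0, 0.0), 0.5),
    ((1.0, 0.0), 2.5),
    ((0.0, 1.0), -0.5),
    ((1.0, 1.0), 1.5),
]
weights, bias, history = train(samples)
```

The list `history` holds the mean squared error of each epoch, so convergence can be checked by comparing its first and last entries.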
State-of-the-art neural networks typically have between millions and billions of weights, and as a result require specialized hardware (usually a GPU) for both training and runtime (prediction) phases. It is therefore impractical to run deep learning models, even in prediction mode, on most endpoint devices (e.g., IoT devices, mobile devices, or even laptops and desktops without dedicated accelerator hardware). Effectively running deep learning models on devices with limited processing speed and/or limited memory availability remains a critical challenge today.
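To give a rough sense of the memory involved, the following sketch estimates the storage needed for the weights alone. It assumes each weight is stored as a 32-bit (4-byte) floating-point value, which is an assumption for this example and not a requirement; the helper names `weight_memory_bytes` and `to_gib` are likewise hypothetical. Activations, gradients, and other runtime buffers are not counted.

```python
BYTES_PER_WEIGHT = 4  # assumption: one 32-bit float per weight


def weight_memory_bytes(num_weights, bytes_per_weight=BYTES_PER_WEIGHT):
    """Bytes needed to store the weights alone."""
    return num_weights * bytes_per_weight


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / (1024 ** 3)


million_weights = weight_memory_bytes(1_000_000)      # 4,000,000 bytes
billion_weights = weight_memory_bytes(1_000_000_000)  # 4,000,000,000 bytes
billion_weights_gib = to_gib(billion_weights)         # roughly 3.7 GiB
```

Under this assumption, a network with one billion weights occupies several gibibytes before any computation is performed, which may exceed the memory available on many endpoint devices.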
To address the problem of limited hardware capacity, nowadays most deep learning prediction is conducted on a remote server or cloud. For example, a smart assistant (e.g., Alexa) sends information (e.g., voice signal) to the cloud, the deep learning prediction is performed remotely at the cloud on dedicated hardware, and a response is sent back to the local device. Hence, these endpoint devices cannot provide deep learning based results if they are disconnected from the cloud, if the input rate is so high that it is not feasible to continuously communicate with the cloud, or if very fast prediction is required where even the dedicated hardware is not fast enough today (e.g., deep learning for high frequency trading).
Accordingly, there is a need in the art to increase the efficiency and decrease the memory requirements of deep learning for neural networks in training and/or prediction modes.