An artificial neural network, or simply “neural network” (NN), is a computer model, resembling a biological network of neurons, which is trained by machine learning. A traditional neural network has an input layer, one or more middle or hidden layers, and an output layer. Each layer has a plurality (e.g., 100s to 1000s) of artificial “neurons.” Each neuron in a layer (N) may be connected by an artificial “synapse” to some or all neurons in a prior (N−1) layer and a subsequent (N+1) layer to form a “partially-connected” or “fully-connected” neural network. The strength of each synapse connection is represented by a weight. Thus, a neural network may be represented by the set of all weights in the network.
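As an illustration of representing a network by its weights, the following minimal sketch stores a fully-connected network as one weight matrix per pair of adjacent layers. The function names `init_network` and `count_weights`, the layer sizes, and the uniform initialization range are assumptions chosen for the example only.

```python
import random


def init_network(layer_sizes, seed=0):
    """Build one weight matrix per pair of adjacent layers (fully connected).

    Entry matrix[j][i] is the weight of the synapse from neuron i in the
    prior layer to neuron j in the subsequent layer.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def count_weights(weights):
    """Total number of synapse weights across all layers."""
    return sum(len(row) for matrix in weights for row in matrix)


# Hypothetical sizes: 4 input neurons, 3 hidden neurons, 2 output neurons.
layer_sizes = [4, 3, 2]
weights = init_network(layer_sizes)
total = count_weights(weights)  # 4*3 + 3*2 = 18 weights
```

A partially-connected network could be sketched in the same way by omitting (or zeroing) the entries for synapses that do not exist.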
A neural network is trained based on a learning dataset to solve for, or learn, the weight of each synapse indicating the strength of that connection. The weights of the synapses are generally initialized, e.g., randomly. Training is performed by iteratively inputting a sample dataset into the neural network, propagating forward through the neural network to output a result of the neural network applied to the dataset, calculating errors between the expected (e.g., target) output and the actual output, and propagating backwards through the neural network to adjust the neural network weights using an error correction algorithm (e.g., backpropagation) to minimize errors. Training may be repeated until the error is minimized or converges. Typically, multiple passes (e.g., tens or hundreds) through the training set are performed (e.g., each sample is input into the neural network multiple times). Each complete pass over the entire training dataset is referred to as one “epoch.”
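The training procedure described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes a single hidden layer with tanh activations, a linear output neuron, no bias terms, a squared-error loss, and per-sample gradient descent. The names `forward` and `train`, the learning rate, the epoch count, and the toy dataset are all hypothetical.

```python
import math
import random


def forward(w1, w2, x):
    """Forward pass: input -> tanh hidden layer -> single linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    output = sum(w * h for w, h in zip(w2, hidden))
    return hidden, output


def train(samples, n_hidden=3, epochs=300, lr=0.05, seed=0):
    """Train by backpropagation; returns the weights and per-epoch mean squared error."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Random initialization of the synapse weights.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    history = []
    for _ in range(epochs):  # one epoch = one complete pass over the dataset
        total = 0.0
        for x, target in samples:
            hidden, output = forward(w1, w2, x)
            error = output - target  # actual output minus expected output
            total += error ** 2
            for j in range(n_hidden):
                # Propagate the error back through hidden neuron j
                # (computed before w2[j] is updated).
                grad_hidden = error * w2[j] * (1.0 - hidden[j] ** 2)
                w2[j] -= lr * error * hidden[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_hidden * x[i]
        history.append(total / len(samples))
    return w1, w2, history


# Toy dataset (an assumption for illustration): the target is the sum of the inputs.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 2.0)]
w1, w2, history = train(samples)
```

Here `history` holds one error value per epoch, so comparing its first and last entries shows whether the error decreased over training.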
State-of-the-art neural networks typically have between millions and billions of weights, and as a result require specialized hardware (e.g., a GPU) for both the training and runtime (a.k.a. prediction or inference) phases. It is therefore impractical to run deep learning models, even in prediction mode, on most endpoint devices (e.g., IoT devices, mobile devices, or even laptops and desktops without dedicated accelerator hardware). Effectively running deep learning models on devices with limited processing speed and/or limited memory availability remains a critical challenge today.
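To give a rough sense of scale, the following sketch estimates the memory needed just to store the weights. It assumes each weight is stored as a 32-bit (4-byte) floating-point value, which is a common choice but is not stated in the text above; the function name `weight_memory_bytes` is hypothetical.

```python
def weight_memory_bytes(num_weights, bytes_per_weight=4):
    """Memory needed to hold the weights alone (assumes 4-byte float32 values)."""
    return num_weights * bytes_per_weight


million = weight_memory_bytes(1_000_000)      # 4,000,000 bytes (about 4 MB)
billion = weight_memory_bytes(1_000_000_000)  # 4,000,000,000 bytes (about 4 GB)
```

This counts weight storage only; activations and any other runtime buffers would add to the total.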
To address the problem of limited hardware capacity, most deep learning prediction today is conducted on a remote server or cloud. For example, a smart assistant (e.g., Alexa™) sends information (e.g., a voice signal) to the cloud, the deep learning prediction is performed remotely at the cloud on dedicated hardware, and a response is sent back to the local device. Hence, these endpoint devices cannot provide deep-learning-based results if they are disconnected from the cloud, if the input rate is so high that it is not feasible to continuously communicate with the cloud, or if prediction must be so fast that even today's dedicated hardware is not fast enough (e.g., deep learning for high-frequency trading).
Accordingly, there is a need in the art to increase the efficiency and processing speed of the computer performing deep learning using a neural network in training and/or prediction modes.