Artificial neural networks (ANNs) are computing systems inspired by the biological neural networks that constitute animal brains. Such systems learn to perform tasks, i.e. progressively improve their performance on them, by considering examples, generally without task-specific programming. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as “cat” or “no cat” and using the analytic results to identify cats in other images. They have found most use in applications that are difficult to express in a traditional computer algorithm using rule-based programming.
An ANN is based on a collection of connected units called artificial neurons, analogous to neurons in a biological brain. Each connection, or synapse, between neurons can transmit a signal to another neuron. The receiving, or postsynaptic, neuron can process the signals and then signal downstream neurons connected to it. Neurons may have state, generally represented by real numbers, typically between 0 and 1. Neurons and synapses may also have a weight that varies as learning proceeds, which can increase or decrease the strength of the signal sent downstream. Further, they may have a threshold such that the downstream signal is sent only if the aggregate signal crosses that level (falls below it or rises above it, depending on the design).
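The weighted aggregation and thresholding described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `neuron_output`, the "fire when at or above the threshold" convention, and the sample values are assumptions made for the example, not details taken from the text.

```python
def neuron_output(inputs, weights, threshold):
    """Return 1.0 if the weighted sum of the inputs reaches the threshold, else 0.0."""
    # Each incoming signal is scaled by the weight of its connection.
    aggregate = sum(x * w for x, w in zip(inputs, weights))
    # Assumed convention: the neuron signals downstream only at or above the threshold.
    return 1.0 if aggregate >= threshold else 0.0


# Hypothetical incoming signals and connection weights.
signals = [0.9, 0.2, 0.4]
weights = [0.5, -0.3, 0.8]

fired = neuron_output(signals, weights, threshold=0.5)
silent = neuron_output(signals, weights, threshold=1.0)
```

With these sample values the aggregate signal is about 0.71, so the neuron fires for a threshold of 0.5 and stays silent for a threshold of 1.0. Learning would adjust the weights and the threshold rather than the code itself.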
Typically, neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first, i.e. input, to the last, i.e. output, layer, possibly after traversing the layers multiple times.
The original goal of the neural network approach was to solve problems in the same way that a human brain would. Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information.
The components of an artificial neural network include (1) neurons having an activation threshold; (2) connections and weights for transferring the output of a neuron; (3) a propagation function to compute the input to a neuron from the output of predecessor neurons; and (4) a learning rule, which is an algorithm that modifies the parameters of the neural network so that a given input produces a desired outcome. This typically amounts to modifying the weights and thresholds.
Given a specific task to solve and a class of functions F, learning entails using a set of observations to find the function in F that solves the task in some optimal sense. A cost function C is defined such that, for the optimal solution, no other solution has a cost less than the cost of the optimal solution.
The cost function C is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.
A neural network can be trained using backpropagation, which is a method to calculate the gradient of the loss function with respect to the weights in an ANN.
The weight updates of backpropagation can be done via well-known stochastic gradient descent techniques. Note that the choice of the cost function depends on factors such as the learning type (e.g., supervised, unsupervised, reinforcement) and the activation function.
There are three major learning paradigms and each corresponds to a particular learning task: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning uses a set of example pairs, and the goal is to find a function in the allowed class of functions that matches the examples. A commonly used cost is the mean-squared error, i.e. the average squared error between the network's output and the target value over all example pairs, which training attempts to minimize. Minimizing this cost using gradient descent for the class of neural networks called multilayer perceptrons (MLPs) produces the backpropagation algorithm for training neural networks. Examples of supervised learning include pattern recognition, i.e. classification, and regression, i.e. function approximation.
In unsupervised learning, some data is given together with a cost function to be minimized, which can be any function of the data and the network's output. The cost function is dependent on the task (i.e. the model domain) and any a priori assumptions (i.e. the implicit properties of the model, its parameters, and the observed variables). Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression, and filtering.
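As a rough illustration of minimizing a mean-squared error by gradient descent, the sketch below fits a single linear unit y = w*x + b to a handful of example pairs. The model, the data, the learning rate, and the function names are all assumptions chosen for the example. It uses full-batch gradient descent rather than a stochastic variant, and for a single linear unit the gradient can be written directly, so no multi-layer backpropagation is involved.

```python
def mean_squared_error(w, b, pairs):
    """Average squared difference between the output w*x + b and the target t."""
    return sum((w * x + b - t) ** 2 for x, t in pairs) / len(pairs)


def gradient_step(w, b, pairs, learning_rate):
    """Move w and b one step against the gradient of the mean-squared error."""
    n = len(pairs)
    grad_w = sum(2 * (w * x + b - t) * x for x, t in pairs) / n
    grad_b = sum(2 * (w * x + b - t) for x, t in pairs) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b


# Hypothetical example pairs (input, target) generated from t = 2x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0
initial_cost = mean_squared_error(w, b, pairs)
for _ in range(2000):
    w, b = gradient_step(w, b, pairs, learning_rate=0.05)
final_cost = mean_squared_error(w, b, pairs)
```

After the loop, `w` and `b` approach 2 and 1, the values used to generate the targets, and the cost falls from its initial value toward zero.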
In reinforcement learning, data is usually not provided, but generated by an agent's interactions with the environment. At each point in time, the agent performs an action and the environment generates an observation and an instantaneous cost according to some typically unknown dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, e.g., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and nonlinear activation distinguish an MLP from a linear perceptron and allow it to distinguish data that is not linearly separable.
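The claim that an MLP can separate data that is not linearly separable can be illustrated with the XOR function, which no single linear threshold unit can compute. The sketch below is a three-layer network (two inputs, two hidden neurons, one output neuron) whose weights are chosen by hand for the illustration. They are not learned, and a step activation like the one used here is not differentiable, so backpropagation would in practice use a smooth nonlinearity instead.

```python
def step(z):
    """Nonlinear activation: 1 if the input is positive, else 0."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """Forward pass of a 2-2-1 multilayer perceptron with hand-chosen weights."""
    # Hidden layer: the first neuron behaves like OR, the second like AND.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer: active when OR is on and AND is off, which is XOR.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)


truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

Here `truth_table` maps (0, 1) and (1, 0) to 1 and the other two inputs to 0. Removing the hidden layer, or replacing `step` with the identity, would collapse the network to a single linear decision boundary.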
In machine learning, a convolutional neural network (CNN) is a class of deep, feedforward artificial neural network that has successfully been applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.
CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage. CNNs have applications in image and video recognition, recommender systems and natural language processing. A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers are either convolutional, pooling or fully connected.
Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The convolution approximates the response of an individual neuron to visual stimuli. Each convolutional neuron processes data only for its receptive field. Tiling allows CNNs to tolerate translation of the input image. The convolution operation reduces the number of free parameters and improves generalization.
Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer. For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer. This is in contrast to fully connected layers which connect every neuron in one layer to every neuron in another layer.
A feature of CNNs is that they share weights in convolutional layers, which means that the same filter (i.e. weight bank) is used for each receptive field in the layer; this reduces memory requirements and improves performance.
While traditional multilayer perceptron (MLP) models are successfully used for image recognition, due to the full connectivity between nodes, they suffer from the curse of dimensionality and thus do not scale well to higher resolution images. In addition, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart the same as pixels that are close together. Thus, full connectivity of neurons is wasteful for the purpose of image recognition.
Convolutional neural networks are biologically inspired variants of multilayer perceptrons, designed to emulate the behavior of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images. As opposed to MLPs, CNNs have the following distinguishing features:
1. 3D volumes of neurons: The layers of a CNN have neurons arranged in three dimensions: width, height and depth. The neurons inside a layer are connected to only a small region of the layer before it, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN architecture.
2. Local connectivity: Following the concept of receptive fields, CNNs exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learnt “filters” produce the strongest response to a spatially local input pattern. Stacking many such layers leads to non-linear “filters” that become increasingly “global” (i.e. responsive to a larger region of pixel space). This allows the network to first create representations of small parts of the input, then from them assemble representations of larger areas.
3. Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given convolutional layer respond to the same feature (within their specific response field). Replicating units in this way allows for features to be detected regardless of their position in the visual field, thus constituting the property of translation invariance.
Together, these properties allow CNNs to achieve better generalization on vision problems. Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network. Decreasing the memory footprint allows the training of larger, more powerful networks.
A CNN architecture is formed by a stack of distinct layers that transform the input volume into an output volume through a differentiable function.
The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a two-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
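The forward pass of one filter can be sketched as follows. For brevity the example assumes a single-channel (depth 1) input, a stride of one and no zero-padding, and it computes the sliding dot product without flipping the kernel, which is how the operation is commonly implemented in CNN software. The function name `convolve2d` and the sample image and filter are assumptions made for the illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and return the 2D activation map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    activation_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the patch it currently covers.
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        activation_map.append(row)
    return activation_map


# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

activation_map = convolve2d(image, vertical_edge)
```

The resulting 3x3 activation map is nonzero only in its middle column, where the filter straddles the edge. Every entry of the map is computed with the same four weights, which corresponds to neurons in one activation map sharing parameters.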
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account. Convolutional networks exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume. The extent of this connectivity is a hyperparameter called the receptive field of the neuron. The connections are local in space (along width and height), but always extend along the entire depth of the input volume. Such an architecture ensures that the learnt filters produce the strongest response to a spatially local input pattern.
Three hyperparameters control the size of the output volume of the convolutional layer: the depth, stride and zero-padding.
1. The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color.
2. Stride controls how depth columns around the spatial dimensions (width and height) are allocated. When the stride is one the filters are moved one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and also to large output volumes. When the stride is two (or rarely three or more) then the filters jump two pixels at a time as they slide around. The receptive fields overlap less and the resulting output volume has smaller spatial dimensions.
3. Sometimes it is convenient to pad the input with zeros on the border of the input volume. Padding provides control of the output volume spatial size. In particular, sometimes it is desirable to exactly preserve the spatial size of the input volume.
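The effect of stride and zero-padding on the spatial size of the output can be made concrete with the commonly used relation (W - F + 2P) / S + 1, where W is the input width or height, F the filter size, P the zero-padding per border and S the stride. This relation is not stated in the text above and is given here as a sketch. The function name is an assumption, and raising an error when the filter does not tile the input evenly is a design choice for the example, since many libraries round down instead.

```python
def conv_output_size(input_size, filter_size, stride=1, zero_padding=0):
    """Spatial size (width or height) of a convolutional layer's output."""
    span = input_size - filter_size + 2 * zero_padding
    if span < 0 or span % stride != 0:
        # Assumed policy for this sketch: reject settings that do not tile evenly.
        raise ValueError("filter, stride and padding do not tile the input evenly")
    return span // stride + 1


unpadded = conv_output_size(32, 5, stride=1, zero_padding=0)   # output shrinks
preserved = conv_output_size(32, 5, stride=1, zero_padding=2)  # size is preserved
strided = conv_output_size(7, 3, stride=2, zero_padding=0)     # larger stride, smaller output
```

For a 32-wide input and a 5-wide filter, a padding of 2 with stride 1 keeps the output 32 wide, which is the size-preserving case mentioned in item 3. The depth of the output volume is set separately by the number of filters.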
A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on a reasonable assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. In other words, denoting a single two-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias.
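A short calculation shows how much parameter sharing reduces the number of free parameters. The layer dimensions below (a 32x32 output with 16 depth slices, 5x5 filters over an input of depth 3) are hypothetical values chosen for the example, and the function names are assumptions.

```python
def params_without_sharing(out_width, out_height, out_depth, filter_size, in_depth):
    """Every neuron in the output volume has its own weights and bias."""
    per_neuron = filter_size * filter_size * in_depth + 1
    return out_width * out_height * out_depth * per_neuron


def params_with_sharing(out_depth, filter_size, in_depth):
    """All neurons in one depth slice share a single filter and bias."""
    per_filter = filter_size * filter_size * in_depth + 1
    return out_depth * per_filter


# Hypothetical layer: 5x5 filters over a depth-3 input, 32x32x16 output volume.
unshared = params_without_sharing(32, 32, 16, filter_size=5, in_depth=3)
shared = params_with_sharing(16, filter_size=5, in_depth=3)
reduction_factor = unshared // shared
```

With these dimensions the count drops from 1,245,184 to 1,216. The reduction factor equals the number of spatial positions in the output (32 x 32 = 1024), because each depth slice stores one set of weights instead of one set per position.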
Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the CONV layer can be computed as a convolution of the neuron's weights with the input volume (hence the name: convolutional layer). Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume.
Another feature of CNNs is pooling, which is a form of nonlinear down-sampling. Several nonlinear functions can implement pooling, among which max pooling is the most common. It partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum. The intuition is that the exact location of a feature is less important than its rough location relative to other features. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and amount of computation in the network, and hence to also control overfitting. In addition to max pooling, the pooling units can use other functions, such as average pooling or L2-norm pooling.
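Max pooling over non-overlapping square regions can be sketched as follows. The function name `max_pool`, the 2x2 region size and the sample feature map are assumptions made for the illustration, and rows or columns that do not fill a complete region are simply ignored in this sketch.

```python
def max_pool(feature_map, size=2):
    """Partition the map into non-overlapping size x size regions and keep each maximum."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]


# Hypothetical 4x4 activation map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool(feature_map)
```

The 4x4 map is reduced to a 2x2 map holding the largest activation of each quadrant, so the spatial size is halved in each dimension while the strongest response in each region is kept. Replacing `max` with an average would give average pooling.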
Finally, after several convolutional and max pooling layers, the high-level reasoning in the neural network is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular neural networks.
In essence, CNNs work under the assumption of a spatially correlated input domain, meaning that nearby pixels are correlated with one another. This permits weight compaction (i.e. using a relatively small kernel instead of a full mesh), under the assumption that locality plays a key role in the input-output correlation that is learned by the CNN.
Typically, CNN layers operate sequentially (i.e. row by row) on the input domain of each layer. This is not, however, necessarily the ideal order of implementation. Many times, valuable information in an image is focus-centered, i.e. there are one or more regions where it is desirable to place more attention and for which more elaborate evaluation is required.
There is thus a need for an ANN and particularly a CNN that recognizes and takes advantage of the fact that valuable information in an image is typically not distributed throughout the image but rather is concentrated in one or more regions.