A processing flow for a typical Convolutional Neural Network (CNN) is presented in FIG. 1. Typically, the input to the CNN is at least one 2D image/map 10 corresponding to a region of interest (ROI) from an image. The image/map(s) can comprise image intensity values only, for example, the Y plane from a YCC image; or the image/map(s) can comprise any combination of colour planes from an image; or alternatively or in addition, the image/map(s) can contain values derived from the image, such as a Histogram of Gradients (HOG) map as described in PCT Application No. PCT/EP2015/073058, the disclosure of which is incorporated by reference, or an Integral Image map.
CNN processing comprises two stages:
    Feature extraction (12), the convolutional part; and
    Feature classification (14).
CNN feature extraction 12 typically comprises a number of processing layers 1 . . . N, where:
    Each layer comprises a convolution followed by optional subsampling;
    Each layer produces one or (typically) more maps;
    The size of the maps after each convolution layer is typically reduced by subsampling;
    A first convolution layer typically performs 2D convolution of an original 2D image/map to produce its output maps, while subsequent convolution layers perform 3D convolution using the output maps produced by the previous layer as inputs. Nonetheless, if the input comprises, say, a number of maps previously derived from an image; or multiple colour planes of an image; or multiple versions of an image, then the first convolution layer can operate in exactly the same way as successive layers, performing a 3D convolution on the input images/maps.
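The reduction in map size from layer to layer can be illustrated with a minimal sketch. The text above does not specify how subsampling is performed or how map borders are handled, so the following assumes, purely for illustration, an unpadded convolution (each K×K kernel shrinks a map by K-1 in each dimension) followed by subsampling that keeps the maximum of each 2×2 block. The function names `subsample_max` and `map_size_after_layers` are hypothetical.

```python
def subsample_max(m, factor=2):
    """Subsample a 2D map by keeping the maximum of each factor x factor block.

    Max-based subsampling is an assumption; the text only says 'subsampling'.
    """
    rows = len(m) // factor
    cols = len(m[0]) // factor
    return [
        [
            max(
                m[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def map_size_after_layers(height, width, n_layers, kernel=3, factor=2):
    """Trace the (height, width) of the maps after each of n_layers layers.

    Assumes an unpadded kernel x kernel convolution followed by subsampling
    by 'factor' in every layer.
    """
    sizes = []
    for _ in range(n_layers):
        height = (height - kernel + 1) // factor
        width = (width - kernel + 1) // factor
        sizes.append((height, width))
    return sizes
```

Under these assumptions, a 32×32 input ROI would shrink to 15×15 maps after the first layer and 6×6 maps after the second.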
FIG. 2 shows an example 3D convolution with a 3×3×3 kernel performed by a subsequent feature extraction convolution layer of FIG. 1. The 3×3×3 kernel size means that three input maps A, B, C are used, so that a 3×3 block of pixels from each input map is needed in order to calculate one pixel within an output map.
The convolution kernel likewise has 3×3×3=27 values, or weights, pre-calculated during a training phase of the CNN. The cube 16 of input map pixel values is combined with the convolution kernel values 18 using a dot product function 20. After the dot product is calculated, an activation function 22 is applied to provide the output pixel value. The activation function 22 can comprise a simple division, as normally done for convolution, or a more complex function such as a sigmoid function or a rectified linear unit (ReLU) activation function of the form $z_j = h(a_j) = \max(0, a_j)$, as typically used in neural networks.
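The calculation of a single output pixel described above can be sketched as follows. This is an illustrative sketch only: the function names `relu` and `conv3d_pixel` and the nested-list representation of maps and kernel planes are assumptions, not part of the described system.

```python
def relu(a):
    """Rectified linear unit: z = max(0, a)."""
    return max(0.0, a)


def conv3d_pixel(maps, kernel, row, col, activation=relu):
    """Compute one output pixel from a 3x3 window taken from each input map.

    maps:   list of 2D maps (e.g. [A, B, C]), each a list of rows.
    kernel: list of 3x3 weight planes, one plane per input map
            (27 weights in total for three input maps).
    (row, col) is the top-left corner of the 3x3 window in every map.
    """
    acc = 0.0
    # Dot product between the cube of input pixels and the kernel weights.
    for input_map, plane in zip(maps, kernel):
        for dr in range(3):
            for dc in range(3):
                acc += input_map[row + dr][col + dc] * plane[dr][dc]
    # The activation function turns the dot product into the output pixel.
    return activation(acc)
```

Passing a different callable as `activation`, for example a division by a constant or a sigmoid, would model the other activation functions mentioned above.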
In this case, for 2D convolution, where a single input image/map is being used, the input image/map would be scanned with a 3×3 kernel to produce the pixels of a corresponding output map.
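The 2D case can be sketched in the same way. The sketch below assumes, for illustration, that the kernel is only placed where it fits entirely within the input map (so the output map is smaller than the input) and that a ReLU activation is applied; neither choice is specified above, and the name `conv2d` is hypothetical.

```python
def conv2d(image, kernel):
    """Scan a single 2D image/map with a square kernel to produce an output map.

    Assumes no padding: the kernel is only applied where it fully overlaps
    the input, so a k x k kernel shrinks each dimension by k - 1.
    A ReLU activation is applied to each output pixel (an assumption).
    """
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        out_row = []
        for c in range(out_cols):
            # Dot product of the k x k window with the kernel weights.
            acc = sum(
                image[r + dr][c + dc] * kernel[dr][dc]
                for dr in range(k)
                for dc in range(k)
            )
            out_row.append(max(0, acc))
        output.append(out_row)
    return output
```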
Referring back to FIG. 1, the layers involved in CNN feature classification 14 are typically as follows:
    The maps produced by the last convolutional layer are concatenated in a single vector (Vinput);
    Vinput is the input to a multi-layer fully connected network comprising a sequence of fully connected network layers, each processing a vector input and providing a vector output;
    The output of the fully connected network comprises a vector of classification scores or a feature vector representative of the input image/map(s) in accordance with the CNN training.
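The classification stage above can be sketched as a concatenation followed by a chain of vector-to-vector layers. This is a minimal sketch under stated assumptions: each fully connected layer is modelled as a weight matrix plus a bias vector, a ReLU is applied after every layer except the last, and the names `concatenate_maps`, `fully_connected` and `classify` are hypothetical. None of these details is fixed by the description above.

```python
def concatenate_maps(maps):
    """Concatenate all 2D maps, row by row, into a single vector (Vinput)."""
    return [value for m in maps for row in m for value in row]


def fully_connected(vec, weights, biases, apply_relu=True):
    """One fully connected layer: vector in, vector out.

    weights holds one row of coefficients per output element; the bias term
    and the ReLU activation are illustrative assumptions.
    """
    out = []
    for row, b in zip(weights, biases):
        a = sum(w * x for w, x in zip(row, vec)) + b
        out.append(max(0.0, a) if apply_relu else a)
    return out


def classify(maps, layers):
    """Run the concatenated maps through a sequence of (weights, biases) layers.

    The last layer is left without activation so that its output can be read
    as raw classification scores or as a feature vector.
    """
    v = concatenate_maps(maps)
    for i, (weights, biases) in enumerate(layers):
        is_last = i == len(layers) - 1
        v = fully_connected(v, weights, biases, apply_relu=not is_last)
    return v
```

For instance, two 2×2 maps yield an 8-element Vinput, and a pair of layers with 2 outputs each would reduce it to a 2-element score vector.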
The CNN is trained to classify the input ROI into one or more classes. For example, for a ROI potentially containing a face, a CNN might be used to determine whether the face belongs to an adult or a child, or whether the face is smiling, blinking or frowning. For a ROI potentially containing a body, the CNN might be used to determine a pose for the body.
Once the structure of the CNN is determined, i.e. the input maps; the number of convolution layers; the number of output maps; the size of the convolution kernels; the degree of sub-sampling; the number of fully connected layers; and the extent of their vectors, the weights to be used within the convolution layer kernels and the fully connected layers used for feature classification are determined by training against a sample data set containing positive and negative labelled instances of a given class, for example, regions of interest containing faces labelled as smiling and regions of interest containing non-smiling faces. Suitable platforms for facilitating the training of a CNN include: PyLearn, which is based on Theano; MatConvNet, which is in turn based on Caffe; Torch; or TensorFlow. It will nonetheless be appreciated that the structure chosen for training may need to be iteratively adjusted to optimize the classification provided by the CNN.
In any case, it would be useful to incorporate a CNN engine within an image processing system so that feature classification might be performed on the fly as images are acquired or at least soon afterwards. For example, a CNN might be incorporated within an image acquisition system such as described in U.S. Provisional Application No. 62/210,243 filed 26 Aug. 2015, PCT Application WO2014/005783 and US2015/262344, the disclosures of which are incorporated by reference.
However, in order to do so, the responsiveness and memory requirements for the CNN need to be rationalized.