A processing flow for a typical Convolutional Neural Network (CNN) is presented in FIG. 1. Typically, the input to the CNN is at least one 2D image/map 10 corresponding to a region of interest (ROI) from an image. The image/map(s) can comprise image intensity values only, for example, the Y plane from a YCC image; or the image/map(s) can comprise any combination of colour planes from an image; or alternatively or in addition, the image/map(s) can contain values derived from the image, such as an Integral Image map or a Histogram of Gradients (HOG) map as described in PCT Application No. PCT/EP2015/073058 (Ref: FN-398), the disclosure of which is incorporated by reference.
CNN processing comprises two stages:
    Feature extraction (12), the convolutional part; and
    Feature classification (14).
CNN feature extraction 12 typically comprises a number of processing layers 1 . . . N, where:
    Each layer comprises a convolution followed by optional subsampling (pooling);
    Each layer produces one or (typically) more maps (sometimes referred to as channels);
    The size of the maps after each convolution layer is typically reduced by subsampling (examples of which are average pooling or max-pooling);
    A first convolution layer typically performs 2D convolution of an original 2D image/map to produce its output maps, while subsequent convolution layers perform 3D convolution using the output maps produced by the previous layer as inputs. Nonetheless, if the input comprises, say, a number of maps previously derived from an image; or multiple colour planes, for example, RGB or YCC for an image; or multiple versions of an image, then the first convolution layer can operate in exactly the same way as successive layers, performing a 3D convolution on the input images/maps.
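The subsampling step mentioned above can be illustrated with a minimal sketch. The function name pool2x2, the 2×2 window and the stride of 2 are assumptions chosen for illustration only; the description does not fix a window size or stride.

```python
def pool2x2(fmap, mode="max"):
    """Subsample a 2D map (list of equal-length rows) with a 2x2 window.

    Assumed for illustration: non-overlapping 2x2 windows (stride 2);
    a trailing odd row or column is dropped.
    """
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window = [fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1]]
            if mode == "max":
                out_row.append(max(window))        # max-pooling
            else:
                out_row.append(sum(window) / 4.0)  # average pooling
        out.append(out_row)
    return out


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled_max = pool2x2(fmap, "max")  # 4x4 map reduced to 2x2
pooled_avg = pool2x2(fmap, "avg")
```

Under these assumptions each pooling pass halves both dimensions of a map, which is one way the map size can shrink from layer to layer.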
FIG. 2 shows an example 3D convolution with a 3×3×3 kernel performed by a subsequent feature extraction convolution layer of FIG. 1. The 3×3×3 kernel size means that three input maps A, B, C are used, so a 3×3 block of pixels from each input map is needed in order to calculate one pixel within an output map.
A convolution kernel also has 3×3×3=27 values or weights pre-calculated during a training phase of the CNN. The cube 16 of input map pixel values is combined with the convolution kernel values 18 using a dot product function 20. After the dot product is calculated, an activation function 22 is applied to provide the output pixel value. The activation function 22 can comprise a simple division, as normally done for convolution, or a more complex function such as a sigmoid function or a rectified linear unit (ReLU) activation function of the form: y_j = h(x_j) = max(0, x_j), as typically used in neural networks.
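By way of illustration only, the calculation of a single output pixel can be sketched as follows. The names relu and output_pixel and the nested-list layout (3 maps × 3 rows × 3 columns) are assumptions and do not describe any particular hardware implementation.

```python
def relu(x):
    """Rectified linear unit: h(x) = max(0, x)."""
    return max(0.0, x)


def output_pixel(cube, kernel, activation=relu):
    """Combine a 3x3x3 cube of input pixels with 27 kernel weights.

    cube and kernel are nested lists indexed [map][row][col]
    (an assumed layout). The dot product is accumulated first and
    the activation function is applied to the result.
    """
    acc = 0.0
    for m in range(3):          # input maps A, B, C
        for r in range(3):      # rows of the 3x3 block
            for c in range(3):  # columns of the 3x3 block
                acc += cube[m][r][c] * kernel[m][r][c]
    return activation(acc)


ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
neg = [[[-1.0] * 3 for _ in range(3)] for _ in range(3)]
positive_pixel = output_pixel(ones, ones)  # dot product is 27.0, ReLU keeps it
clamped_pixel = output_pixel(ones, neg)    # dot product is -27.0, ReLU gives 0.0
```

A different activation, for example a sigmoid or a simple division, could be passed through the activation parameter without changing the dot product step.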
For 2D convolution, by contrast, where only a single input image/map is being used, the input image/map would be scanned with a 3×3 kernel to produce the pixels of a corresponding output map.
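The scanning of a single input map with a 3×3 kernel can be sketched as below. This is a minimal sketch under stated assumptions: the function name convolve2d_3x3 is hypothetical, no padding is applied (so the output map is two pixels smaller in each dimension), the kernel is not flipped, as is common practice in CNNs, and no activation function is applied.

```python
def convolve2d_3x3(image, kernel):
    """Scan a 2D map with a 3x3 kernel, one output pixel per position.

    image: list of equal-length rows; kernel: 3x3 nested list.
    Assumed: no padding and a step of one pixel between positions.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += image[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
box = [[1, 1, 1]] * 3                # every weight equal to 1
out_map = convolve2d_3x3(image, box)  # 4x4 input gives a 2x2 output map
```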
Within a CNN Engine such as disclosed in PCT Application No. PCT/EP2016/081776 (Ref: FN-481-PCT), a processor needs to efficiently implement the logic required to perform the processing of different layers such as convolution layers and pooling layers.