Convolutional Neural Networks (CNNs) are known in the art. Such networks are typically employed for object detection and classification in images. A CNN is typically constructed of one or more layers. At each layer, an operation is performed. Typically, this operation is one of a convolution operation and the application of an activation function. This operation may further include pooling, also referred to as down-sampling.
For each layer, a respective set of meta-parameters is defined. These meta-parameters include the number of filters employed, the size of the filters, the stride of the convolution, the down-sampling ratio, the down-sampling size, the stride thereof, the activation function employed and the like.
Reference is now made to FIG. 1, which is a schematic illustration of a CNN, generally referenced 10, which is known in the art. CNN 10 is employed for detecting features in an image such as image 16. CNN 10 includes a plurality of layers 121, 122, . . . , 12N and a classifier 14. An input image 16 is supplied to layer 121. Layer 121 at least convolves image 16 with the respective filters thereof and applies an activation function to each of the outputs of the filters. Layer 121 provides the output thereof to layer 122, which performs the respective operations thereof with the respective filters. This process repeats until the output of layer 12N is provided to classifier 14. The output of layer 12N is a map of features corresponding to the filters employed in CNN 10. This feature map relates to the probability that a feature is present in input image 16 within respective image windows associated with the feature map. The feature map at the output of layer 12N can be embodied as a plurality of matrices, each corresponding to a feature, where the value of each entry in each matrix represents the probability that input image 16 includes the feature associated with that matrix, in a specific image window (i.e., a bounding box) associated with the entry location in the matrix (i.e., the indices of the entry). The size of the image window is determined according to the number of layers in CNN 10, the size of the kernels and the stride of the kernels during the convolution operation.
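By way of a non-limiting illustration, the dependence of the image window size on the number of layers, the kernel sizes and the strides can be sketched as follows. The sketch assumes square kernels, no padding and no dilation, none of which is specified above, and the function name image_window is hypothetical.

```python
def image_window(layers):
    """Return (window_size, window_step), in input pixels, for one output entry.

    `layers` is a list of (kernel_size, stride) pairs, one per convolution or
    down-sampling step, ordered from the input image to the last layer.
    Assumes square kernels, no padding and no dilation.
    """
    size = 1   # extent of the input covered by a single output entry
    step = 1   # distance, in input pixels, between adjacent output entries
    for kernel_size, stride in layers:
        # Each step widens the window by (kernel_size - 1) entries of the
        # previous layer, and each such entry is `step` input pixels apart.
        size += (kernel_size - 1) * step
        step *= stride
    return size, step
```

For example, image_window([(3, 1), (2, 2), (3, 1)]) returns (8, 2): each entry of the final feature map is associated with an 8 by 8 pixel window, and adjacent entries are associated with windows offset by 2 pixels.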
Classifier 14 may be any type of classifier known in the art (e.g., a Random Forest classifier, a Support Vector Machine (SVM) classifier, a convolutional classifier and the like). Classifier 14 classifies the objects which CNN 10 was trained to detect. Classifier 14 may provide, for each image window, a respective detection confidence level that an object is located in that image window, as well as classification information. In general, the output of classifier 14 is a vector or vectors of values relating to the detection and classification of the object in a corresponding image window. This vector or these vectors of values are referred to herein as a ‘classification vector’.
Reference is now made to FIG. 2, which is a schematic illustration of an exemplary CNN, generally referenced 50, which is known in the art. CNN 50 includes two layers, a first layer 511 and a second layer 512. First layer 511 receives image 52 as input thereto. In first layer 511, a convolution operation is performed, while in second layer 512 an activation function is applied to the results of the convolution. Image 52 includes a matrix of pixels, where each pixel is associated with a respective value (e.g., a grey level value) or values (e.g., color values). Image 52 may represent a scene which includes objects (e.g., a person walking in the street, a dog playing in a park, a vehicle in a street and the like).
In first layer 511, image 52 is convolved with each one of filters 541 and 542. Filters 541 and 542 are also referred to as convolution kernels or just kernels. Accordingly, each one of filters 541 and 542 is shifted over selected positions in the image. At each selected position, the pixel values overlapping with the filter are multiplied by the respective weights of the filter and the results of these multiplications are summed (i.e., a multiply and sum operation). Generally, the selected positions are defined by shifting the filter over the image by a predetermined step size referred to as ‘stride’. Each one of filters 541 and 542 corresponds to a feature to be identified in the image. The sizes of the filters as well as the stride are design parameters selected by the CNN designer. Convolving image 52 with each one of filters 541 and 542 produces a feature map which includes two feature images or matrices, feature image 561 and feature image 562, respectively corresponding to filters 541 and 542 (i.e., a respective image is produced for each filter). Each pixel or entry in a feature image corresponds to the result of one multiply and sum operation. Thus, each one of matrices 561 and 562 is associated with a respective image feature corresponding to the respective one of filters 541 and 542. Also, each entry is associated with a respective image window with respect to input image 52. Accordingly, the value of each entry in each one of matrices 561 and 562 represents the intensity of the feature associated therewith, within the image window associated with the entry. It is noted that the size (i.e., the number of pixels) of feature images 561 and 562 may be smaller than the size of image 52. The output of first layer 511 is provided to second layer 512. In second layer 512, each value in each of feature images 561 and 562 is then applied as an input to an activation function 58 (e.g., sigmoid, Gaussian, hyperbolic tangent (tanh) and the like).
The output of layer 512 is then provided to classifier 60 which detects and classifies objects in image 52 and produces a classification vector for each entry in the feature map.
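By way of a non-limiting illustration, the multiply and sum operation of the first layer and the activation of the second layer can be sketched as follows. The sketch assumes a single-channel image held as a list of rows, no padding, and a sigmoid activation; the function names convolve, sigmoid and apply_activation are hypothetical.

```python
import math

def convolve(image, kernel, stride=1):
    """Shift `kernel` over `image` by `stride` and multiply and sum at each position.

    As described in the text, the kernel weights are applied to the overlapping
    pixels directly (the kernel is not flipped). No padding is applied, so the
    feature image is smaller than the input image.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_image = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride  # corner of the image window
            total = 0.0
            for u in range(kh):
                for v in range(kw):
                    total += image[top + u][left + v] * kernel[u][v]
            row.append(total)
        feature_image.append(row)
    return feature_image

def sigmoid(x):
    """Sigmoid activation, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def apply_activation(feature_image, activation=sigmoid):
    """Apply the activation function to every entry of a feature image."""
    return [[activation(value) for value in row] for row in feature_image]
```

Calling convolve once per filter yields one feature image per filter, and apply_activation is then applied to each feature image in turn. Entry (i, j) of a feature image is associated with the image window whose top-left corner is at pixel (i * stride, j * stride).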
Prior to detecting and classifying objects in an image, the weights of the various filters and the parameters of the functions employed by a CNN such as CNN 10 (FIG. 1) or CNN 50 (FIG. 2) need to be determined. These weights and parameters are determined in a training process. The initial weights and parameters of the CNN (i.e., before training is commenced) are determined arbitrarily (e.g., randomly). During training, a training image or images, in which the objects have been detected and classified, are provided as the input to the CNN. In other words, images with a respective pre-determined classification vector for each image window are provided as an input to the CNN. The layers of the CNN are applied to each training image and the classification vectors, respective of each training image, are determined (i.e., the objects therein are detected and classified). These classification vectors are compared with the pre-determined classification vectors. The error (e.g., the sum of squared differences, log loss, softmax log loss) between the classification vectors of the CNN and the pre-determined classification vectors is determined. This error is then employed to update the weights and parameters of the CNN in a backpropagation process, which may include one or more iterations.
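By way of a non-limiting illustration, the training process described above can be sketched for a deliberately reduced case: a single filter with no activation function, trained with a sum of squared differences error. In this reduced case backpropagation amounts to a single gradient computation per iteration. The function names, the learning rate and the plain gradient descent update are assumptions made for the sketch only.

```python
import random

def filter_output(weights, window):
    """Multiply and sum: the response of one filter to one flattened image window."""
    return sum(w * x for w, x in zip(weights, window))

def squared_error(weights, windows, targets):
    """Sum of squared differences between filter outputs and pre-determined values."""
    return sum((filter_output(weights, win) - t) ** 2
               for win, t in zip(windows, targets))

def train_filter(windows, targets, learning_rate=0.1, iterations=200, seed=0):
    """Determine filter weights from labelled image windows by gradient descent."""
    rng = random.Random(seed)
    n = len(windows[0])
    # Initial weights are determined arbitrarily (here, randomly).
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n)]
    for _ in range(iterations):
        gradient = [0.0] * n
        for window, target in zip(windows, targets):
            diff = filter_output(weights, window) - target
            for k in range(n):
                # Derivative of the squared difference with respect to weight k.
                gradient[k] += 2.0 * diff * window[k]
        # Update the weights against the gradient of the error.
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights
```

In a CNN having several layers, the same error is propagated backwards through each layer by the chain rule, so that the weights of every filter receive a corresponding update.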
The publication “A Convolutional Neural Network Cascade for Face Detection” to Li et al. is directed to a CNN which includes three pairs of networks. Each pair contains a classification (detection) network and a bounding box regression network. During detection, an image pyramid is generated to allow multi-scale scanning of the image. Then, a first classification network (DET12) is employed to scan all the windows in the image and filter out those exhibiting low confidence. A first bounding box regression network (CLB12) is employed to correct the location of all remaining windows. Non-maximal suppression is then applied to remove windows with high overlap. In the next stage, a second classification network (DET24) is employed to filter the remaining windows, followed by a second bounding box regression network (CLB24) that performs bounding box regression. Finally, a third classification network (DET48) is employed, followed by a third bounding box regression network (CLB48).
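By way of a non-limiting illustration, the confidence filtering and non-maximal suppression steps of one such stage can be sketched as follows. The sketch assumes windows given as (x1, y1, x2, y2) corners, overlap measured as intersection over union, and a greedy suppression order; the thresholds and function names are hypothetical and are not taken from the cited publication.

```python
def overlap(a, b):
    """Intersection over union of two windows given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_low_confidence(windows, scores, confidence_threshold):
    """Keep the indices of windows whose confidence reaches the threshold."""
    return [i for i in range(len(windows)) if scores[i] >= confidence_threshold]

def non_maximal_suppression(windows, scores, candidates, overlap_threshold=0.5):
    """Greedily keep the most confident windows and drop highly overlapping ones."""
    order = sorted(candidates, key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep window i only if it does not overlap too much with a kept window.
        if all(overlap(windows[i], windows[k]) <= overlap_threshold for k in kept):
            kept.append(i)
    return kept
```

In a cascade, the indices returned by non_maximal_suppression would identify the windows passed on to the classification network of the next stage.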