Deep Convolutional Neural Networks (Deep CNNs) are at the core of the remarkable development in the field of deep learning. Although CNNs were already used to solve character recognition problems in the 1990s, they have become widespread in machine learning only recently. For example, in 2012, a CNN significantly outperformed its competitors and won the ImageNet Large Scale Visual Recognition Challenge, an annual software contest. Since then, the CNN has become a very useful tool in the field of machine learning.
Meanwhile, batch normalization is a method of normalizing feature maps by using the average and the variance of each channel, computed over each mini-batch of training images. Batch normalization is used to prevent internal covariate shift, an undesirable phenomenon in which the distributions of inputs change from layer to layer in the neural network. Advantages of batch normalization include (i) accelerating training, (ii) training all the weights in the neural network evenly, and (iii) augmenting features.
Conventional batch normalization is explained below with reference to FIG. 4.
Referring to FIG. 4, if there are m batches including input images, the conventional batch normalization is performed by (i) instructing a convolutional layer of each convolutional block to generate feature maps for a k-th batch by applying convolution operations to the input images included in the k-th batch, (ii) instructing a batch normalization layer of each convolutional block to generate one or more averages and one or more variances of the feature maps for the k-th batch, and (iii) instructing the batch normalization layer of each convolutional block to normalize the feature maps for the k-th batch so that the averages and the variances become a specific first value and a specific second value, respectively. That is, the batch normalization for a specific batch is performed by using only the averages and the variances of the feature maps for that specific batch. Herein, k is an integer from 1 to m.
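Steps (ii) and (iii) above can be illustrated with a minimal NumPy sketch. It is not taken from FIG. 4. It assumes feature maps stored as an array of shape (N, C, H, W), and the names `batch_norm`, `target_mean`, `target_var`, and `eps` are hypothetical, standing in for the specific first value, the specific second value, and a small constant for numerical stability.

```python
import numpy as np

def batch_norm(feature_maps, target_mean=0.0, target_var=1.0, eps=1e-5):
    """Normalize feature maps of shape (N, C, H, W) channel by channel.

    Only the statistics of the batch passed in are used, mirroring the
    conventional scheme. All parameter names here are illustrative.
    """
    # Step (ii): one average and one variance per channel, computed over
    # the batch dimension and both spatial dimensions.
    mean = feature_maps.mean(axis=(0, 2, 3), keepdims=True)
    var = feature_maps.var(axis=(0, 2, 3), keepdims=True)

    # Step (iii): shift and scale so each channel ends up with roughly
    # target_mean and target_var.
    normalized = (feature_maps - mean) / np.sqrt(var + eps)
    return np.sqrt(target_var) * normalized + target_mean

# Hypothetical feature maps for the k-th batch: 8 images, 4 channels, 16x16.
rng = np.random.default_rng(0)
feature_maps_k = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 16, 16))
normalized_k = batch_norm(feature_maps_k)
```

With the default arguments, each channel of `normalized_k` has an average close to 0 and a variance close to 1, regardless of the statistics of any other batch.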
If the size of each batch is set too small, it cannot be assumed that the values of the feature maps for each batch follow a normal distribution. As a result, the performance of the conventional batch normalization degrades. This is a critical shortcoming of the conventional batch normalization.
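One related effect of small batches can be sketched numerically: the per-batch average becomes a much noisier estimate as the batch shrinks. The example below is an illustration under simplifying assumptions rather than a reproduction of any measurement. It draws one synthetic activation value per image from a standard normal distribution, and the function name `batch_mean_spread` and its parameters are hypothetical.

```python
import numpy as np

def batch_mean_spread(batch_size, num_batches=200, seed=0):
    """Return the standard deviation of per-batch averages.

    Each row of `activations` is one synthetic batch with a single
    activation value per image (an assumption made for simplicity).
    """
    rng = np.random.default_rng(seed)
    activations = rng.normal(loc=0.0, scale=1.0, size=(num_batches, batch_size))
    batch_means = activations.mean(axis=1)  # one average per batch
    return batch_means.std()

spread_small = batch_mean_spread(batch_size=2)
spread_large = batch_mean_spread(batch_size=64)
print(f"spread of batch averages, size 2:  {spread_small:.3f}")
print(f"spread of batch averages, size 64: {spread_large:.3f}")
```

The averages computed from batches of size 2 fluctuate far more from batch to batch than those computed from batches of size 64, so normalizing with only per-batch statistics becomes less reliable as the batch size decreases.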