Deep Convolution Neural Networks, or Deep CNN, is the core of the remarkable development in the field of Deep Learning. Though CNN was already employed to solve character recognition problems in 1990s, it is not until recently that CNN has become widespread in Machine Learning. Due to the recent researches, Convolution Neural Networks (CNN) have been a very useful and powerful tool in the field of Machine Learning. For example, in 2012, Deep CNN significantly outperformed its competitors in an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, and won the contest.
As a result, a new trend to adapt Deep Learning technologies for image segmentation has been emerged. For a reference, image segmentation may include processes of partitioning an input image, e.g., a training image or a test image, into multiple semantic segments and determining a set of the semantic segments with clear boundaries such that the semantic segments collectively covering the entire input image. A result of the image segmentation is so-called a label image.
FIG. 1 is a drawing schematically illustrating a process of learning for image segmentation using CNN according to a prior art. Referring to FIG. 1, feature maps corresponding to an input image, i.e. a training image, are acquired by applying convolution operations multiple times to the input image through a plurality of convolutional filters. Then, a label image corresponding to the input image is obtained by applying deconvolution operations multiple times to an ultimate output from the convolutional layers, through a plurality of deconvolutional filters.
In detail, a configuration of CNN that encodes the input image by the convolution operations and decodes the feature map by the deconvolution operations to obtain the label image is named as an encoding-decoding network, i.e. U-Net. During an encoding process, a size of an output of each convolutional filter is reduced to a half of the size of an input thereof whereas number of channels of the output is increased as twice as that of the input thereof whenever a convolution operation is applied. This is to reduce an amount of computations by scaling down the size of the input image or that of its corresponding feature maps, and thus extracting complex patterns through the increased number of channels while taking an advantage of the reduced amount of computations. In general, passing through respective convolution filters causes the size of the input image or that of its corresponding feature maps to be scaled down by a ratio of 1/2 and the number of channels thereof to be doubled.
Moreover, the downsized feature maps remove much of its high-frequency regions and retain information with respect to its low-frequency regions which represent semantic and detailed parts of the input image, e.g. sky, roads, architectures, and cars etc. Such meaningful parts of the input image are used to infer the label image by performing the deconvolution operations during the decoding process.
Further, for a learning process of CNN adopting Deep Learning, a loss that is a difference between Ground Truth (GT) label image and the label image predicted from the training image is computed. And during a backpropagation process, the computed loss is relayed in a reverse direction, which is a direction opposite to that of generating the label image. However, there is a problem in that values of the loss becomes smaller and smaller as it is propagated back in the reverse direction and becomes too small to adjust parameters of each filter on the U-Net.
Thus, the inventor of the present invention proposes a novel approach, which can solve the above-mentioned problem.