Deep Convolutional Neural Networks, or Deep CNNs, are at the core of the remarkable development in the field of Deep Learning. Although CNNs were already employed to solve character recognition problems in the 1990s, it is only recently that they have become widespread in Machine Learning. Thanks to recent research, CNNs have become a very useful and powerful tool in the field of Machine Learning. For example, in 2012, a Deep CNN significantly outperformed its competitors in an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, and won the contest.
As a result, a new trend of adapting Deep Learning technologies for image segmentation has emerged. For reference, image segmentation may include processes of partitioning an input image, e.g. a training image or a test image, into multiple semantic segments and producing a set of the semantic segments with clear boundaries such that the semantic segments collectively cover the entire input image. A result of the image segmentation is a so-called label image.
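As a minimal sketch of what a label image may look like, the following example stores one segment identifier per pixel and checks that the resulting segments are disjoint and collectively cover the entire image. The class names, the identifiers, and the helper functions `segments_from_labels` and `is_partition` are hypothetical and are introduced only for illustration.

```python
# Hypothetical 4x6 label image: each entry is the segment id of one pixel.
# The ids and class names are illustrative assumptions, not from the source.
SKY, ROAD, CAR = 0, 1, 2

label_image = [
    [SKY,  SKY,  SKY,  SKY,  SKY,  SKY],
    [SKY,  SKY,  SKY,  SKY,  SKY,  SKY],
    [ROAD, ROAD, CAR,  CAR,  ROAD, ROAD],
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
]


def segments_from_labels(labels):
    """Group pixel coordinates (row, column) by segment id."""
    segments = {}
    for r, row in enumerate(labels):
        for c, seg_id in enumerate(row):
            segments.setdefault(seg_id, set()).add((r, c))
    return segments


def is_partition(segments, height, width):
    """Return True if the segments are disjoint and cover every pixel."""
    all_pixels = {(r, c) for r in range(height) for c in range(width)}
    covered = set()
    total = 0
    for pixels in segments.values():
        covered |= pixels
        total += len(pixels)
    # Equal totals rule out overlap; equal sets rule out gaps.
    return covered == all_pixels and total == len(all_pixels)


segments = segments_from_labels(label_image)
```

Because every pixel carries exactly one identifier, a label image stored in this way forms a partition of the input image by construction.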
FIG. 1 is a drawing illustrating a learning process of a CNN capable of performing image segmentation according to the prior art.
Referring to FIG. 1, feature maps corresponding to an input image, i.e. a training image, are acquired by applying convolution operations multiple times to the input image through a plurality of filters, i.e. convolutional filters, in an encoding layer. Then, a label image corresponding to the input image is obtained by applying deconvolution operations multiple times to a specific feature map, i.e., an ultimate output from the encoding layer.
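The encode-then-decode flow described above may be sketched in one dimension as follows. This is a simplified illustration under stated assumptions, not the network of FIG. 1: the deconvolution operation is assumed to be a transposed convolution, the stride of 2 and the fixed kernel values are hypothetical (in a real CNN the filter values are learned), and the function names `conv1d_stride2` and `deconv1d_stride2` are introduced only for this example.

```python
def conv1d_stride2(signal, kernel):
    """1D convolution (cross-correlation, no padding) with stride 2."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, 2)
    ]


def deconv1d_stride2(signal, kernel):
    """1D transposed convolution with stride 2.

    Each input value scatters a scaled copy of the kernel into the output.
    """
    k = len(kernel)
    out = [0.0] * ((len(signal) - 1) * 2 + k)
    for i, value in enumerate(signal):
        for j in range(k):
            out[2 * i + j] += value * kernel[j]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
encode_kernel = [0.5, 0.5]   # assumed fixed filter; learned in practice
decode_kernel = [1.0, 1.0]   # assumed fixed filter; learned in practice

# Encoding: convolution operations applied multiple times (8 -> 4 -> 2).
feature_1 = conv1d_stride2(x, encode_kernel)
feature_2 = conv1d_stride2(feature_1, encode_kernel)

# Decoding: deconvolution operations applied multiple times (2 -> 4 -> 8),
# starting from the ultimate output of the encoding stage.
decoded_1 = deconv1d_stride2(feature_2, decode_kernel)
decoded_2 = deconv1d_stride2(decoded_1, decode_kernel)
```

In this sketch the decoded output has the same length as the input, which mirrors how the label image corresponds position by position to the input image.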
In detail, a configuration of a CNN that encodes the input image by the convolution operations to obtain its corresponding feature maps and decodes the ultimate output from the encoding layer to obtain the label image is called an encoding-decoding network, e.g. U-Net. During the encoding process, the size of the input image or the sizes of its corresponding feature maps may be reduced by half, whereas the number of channels of the input image or of its corresponding feature maps may be increased, whenever a convolution operation is performed. This reduces the amount of computation by scaling down the size of the input image or its corresponding feature maps, and allows complex patterns to be extracted through the increased number of channels.
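The change in size and channel count during encoding may be traced with a short sketch. The concrete numbers are assumptions made only for illustration: the text states that the size may be halved and the number of channels may be increased, whereas the doubling of channels, the 256x256 input, and the 64 output channels of the first convolution are hypothetical choices, as are the helper names `encoder_shapes` and `num_values`.

```python
def encoder_shapes(height, width, in_channels, num_convs, first_out_channels=64):
    """Trace (height, width, channels) through the encoding process.

    Assumes each convolution halves the height and width, and that the
    channel count doubles from one convolution to the next.
    """
    shapes = [(height, width, in_channels)]
    h, w, c = height, width, first_out_channels
    for _ in range(num_convs):
        h, w = h // 2, w // 2
        shapes.append((h, w, c))
        c *= 2
    return shapes


def num_values(shape):
    """Number of values stored in a feature map of the given shape."""
    h, w, c = shape
    return h * w * c


shapes = encoder_shapes(256, 256, 3, num_convs=4)
for shape in shapes:
    print(shape, num_values(shape))
```

Under these assumptions the spatial area shrinks by a factor of four at each step while the channel count grows by a factor of two, so the number of values per feature map halves from one convolution to the next.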
The downsized feature maps lose much of their high-frequency regions but retain information on their low-frequency regions, which represent semantically meaningful parts of the input image, e.g. sky, roads, buildings, and cars. Such meaningful parts of the input image are used to infer the label image by performing the deconvolution operations during the decoding process.
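The loss of high-frequency content under downsizing may be illustrated with a one-dimensional sketch. Average pooling is used here as a stand-in for the downsizing step, which is an assumption of this example; the signal values and the function name `average_pool2` are likewise hypothetical.

```python
def average_pool2(signal):
    """Halve the length of a signal by averaging non-overlapping pairs."""
    return [
        (signal[i] + signal[i + 1]) / 2
        for i in range(0, len(signal) - 1, 2)
    ]


# Low-frequency part: a single step between two large flat regions.
low = [0.0] * 4 + [10.0] * 4
# High-frequency part: a value that alternates at every sample.
high = [1.0 if i % 2 == 0 else -1.0 for i in range(8)]

signal = [l + h for l, h in zip(low, high)]
pooled = average_pool2(signal)
```

In this example the alternating component cancels within each averaged pair, while the step between the two flat regions survives in the downsized signal.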
Recently, efforts have been made to improve the performance of the image segmentation processes using the U-Net.
Accordingly, the applicant of the present invention intends to disclose a new method for allowing the information on the feature maps obtained from the encoding layer to be used in the decoding process, so as to increase the performance of the image segmentation.