Deep Convolutional Neural Networks, or Deep CNNs, are at the core of the remarkable development in the field of Deep Learning. Although CNNs were already employed to solve character recognition problems in the 1990s, it is only recently that they have become widespread in Machine Learning. Owing to recent research, the CNN has become a very useful and powerful tool in the field of Machine Learning. For example, in 2012, a CNN significantly outperformed its competitors in an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, and won the contest.
FIG. 1 is a block diagram of a device adopting the CNN according to prior art.
Referring to FIG. 1, the device 100 includes a feature computation block 101, an application block 102, and an application-specific loss block 103.
Upon receiving an input image, the feature computation block 101, which includes one or more convolution blocks and Feature Pyramid Network (FPN) blocks, may generate feature maps from the input image. For reference, each of the convolution blocks may be comprised of various layers such as a convolutional layer, a pooling layer, a fully-connected layer, and an activation layer, e.g., a ReLU layer.
The application block 102 may utilize at least part of the generated feature maps to acquire an application-specific output. For example, if the application block 102 performs a function of image segmentation, the application block 102 determines a type, e.g., person, car, foreground or background, for each pixel in the input image and clusters pixels of the same type to generate a label image. Alternatively, if the application block 102 performs a function of object detection, information on the type, location, and size of the object(s) in the input image may be outputted.
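The segmentation case described above can be illustrated with a minimal sketch. It assumes that the application block yields a score per class for every pixel and that the type of a pixel is taken to be its highest-scoring class; the class list, the function name `label_image`, and the score values are hypothetical and are not taken from the prior art device.

```python
# Hypothetical class list; the text names these types only as examples.
CLASSES = ["background", "person", "car"]

def label_image(scores):
    """Map per-pixel class scores (H x W x num_classes) to a label image (H x W).

    Each pixel receives the index of its highest-scoring class, so pixels of
    the same type end up sharing one label value.
    """
    return [
        [max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
        for row in scores
    ]

# Toy 2x2 input with made-up scores, for illustration only.
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]],
    [[0.1, 0.1, 0.8], [0.6, 0.3, 0.1]],
]
labels = label_image(scores)
print(labels)  # [[0, 1], [2, 0]]
print([[CLASSES[i] for i in row] for row in labels])
```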
Moreover, the application-specific loss block 103 may compare the application-specific output obtained from the application block 102 with its corresponding Ground Truth (GT) to compute a loss. Then, the device 100 may obtain optimal parameters by using the computed loss during a first backpropagation process. Thereafter, the device 100 may remove the application-specific loss block 103 for a real test.
FIG. 2A is a diagram illustrating how the amount of computation varies according to the size of an input image, whereas FIG. 2B is a diagram showing how the accuracy of a result of an application, e.g., object detection, varies according to the size of the input image.
As shown in FIGS. 2A and 2B, the amount of computation of the CNN adopted for the device is proportional to the size of the input image. The object detection accuracy likewise increases with the size of the input image.
If the number of pixels in the input image is reduced, the amount of computation is reduced as well. However, as shown in FIG. 2B, the detection accuracy may be sacrificed due to the reduced size of the input image.
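As a rough illustration of why computation scales with the number of pixels, the following sketch counts the multiply-accumulate operations of a single convolutional layer. The formula assumes "same" padding, and the image sizes, channel counts, and the function name `conv_macs` are hypothetical values chosen for illustration.

```python
def conv_macs(width, height, in_channels, out_channels, kernel=3, stride=1):
    """Multiply-accumulate count of one convolutional layer with 'same' padding."""
    out_w = -(-width // stride)   # ceiling division
    out_h = -(-height // stride)
    # One kernel x kernel x in_channels dot product per output value.
    return out_w * out_h * kernel * kernel * in_channels * out_channels

full = conv_macs(640, 480, 3, 64)   # hypothetical full-size input
half = conv_macs(320, 240, 3, 64)   # width and height both halved
print(full, half, full / half)      # the ratio is 4.0
```

Under these assumptions, halving both the width and the height leaves a quarter of the pixels and therefore a quarter of the operations for this layer.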
FIG. 3 is a block diagram schematically illustrating a process of generating feature maps by using a conventional CNN with a configuration including the FPN blocks according to prior art.
Referring to FIG. 3, the feature computation block 101 may include a plurality of convolution blocks, i.e., (1-1)-th to (1-k)-th filter blocks, for performing convolution operations. As shown in FIG. 3, each of the convolution blocks is comprised of multiple layers. In detail, each of the (1-1)-th to the (1-k)-th filter blocks includes an arbitrary number of convolutional layers alternating with activation layers, e.g., Rectified Linear Unit (ReLU) layers. Such an iterative configuration repeatedly performs the convolution operations along with non-linear operations.
The (1-1)-th filter block generates a (1-1)-th feature map from the input image, and the (1-2)-th filter block generates a (1-2)-th feature map from the (1-1)-th feature map, and so on. In this way, the filter blocks sequentially generate their corresponding feature maps.
Because each of the (1-1)-th to the (1-k)-th filter blocks increases the number of channels of its input while decreasing the size thereof, if the input image with a size and a channel of W×H×3 is fed to the (1-1)-th filter block, the (1-1)-th feature map with a size and a channel of W/2×H/2×C and the (1-2)-th feature map with a size and a channel of W/4×H/4×2C may be generated, and so on. Herein, each first factor, e.g., W, W/2, W/4, stands for the width of the input image or of the feature map, each second factor, e.g., H, H/2, H/4, represents the height thereof, and each third factor, e.g., 3, C, 2C, stands for the number of channels thereof. Hence, the convolution blocks, i.e., the (1-1)-th to the (1-k)-th filter blocks, may generate the feature maps with various sizes and numbers of channels, respectively.
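The size and channel progression described above can be tabulated with a short sketch. It assumes that every filter block halves the width and height and that the channel count doubles from block to block starting at C, which is the pattern the W/2×H/2×C and W/4×H/4×2C examples suggest; the concrete values of W, H, C, and k and the function name `backbone_shapes` are hypothetical.

```python
def backbone_shapes(width, height, base_channels, num_blocks):
    """Return (width, height, channels) of the (1-1)-th to (1-k)-th feature maps."""
    shapes = []
    w, h, c = width, height, base_channels
    for _ in range(num_blocks):
        w //= 2            # each filter block halves the width ...
        h //= 2            # ... and the height
        shapes.append((w, h, c))
        c *= 2             # the next block doubles the channel count
    return shapes

# Hypothetical input of W=512, H=256 with C=64 and k=4.
shapes = backbone_shapes(512, 256, 64, 4)
for i, shape in enumerate(shapes, start=1):
    print(f"(1-{i})-th feature map: {shape}")
```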
Referring to FIG. 3 again, a plurality of FPN blocks, i.e., 1-st to (k−1)-th FPN blocks, are respectively connected to the corresponding ones of the (1-1)-th to (1-k)-th filter blocks. Each of the FPN blocks includes a 1×1 convolution filter used for adjusting the number of channels of the feature map received from its corresponding filter block, an up-sampling block used for increasing the size of the feature map received from a previous FPN block, and a computation unit used for summing up an output of the 1×1 convolution filter and an output of the up-sampling block and then providing the summed output to a next FPN block. Herein, the up-sampling block may double the size of the feature map received from the previous FPN block so that it matches the size of the feature map received from the corresponding filter block.
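The three operations of one FPN block can be sketched as follows. This is a minimal illustration rather than the implementation of FIG. 3: it assumes nearest-neighbour up-sampling and a bias-free 1×1 convolution, neither of which is specified in the text, and the function names and toy values are hypothetical. Feature maps are stored as nested lists indexed as [row][column][channel].

```python
def conv1x1(fmap, weights):
    """1x1 convolution: per-pixel linear map from C_in to C_out channels.

    weights[o][i] is the weight from input channel i to output channel o.
    """
    return [
        [[sum(w_o[i] * px[i] for i in range(len(px))) for w_o in weights]
         for px in row]
        for row in fmap
    ]

def upsample2x(fmap):
    """Double the height and width by repeating each pixel (nearest neighbour)."""
    out = []
    for row in fmap:
        for _ in range(2):
            out.append([list(px) for px in row for _ in range(2)])
    return out

def fpn_block(lateral, top_down, weights):
    """Sum the channel-adjusted lateral map and the up-sampled top-down map."""
    adjusted = conv1x1(lateral, weights)     # channels: C_in -> D
    upsampled = upsample2x(top_down)         # size: doubled to match `adjusted`
    return [
        [[a + b for a, b in zip(pa, pb)] for pa, pb in zip(ra, rb)]
        for ra, rb in zip(adjusted, upsampled)
    ]

# Toy values: a 2x2 lateral map with 2 channels, a 1x1 top-down map with D=3.
lateral = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
top_down = [[[10, 20, 30]]]
weights = [[1, 0], [0, 1], [1, 1]]           # maps 2 channels to D=3 channels
out = fpn_block(lateral, top_down, weights)
print(out)  # every pixel is [11, 21, 32]
```

The summed output has the spatial size of the lateral map and D channels, which is the form in which it would be handed to the next FPN block.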
As shown in FIG. 3, the 4-th FPN block receives the (1-4)-th feature map with a size and a channel of W/16×H/16×8C from the (1-4)-th filter block and adjusts the number of channels of the (1-4)-th feature map from 8C to D without modifying the size thereof. Also, the 4-th FPN block receives the (P−5)-th feature map with a size and a channel of W/32×H/32×D from the 5-th FPN block and rescales the size of the (P−5)-th feature map to W/16×H/16. Then, the (P−4)-th feature map with a size and a channel of W/16×H/16×D is generated and carried to the 3-rd FPN block, and so on. Each of the remaining FPN blocks follows the same procedure described above to ultimately output the (P−1)-th feature map with a size and a channel of W/2×H/2×D. However, a massive amount of computation is required for the feature computation block 101 including the FPN blocks.
Accordingly, the applicant of the present invention intends to disclose a novel method for generating feature maps that yields a highly accurate result of an application while reducing computation time.