In machine learning, a convolutional neural network (CNN or ConvNet) is a class of deep, feed-forward artificial neural networks that has been successfully applied to analyzing visual imagery.
FIG. 1 is a drawing schematically illustrating a learning process of a conventional CNN according to prior art. Specifically, the figure shows the process of comparing a bounding box predicted or estimated by a learning device with a Ground Truth (GT) bounding box.
Referring to FIG. 1, the conventional process by which the learning device acquires losses by comparing the predicted bounding box with the GT bounding box is described below. Herein, the losses represent differences between the predicted bounding box and the GT bounding box, and are denoted as dxc, dyc, dw, and dh in FIG. 1.
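As a rough illustration of the four differences named above, the following Python sketch computes dxc, dyc, dw, and dh for a pair of boxes. It assumes that each box is described by a center coordinate (xc, yc), a width, and a height, and that each difference is taken as the GT value minus the predicted value. The `Box` type, the `box_differences` function, the sign convention, and the sample numbers are hypothetical choices made for this example and are not taken from FIG. 1.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    """Hypothetical box format: center (xc, yc) plus width and height."""
    xc: float
    yc: float
    w: float
    h: float


def box_differences(pred: Box, gt: Box) -> Dict[str, float]:
    """Return per-coordinate differences (assumed sign: GT minus prediction)."""
    return {
        "dxc": gt.xc - pred.xc,
        "dyc": gt.yc - pred.yc,
        "dw": gt.w - pred.w,
        "dh": gt.h - pred.h,
    }


# Made-up boxes, purely for illustration.
predicted = Box(xc=50.0, yc=40.0, w=20.0, h=10.0)
ground_truth = Box(xc=53.0, yc=38.0, w=24.0, h=10.0)
print(box_differences(predicted, ground_truth))
# {'dxc': 3.0, 'dyc': -2.0, 'dw': 4.0, 'dh': 0.0}
```

A detector may normalize or otherwise transform such differences before using them as losses; the raw subtraction here only mirrors the description above.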
First, as illustrated in FIG. 1, the learning device may acquire an RGB image as an input to be fed into a plurality of convolutional layers, i.e., convolutional filters, included in a convolution block. As the RGB image passes through the plurality of convolutional layers, its size, i.e., its width and height, becomes progressively smaller while the number of channels increases.
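The shrinking width and height and the growing channel count can be traced with the standard convolution output-size formula. The sketch below is only an example: the input size of 224×224, the three layers in `LAYERS`, and their kernel, stride, and padding values are assumptions chosen for illustration, not the configuration of the convolution block in FIG. 1.

```python
from typing import List, Tuple

# Hypothetical layer settings: (out_channels, kernel_size, stride, padding).
LAYERS: List[Tuple[int, int, int, int]] = [
    (64, 3, 2, 1),
    (128, 3, 2, 1),
    (256, 3, 2, 1),
]


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(channels: int, height: int, width: int,
                 layers: List[Tuple[int, int, int, int]]
                 ) -> List[Tuple[int, int, int]]:
    """Return the (channels, height, width) shape after each layer."""
    shapes = []
    for out_channels, kernel, stride, padding in layers:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        channels = out_channels
        shapes.append((channels, height, width))
    return shapes


# A 3-channel (RGB) image of assumed size 224x224.
for shape in trace_shapes(3, 224, 224, LAYERS):
    print(shape)
# (64, 112, 112)
# (128, 56, 56)
# (256, 28, 28)
```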
As illustrated in FIG. 1, the learning device allows a Region Proposal Network (RPN) to generate proposal boxes from the final feature map outputted by the convolution block. The learning device then allows a pooling layer, e.g., an ROI pooling layer, to resize the areas on the feature map corresponding to the proposal boxes to a predetermined size, e.g., a size of 2×2, by applying a max pooling operation (or an average pooling operation) to pixel data of those areas. As a result, a pooled feature map is acquired. Herein, the max pooling operation divides an area to be pooled into sub-regions and selects the maximum value in each sub-region as the representative value for that sub-region, as shown in the bottom right of FIG. 1.
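The max pooling of one proposal area down to a 2×2 output can be sketched as follows. This is a minimal single-channel example in plain Python, not the pooling layer of FIG. 1: the `roi_max_pool` function, the end-exclusive (x0, y0, x1, y1) box format, the rule used to split the area into sub-regions, and the 4×4 sample values are all assumptions made for illustration.

```python
from typing import List, Tuple


def roi_max_pool(feature: List[List[float]],
                 roi: Tuple[int, int, int, int],
                 out_h: int = 2,
                 out_w: int = 2) -> List[List[float]]:
    """Max-pool the ROI (x0, y0, x1, y1), end-exclusive, to out_h x out_w.

    Assumes the ROI is at least out_h rows tall and out_w columns wide.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of sub-region i: floor for the start, ceiling for the end.
        r0 = y0 + (i * h) // out_h
        r1 = y0 + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = x0 + (j * w) // out_w
            c1 = x0 + ((j + 1) * w + out_w - 1) // out_w
            # Representative value: the maximum inside the sub-region.
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Made-up 4x4 single-channel feature map.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[6, 8], [14, 16]]
```

When the area size is not divisible by the output size, this sketch lets neighboring sub-regions overlap by one row or column; other splitting rules are equally possible.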
Next, the pooled feature map may be fed into a fully connected (FC) layer as an input. Also, the learning device may allow the FC layer to recognize a type or a category of an object in the RGB image. For reference, the pooled feature map may also be referred to as a feature vector.
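To make the relationship between the pooled feature map, the feature vector, and the FC layer concrete, the sketch below flattens a 2×2 pooled map into a vector, applies a single weight matrix with a bias, and picks the highest-scoring category. The function names, the weights, the bias, and the two category labels are hypothetical values invented for this example; a trained FC layer would hold learned parameters and would typically be followed by further layers or a normalization such as softmax.

```python
from typing import List


def flatten(pooled: List[List[float]]) -> List[float]:
    """Turn a pooled feature map into a one-dimensional feature vector."""
    return [value for row in pooled for value in row]


def fc_scores(vector: List[float],
              weights: List[List[float]],
              bias: List[float]) -> List[float]:
    """One FC layer: a weighted sum of the inputs plus a bias per output."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def predict_category(pooled: List[List[float]],
                     weights: List[List[float]],
                     bias: List[float],
                     categories: List[str]) -> str:
    """Return the label whose score is highest."""
    scores = fc_scores(flatten(pooled), weights, bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return categories[best]


# Made-up pooled map and parameters, purely for illustration.
pooled_map = [[6.0, 8.0], [14.0, 16.0]]
weights = [[1.0, 0.0, 0.0, 0.0],   # scores category 0 from the first value
           [0.0, 0.0, 0.0, 0.5]]   # scores category 1 from the last value
bias = [0.0, 0.0]
print(predict_category(pooled_map, weights, bias, ["car", "pedestrian"]))
# pedestrian
```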
In addition, the predicted bounding box in the RGB image may be acquired through the FC layer, and the losses may be acquired by comparing the predicted bounding box with the ground truth (GT) bounding box. Herein, the GT bounding box represents a bounding box precisely surrounding the object in the RGB image, which is usually prepared by a human being.
Lastly, the learning device in FIG. 1 may adjust at least one of the parameters included in the FC layer, the RPN, or the plurality of convolutional layers to reduce the losses during a backpropagation process. After the parameters are adjusted, a testing device may acquire a bounding box surrounding an object in a test image.
However, the testing device including the CNN with the adjusted parameters may still fail to acquire a bounding box that precisely surrounds the object in the test image. This is because only the smallest-sized feature map, generated as a result of applying convolution operations multiple times to the test image, is generally used, and the smallest-sized feature map alone is insufficient to express the object.
Accordingly, the applicant of the present invention proposes a learning method and a learning device for acquiring a bounding box with high precision from a plurality of multi-scaled feature maps. A testing method and a testing device using the same are disclosed herein as well.