In machine learning, a convolutional neural network (CNN or ConvNet) is a class of deep, feed-forward artificial neural networks that has been successfully applied to analyzing visual imagery.
FIG. 1 is a drawing schematically illustrating a learning process of a conventional CNN according to prior art.
Specifically, FIG. 1 shows a process of acquiring losses by comparing a predicted bounding box with a Ground Truth (GT) bounding box. Herein, the losses stand for differences between the predicted bounding box and the GT bounding box, and are denoted as dxc, dyc, dw, and dh as shown in FIG. 1.
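For illustration only, the four differences may be sketched as follows. The sketch assumes that each bounding box is given as (x_min, y_min, x_max, y_max) and that the differences are plain subtractions of center coordinates, widths, and heights. Neither the box format nor the exact form of the differences is specified above, and the names `to_center_form`, `box_differences`, `d_xc`, `d_yc`, `d_w`, and `d_h` are hypothetical.

```python
def to_center_form(box):
    """Convert (x_min, y_min, x_max, y_max) to (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2.0, y_min + height / 2.0, width, height)


def box_differences(predicted, ground_truth):
    """Return the center, width, and height differences (GT minus predicted).

    Assumption: plain differences are used; a practical detector may
    normalize or log-scale these values instead.
    """
    p_xc, p_yc, p_w, p_h = to_center_form(predicted)
    g_xc, g_yc, g_w, g_h = to_center_form(ground_truth)
    return {
        "d_xc": g_xc - p_xc,
        "d_yc": g_yc - p_yc,
        "d_w": g_w - p_w,
        "d_h": g_h - p_h,
    }


# Hypothetical boxes chosen only for this example.
diffs = box_differences((10, 20, 50, 80), (12, 18, 56, 82))
print(diffs)
```

All four differences are zero only when the predicted bounding box coincides with the GT bounding box.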
First, as illustrated in FIG. 1, the learning device may acquire an RGB image as an input to be fed into a plurality of convolutional layers, i.e., convolutional filters, included in a convolution block. The size, e.g., the width and the height, of the RGB image becomes smaller while the number of channels increases as the RGB image passes through the plurality of convolutional layers.
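The change in size and channel count may be traced with the standard output-size formula for a convolution, as in the sketch below. The kernel size of 3, stride of 2, padding of 1, input size of 224×224, and channel counts of 64, 128, and 256 are assumptions made only for this example and are not taken from FIG. 1.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, channels, layer_channels):
    """List the (height, width, channels) shape after each convolutional layer."""
    shapes = [(height, width, channels)]
    for out_channels in layer_channels:
        height = conv_output_size(height)
        width = conv_output_size(width)
        shapes.append((height, width, out_channels))
    return shapes


# Assumed example: a 224x224 RGB image and three layers with 64, 128, 256 filters.
shapes = trace_shapes(224, 224, 3, [64, 128, 256])
for shape in shapes:
    print(shape)
```

Under these assumptions, each layer halves the width and the height while the number of channels grows with the number of filters in that layer.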
As illustrated in FIG. 1, the learning device allows a Region Proposal Network (RPN) to generate proposal boxes from the final feature map outputted by the convolution block. The learning device then allows a pooling layer, e.g., an ROI pooling layer, to resize the areas on the feature map corresponding to the proposal boxes to a predetermined size, e.g., a size of 2×2, by applying a max pooling operation (or an average pooling operation) to pixel data of those areas. As a result, a pooled feature map is acquired. For reference, the pooled feature map may also be referred to as a feature vector. Herein, the max pooling operation is an operation by which the maximum value in each of the sub-regions divided from a subject area on a feature map is selected as one of the representative values for the subject area, as shown in the bottom right of FIG. 1.
Next, the pooled feature map may be fed into a fully connected (FC) layer.
Then, the learning device may allow the FC layer to recognize a category of an object in the RGB image. In addition, the predicted bounding box in the RGB image may be acquired through the FC layer, and the losses may be acquired by comparing the predicted bounding box with the GT bounding box. Herein, the GT bounding box represents a bounding box precisely surrounding the object in the RGB image, and is usually prepared by a human being.
Lastly, the learning device in FIG. 1 may adjust at least one of the parameters included in the FC layer, the RPN, or the plurality of convolutional layers by using the losses during a backpropagation process.
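As a toy illustration of adjusting parameters by using a loss, the sketch below applies gradient descent to a single linear unit with a squared-error loss. This is a deliberate simplification and not the network of FIG. 1: the linear unit, the squared-error loss, the learning rate, and all sample values are assumptions, and the name `train_step` is hypothetical.

```python
def train_step(weights, features, target, learning_rate):
    """Run one gradient-descent update for a linear unit with squared-error loss."""
    prediction = sum(w * f for w, f in zip(weights, features))
    error = prediction - target
    loss = error ** 2
    # Gradient of the loss with respect to each weight: 2 * error * feature.
    gradients = [2.0 * error * f for f in features]
    new_weights = [w - learning_rate * g for w, g in zip(weights, gradients)]
    return new_weights, loss


# Hypothetical inputs: two features and one regression target.
weights = [0.0, 0.0]
features = [1.0, 2.0]
target = 4.0
losses = []
for _ in range(20):
    weights, loss = train_step(weights, features, target, learning_rate=0.05)
    losses.append(loss)

print(losses[0], losses[-1])
```

With these values the loss shrinks at every step, which is the behavior the backpropagation process aims for when it adjusts the parameters of the FC layer, the RPN, or the convolutional layers.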
Thereafter, a testing device (not illustrated) having the CNN with the adjusted parameters may acquire a bounding box surrounding an object in a test image. However, even if the testing device has the CNN with the adjusted parameters, it is very difficult to obtain a bounding box that precisely surrounds the object in the test image.
Accordingly, the applicant of the present invention proposes a method for acquiring at least one bounding box corresponding to at least one object in a test image with high precision.