In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural network that has successfully been applied to analyzing visual imagery.
FIG. 1 is a diagram schematically illustrating a learning process using a conventional CNN, which compares a predicted bounding box with a ground truth (GT) bounding box to thereby acquire loss values. For example, the loss values may include dxc, dyc, dw, and dh as shown in FIG. 1.
First, a convolutional layer of the conventional learning device illustrated in FIG. 1, which includes at least one convolutional filter, may receive a training image, e.g., an RGB image, of an object and then create at least one feature map by using the training image. A width and a height of the feature map may decrease as it passes through the convolutional layer, but the number of channels thereof may increase.
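As a rough illustration of how the width and the height may shrink while the number of channels grows, the following sketch computes the shape of a feature map after a single convolutional layer. The function name `conv_output_shape`, the 224x224 input size, and the filter settings are hypothetical values chosen for illustration; they are not taken from FIG. 1.

```python
def conv_output_shape(height, width, kernel, stride, padding, num_filters):
    """Return (height, width, channels) of a feature map after one conv layer.

    Uses the standard size formula floor((n + 2p - k) / s) + 1 for each
    spatial dimension; the output channel count equals the number of filters.
    """
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, num_filters


# Hypothetical example: a 224x224 RGB image (3 channels) passed through
# 64 filters of size 3x3 with stride 2 and padding 1.
shape = conv_output_shape(224, 224, kernel=3, stride=2, padding=1, num_filters=64)
print(shape)  # (112, 112, 64)
```

Under these assumed settings, the spatial size is halved while the channel count rises from 3 to 64.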
Next, when the feature map is inputted into a region proposal network (RPN), the conventional learning device may allow the RPN to acquire at least one region of interest (ROI). In detail, the RPN may create one or more anchor boxes, compare each of the anchor boxes with the GT bounding box, and determine, as the ROI, specific anchor boxes among the anchor boxes that match the GT bounding box to a degree equal to or greater than a predetermined threshold.
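The comparison of anchor boxes with the GT bounding box may be sketched as follows. This is a minimal sketch which assumes that the degree of matching is measured by intersection over union (IoU), that boxes are given as (x_min, y_min, x_max, y_max), and that the threshold is 0.5; the text above fixes none of these, and the names `iou` and `select_rois` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes yield an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_rois(anchor_boxes, gt_box, threshold=0.5):
    """Keep the anchor boxes whose IoU with the GT box meets the threshold."""
    return [anchor for anchor in anchor_boxes if iou(anchor, gt_box) >= threshold]


# Hypothetical anchor boxes and GT bounding box.
anchors = [(0, 0, 10, 10), (2, 2, 12, 12), (20, 20, 30, 30)]
gt = (1, 1, 11, 11)
rois = select_rois(anchors, gt, threshold=0.5)
print(rois)  # [(0, 0, 10, 10), (2, 2, 12, 12)]
```

In this example the first two anchor boxes overlap the GT bounding box with an IoU of about 0.68 and are kept as ROIs, whereas the third does not overlap it at all and is discarded.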
Then, the conventional learning device may allow a pooling layer to apply either a max pooling operation or an average pooling operation to pixel data, corresponding to the ROI, on the feature map. Herein, the max pooling may partition the feature map into a set of non-overlapping sub-regions and, for each of the sub-regions, output the maximum among the pixel values in that sub-region, and the average pooling may partition the feature map in the same manner and, for each of the sub-regions, output the average of the pixel values in that sub-region.
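The two pooling operations described above may be sketched on a single-channel region as follows. The 4x4 values and the 2x2 sub-region size are hypothetical, and the helper names `pool` and `mean` are introduced only for this illustration.

```python
def pool(region, size, reduce_fn):
    """Partition a 2D region into non-overlapping size x size sub-regions
    and reduce each sub-region to a single value with reduce_fn."""
    rows, cols = len(region), len(region[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [region[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical pixel data corresponding to an ROI on one channel of a feature map.
roi_values = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool(roi_values, 2, max)   # [[7, 8], [9, 6]]
avg_pooled = pool(roi_values, 2, mean)  # [[4.0, 5.0], [4.5, 3.0]]
print(max_pooled, avg_pooled)
```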
Next, the conventional learning device in FIG. 1 may perform processes of (i) inputting a pooled feature map, acquired as a result of the max pooling or the average pooling, into a fully connected (FC) layer and (ii) allowing the FC layer to confirm a type, i.e., a class, of the object by applying classification operations to the pooled feature map. For reference, the pooled feature map may be referred to as a feature vector.
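One common way to realize such classification operations is a linear layer followed by a softmax, as in the minimal sketch below. The text above does not specify the form of the classification operations, so this form is an assumption, and the weights, biases, class names, and function names are all hypothetical.

```python
import math


def fully_connected(feature_vector, weights, biases):
    """Compute one score per class: weights @ feature_vector + biases."""
    return [
        sum(w * x for w, x in zip(row, feature_vector)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert scores into probabilities; subtracting the peak avoids overflow."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_vector, weights, biases, class_names):
    """Return the most probable class name together with all class probabilities."""
    probs = softmax(fully_connected(feature_vector, weights, biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Hypothetical feature vector (a flattened pooled feature map) and FC parameters.
feature_vector = [7.0, 8.0, 9.0, 6.0]
weights = [
    [0.1, 0.0, -0.1, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [-0.1, 0.0, 0.0, 0.1],
]
biases = [0.0, 0.0, 0.0]
class_names = ["car", "pedestrian", "background"]

label, probs = classify(feature_vector, weights, biases, class_names)
print(label)  # pedestrian
```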
Further, the conventional learning device in FIG. 1 may allow the FC layer to acquire a bounding box on the training image, and then allow a loss layer to acquire loss values which represent differences between the acquired bounding box and the GT bounding box. Herein, the GT bounding box may be a bounding box that exactly encloses the object in the training image and may generally be created by a human.
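Loss values such as dxc, dyc, dw, and dh may be sketched as plain differences between the center coordinates, the widths, and the heights of the two boxes. This is only an assumed reading: the exact definition and sign convention used in FIG. 1 are not stated in the text, and the helper names and box coordinates below are hypothetical.

```python
def to_center_form(box):
    """Convert (x_min, y_min, x_max, y_max) into (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return x_min + width / 2, y_min + height / 2, width, height


def box_differences(pred_box, gt_box):
    """Differences between the GT box and the predicted box (assumed: GT minus predicted)."""
    pxc, pyc, pw, ph = to_center_form(pred_box)
    gxc, gyc, gw, gh = to_center_form(gt_box)
    return {"dxc": gxc - pxc, "dyc": gyc - pyc, "dw": gw - pw, "dh": gh - ph}


# Hypothetical predicted and GT bounding boxes on a training image.
pred_box = (10, 10, 50, 40)
gt_box = (12, 14, 56, 40)

losses = box_differences(pred_box, gt_box)
total_l1 = sum(abs(value) for value in losses.values())
print(losses)    # {'dxc': 4.0, 'dyc': 2.0, 'dw': 4, 'dh': -4}
print(total_l1)  # 14.0
```

The sum of absolute values is shown only as one simple way of reducing the four differences to a single scalar; practical detectors often use other formulations, such as a smooth L1 loss on normalized offsets.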
Finally, the conventional learning device in FIG. 1 may adjust at least part of one or more parameters of the FC layer, one or more parameters of the RPN, and one or more parameters of the convolutional layer to reduce the loss values during a backpropagation process. By adjusting the parameters, the accuracy of acquiring a bounding box in a test image later may be improved.
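The adjustment of parameters to reduce the loss values may be illustrated with a single gradient descent step on a toy model. The two-parameter model, the squared-error loss, and the learning rate below are assumptions made for illustration only; they stand in for the much larger parameter sets of the FC layer, the RPN, and the convolutional layer, and the text above does not specify the update rule.

```python
def loss_and_grads(params, x, target):
    """Squared error of the toy model w * x + b, with its gradients w.r.t. w and b."""
    w, b = params
    error = w * x + b - target
    return error ** 2, [2 * error * x, 2 * error]


def gradient_step(params, grads, learning_rate):
    """Move each parameter against its gradient to reduce the loss."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Hypothetical starting parameters, input, and target.
params = [0.0, 0.0]
x, target = 2.0, 4.0

loss_before, grads = loss_and_grads(params, x, target)
params = gradient_step(params, grads, learning_rate=0.05)
loss_after, _ = loss_and_grads(params, x, target)

print(loss_before, loss_after)  # the loss decreases after the update
```

In an actual CNN, the gradients would be obtained by propagating the loss backward through every layer rather than by the closed-form expression used here.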
Conventionally, the pooling layer may apply pooling operations to an area, corresponding to the ROI determined by the RPN, on the feature map. However, because the ROI may not include the object exactly, the features pooled from this area may not be the desired features of the object. Therefore, such pooled features may adversely affect the learning of the CNN.
Consequently, the inventor of the present invention proposes a technique that utilizes an additional GT ROI besides the conventional ROI in learning.