In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has been successfully applied to analyzing visual imagery.
FIG. 1 is a drawing illustrating a learning process using the CNN according to the conventional art.
FIG. 1 illustrates a process of comparing a bounding box estimated by a learning device with the bounding box of its ground truth (GT).
Referring to FIG. 1, a conventional learning device estimates the bounding box and then compares the estimated bounding box with the GT's bounding box, to thereby obtain one or more loss values. Herein, the loss values represent differences between the estimated bounding box and the GT's bounding box. For example, the loss values may include dxc, dyc, dw, and dh as shown in FIG. 1.
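For illustration only, the comparison described above may be sketched as follows. The sketch assumes that each bounding box is described by a center coordinate (xc, yc), a width w, and a height h, and that each loss value is a plain signed difference (estimated minus GT); the names Box and box_differences are hypothetical and do not appear in FIG. 1.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box: center (xc, yc), width w, height h."""
    xc: float
    yc: float
    w: float
    h: float


def box_differences(estimated: Box, gt: Box) -> Dict[str, float]:
    """Return per-component differences between an estimated box and its GT box.

    Assumption: each loss value is the signed difference (estimated - GT).
    """
    return {
        "dxc": estimated.xc - gt.xc,
        "dyc": estimated.yc - gt.yc,
        "dw": estimated.w - gt.w,
        "dh": estimated.h - gt.h,
    }


# Example: the estimated box is shifted and sized slightly differently from the GT box.
losses = box_differences(Box(50.0, 40.0, 20.0, 10.0), Box(52.0, 41.0, 24.0, 8.0))
```

An actual learning device may normalize or otherwise transform these differences before using them as loss values.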
First of all, the learning device of FIG. 1 obtains an RGB image and inputs the RGB image to a convolutional layer. After the RGB image passes through the convolutional layer, a feature map is generated whose width and height are smaller than those of the RGB image and whose number of channels is larger.
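As a hedged illustration of how the width and the height decrease while the number of channels increases, the following sketch applies the commonly used output-size formula for a convolution. The kernel size, stride, padding, and number of filters are assumed values chosen for the example; none of them is specified for FIG. 1.

```python
def conv_output_shape(height, width, out_channels, kernel, stride=1, padding=0):
    """Return (height, width, channels) of a feature map after one convolution.

    Uses the standard formula: out = floor((in + 2 * padding - kernel) / stride) + 1.
    The number of output channels equals the number of filters in the layer.
    """
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, out_channels


# Assumed example: a 224 x 224 RGB image (3 channels) passed through a layer
# with 64 filters of size 3 x 3, stride 2, and padding 1.
feature_map_shape = conv_output_shape(224, 224, out_channels=64, kernel=3, stride=2, padding=1)
# feature_map_shape is (112, 112, 64): half the width and height, 64 channels instead of 3.
```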
The learning device of FIG. 1 may feed the feature map into a region proposal network (RPN) to generate proposal boxes, and may then apply either max pooling operations or average pooling operations to pixel data included in the areas on the feature map corresponding to the proposal boxes, to thereby generate a pooled feature map. Herein, max pooling is a method of selecting the largest value in each sub-region of a proposal box as the representative value of that sub-region, and average pooling is a method of calculating the average value in each sub-region of a proposal box as the representative value of that sub-region.
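The two pooling operations may be sketched as follows for a single-channel feature map. This is a minimal sketch under stated assumptions: the proposal box is given as (x0, y0, x1, y1) in feature-map coordinates with exclusive end indices, and the sub-regions are formed by integer division of the box. The function name roi_pool is hypothetical.

```python
def roi_pool(feature_map, box, out_h, out_w, mode="max"):
    """Pool the area of feature_map inside box into an out_h x out_w grid.

    feature_map: 2-D list of numbers (one channel).
    box: (x0, y0, x1, y1), end indices exclusive (assumed convention).
    mode: "max" keeps the largest value of each sub-region,
          "avg" keeps the average value of each sub-region.
    """
    x0, y0, x1, y1 = box
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of the i-th sub-region; each sub-region has at least one row.
        r0 = y0 + (i * roi_h) // out_h
        r1 = max(y0 + ((i + 1) * roi_h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            # Column range of the j-th sub-region; at least one column.
            c0 = x0 + (j * roi_w) // out_w
            c1 = max(x0 + ((j + 1) * roi_w) // out_w, c0 + 1)
            values = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            if mode == "max":
                row.append(max(values))
            else:
                row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
max_pooled = roi_pool(feature_map, (0, 0, 4, 4), 2, 2, mode="max")  # [[6, 8], [14, 16]]
avg_pooled = roi_pool(feature_map, (0, 0, 4, 4), 2, 2, mode="avg")  # [[3.5, 5.5], [11.5, 13.5]]
```

In this example each 2 x 2 sub-region of the proposal box is reduced to one representative value, so a 4 x 4 area becomes a 2 x 2 pooled feature map.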
Then, the learning device of FIG. 1 may input the pooled feature map to a fully connected (FC) layer. Herein, the learning device allows the FC layer to identify a type of the object in the RGB image through classification operations. The pooled feature map may also be referred to as a “feature vector.”
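The classification operations are not detailed further here. As one possible sketch, the pooled feature map may be flattened into a feature vector, passed through a single FC layer, and converted into class probabilities with a softmax function. The use of softmax, the weight values, and the function names are assumptions made only for illustration.

```python
import math


def flatten(pooled):
    """Flatten a 2-D pooled feature map into a 1-D feature vector."""
    return [value for row in pooled for value in row]


def fc_layer(features, weights, biases):
    """Compute one score per class: weights[k] . features + biases[k]."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert scores to probabilities (shifted by the max for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, biases):
    """Return (index of the most probable class, list of class probabilities)."""
    probs = softmax(fc_layer(flatten(pooled), weights, biases))
    return probs.index(max(probs)), probs


# Assumed toy values: a 2 x 2 pooled feature map and two object classes.
pooled = [[1.0, 0.0], [0.0, 2.0]]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # hypothetical FC parameters
biases = [0.0, 0.0]
predicted_class, probabilities = classify(pooled, weights, biases)
```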
Further, one or more bounding boxes on the inputted RGB image may be estimated by the FC layer, and the loss values may be obtained by comparing the estimated bounding boxes with their GT's bounding boxes. Herein, the GT's bounding boxes are bounding boxes that accurately enclose the object on the image, and they may be generated directly by a person.
Lastly, the learning device of FIG. 1 may adjust at least one parameter of the FC layer, at least one parameter of the RPN, and at least one parameter of the convolutional layer by performing backpropagation in order to reduce the loss values. After the parameters of the CNN are adjusted, new bounding boxes corresponding to a new object on a test image may be estimated.
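The idea of adjusting a parameter so that a loss value decreases may be illustrated with a deliberately simplified sketch. It assumes a model with a single parameter, a squared-error loss, and a plain gradient-descent update; the loss function and the optimizer of the learning device of FIG. 1 are not specified, so every name and value below is hypothetical.

```python
def squared_loss(param, feature, target):
    """Squared difference between the estimate (param * feature) and the GT value."""
    return (param * feature - target) ** 2


def gradient_step(param, feature, target, learning_rate):
    """Move param one step against the gradient of the squared loss."""
    gradient = 2.0 * (param * feature - target) * feature
    return param - learning_rate * gradient


# Assumed toy values: one feature, one GT value, one adjustable parameter.
param, feature, target = 0.5, 2.0, 3.0
loss_before = squared_loss(param, feature, target)
param = gradient_step(param, feature, target, learning_rate=0.05)
loss_after = squared_loss(param, feature, target)
# loss_after is smaller than loss_before for these values.
```

In a full CNN, backpropagation computes such gradients for the parameters of every layer by applying the chain rule from the loss back toward the input.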
However, if the areas corresponding to regions of interest (ROIs) on a feature map outputted from the convolutional layer are pooled with a single scale, the information expressed by the resulting pooled feature map is limited. Thus, a larger number of features needs to be used to accurately detect the object. Accordingly, the number of operations required for detecting the object increases and performance deteriorates.
Hence, the present application proposes a method for reducing the number of operations by allowing pooling layers having different scales to perform their respective pooling operations.
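Purely as an illustration of pooling one area with different scales, and not as the method of the present application, the following sketch max-pools the same ROI into grids of several sizes. How the outputs of the respective pooling layers are combined is not stated in this section; concatenating them into one feature vector is an assumption, and the function names and scale values are hypothetical.

```python
def max_pool_roi(feature_map, box, out_size):
    """Max-pool the area inside box into out_size x out_size values (flat list).

    box: (x0, y0, x1, y1), end indices exclusive (assumed convention).
    """
    x0, y0, x1, y1 = box
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        r0 = y0 + (i * roi_h) // out_size
        r1 = max(y0 + ((i + 1) * roi_h) // out_size, r0 + 1)
        for j in range(out_size):
            c0 = x0 + (j * roi_w) // out_size
            c1 = max(x0 + ((j + 1) * roi_w) // out_size, c0 + 1)
            pooled.append(max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return pooled


def multi_scale_pool(feature_map, box, scales=(1, 2)):
    """Pool the same ROI at each scale and concatenate the results (assumed combination)."""
    features = []
    for scale in scales:
        features.extend(max_pool_roi(feature_map, box, scale))
    return features


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
features = multi_scale_pool(feature_map, (0, 0, 4, 4), scales=(1, 2))
# features is [16, 6, 8, 14, 16]: one value from the 1 x 1 grid, four from the 2 x 2 grid.
```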