In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural network that has successfully been applied to analyzing visual imagery.
A CNN-based object detector may (i) instruct one or more convolutional layers to apply convolution operations to an input image, to thereby generate a feature map corresponding to the input image, (ii) instruct an RPN (Region Proposal Network) to identify proposals corresponding to an object in the input image by using the feature map, (iii) instruct a pooling layer to apply at least one pooling operation to areas on the feature map corresponding to the identified proposals, to thereby generate one or more pooled feature maps, and (iv) instruct an FC (Fully Connected) layer to apply at least one fully connected operation to the acquired pooled feature maps to output class information and regression information for the object, to thereby detect the object on the input image.
However, since the CNN-based object detector uses the feature map whose size is reduced from a size of the input image by the convolutional layer, it is difficult to detect a small-sized object in the input image although a large-sized object in the input image can be easily detected.
That is, if there are multiple target regions corresponding to one or more objects as subjects to be detected in the input image, desired features may not be extracted accurately from some of target regions due to sizes thereof, and as a result, certain objects cannot be detected.
Such a problem may be resolved by object detection via cropping each of the target regions in each of the images among an image pyramid derived from the input image, but in this case, the object detection must be performed for each of the cropped images corresponding to the target regions, thus computational load may increase.
In addition to this, a CNN operation is a block operation, e.g., an operation by a unit of 32, 64, 128, etc., for fast calculation, but if an input image whose width or height is not a multiple of the unit is acquired, one or more padding regions must be added to make it be a multiple of the unit, but this becomes a burden to the CNN operation. As a result, the more there are cropped images whose width or height is not a multiple of the unit, the heavier the burden on the CNN, which slows down the calculation speed of the CNN.
Accordingly, the inventors of the present disclosure propose a learning method, a learning device for efficiently detecting objects and reducing computational time of the CNN, by using the target regions corresponding to the objects with various sizes in the input image, and a testing method and a testing device using the same.