In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural network that has successfully been applied to analyzing visual imagery.
FIG. 1 is a drawing schematically illustrating a testing method of an object detector by using a conventional R-CNN.
First, a testing device as illustrated in FIG. 1 may acquire an RGB image 101 as an input to be fed into one or more convolutional layers 102, i.e., convolutional filters, included in a convolution block. As the RGB image passes through the convolutional layers, a size, e.g., width and height, of the RGB image becomes smaller in its width and its height while the number of channels is increased.
Next, the testing device may pass a feature map 103 through a learned RPN (Region Proposal Network), to thereby generate an ROI 105, and may instruct a pooling layer 106 to resize a pixel data included in regions corresponding to the ROI 105 on the feature map by applying one of a max pooling operation or an average pooling operation to the regions, to resize the pixel data and generate a feature vector by referring to the resized feature map.
Then, the testing device may input the feature vector into a learned FC (Fully Connected) layer 108 to determine types of objects on the inputted RGB image by classification operation, etc., and may generate a bounding box on the inputted RGB image by using the FC layer.
According to a method for detecting the object by using a conventional R-CNN, ROI proposals are acquired by using anchor boxes. Herein, the anchor boxes have a variety of scales and aspect ratios in order to find various objects which have various sizes and shapes.
However, it has a drawback in that when the pooling layer pools the feature map, the pooling operation is performed with only single scale and single aspect ratio without considering various sizes and shapes of the objects. Thus, it cannot detect the objects precisely.