Systems and methods herein generally relate to image recognition using a neural network and, more particularly, to training convolutional neural networks for large size input images.
Recently, deep learning has attracted more attention in the computer vision community because of its performance in terms of classification, detection, and recognition accuracy. However, there is a technical issue in the training and testing of the Convolutional Neural Networks (CNNs) that are used for image classification and detection: the prevalent CNNs require a fixed input image size (e.g., 256×256), which limits both the aspect ratio and the scale of the input image. When applied to images of arbitrary sizes, most current methods fit the input image to the required size by cropping or warping of the input image. CNNs require a fixed input size because a CNN mainly consists of two parts: convolutional layers and fully-connected layers that follow. The convolutional layers operate in a sliding-window manner and output feature maps that represent the spatial arrangement of the activations and the spatial scale of the activations. In fact, convolutional layers do not require a fixed image size and can generate feature maps of any size. On the other hand, the fully-connected layers need to have fixed size/length input by their definition. Hence, the fixed size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network. A simple approach to accommodate larger input image size is to modify the parameters of the convolutional filters so that the output at the last convolutional layer will fit the size requirement of the fully connected layers. However, the scale of the spatial features extracted will then vary depending on the input image size. Another approach is to replace the pooling layers in the current network with a spatial pyramid pooling. Spatial pyramid pooling can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of most prevalent deep networks, where the number of sliding windows depends on the input size. In the method utilizing spatial pyramid pooling, in each spatial bin, the responses of each filter (e.g., max pooling) was pooled. The outputs of the spatial pyramid pooling are kM-dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer. This approach maintains the size of the fully connected layer by using spatial pyramid pooling at the last convolutional layer.