Region of interest (ROI) pooling is a key operation widely used by modern convolutional neural network (CNN) based object detectors. Most state-of-the-art CNN-based object detectors employ a two-stage detection framework in which the first stage produces a set of rectangular object proposals, each with an objectness score. These proposals are represented as bounding boxes of different aspect ratios and sizes. Because the proposals are class-agnostic and coarse, they need subsequent per-proposal classification and refinement in the second stage. ROI pooling transforms the feature representations of these proposals into fixed-size feature maps (e.g., 7×7 in spatial extent).
An input image is forwarded through several convolution layers (e.g., a CNN) to generate a convolutional feature map of size C×H×W, where C, H and W denote the depth (i.e., number of channels), height and width of the feature map, respectively. Given this feature map as input, a region proposal generator, which could be an external proposal method or an internal sub-network, outputs a set of object proposals within the image. The proposals are of non-uniform sizes. Each proposal is projected onto the feature map, and the ROI pooling operation produces a fixed-size feature map for each region proposal. Conventional ROI pooling operations are both computationally and memory intensive, which makes them difficult to deploy on embedded devices having limited hardware resources and power budgets.
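As a rough illustration of the conventional operation described above, the following is a minimal Python sketch of max-based ROI pooling over a C×H×W feature map. The function name roi_max_pool, the spatial_scale parameter, the inclusive box convention (x1, y1, x2, y2), and the rounding and binning rules are assumptions made for illustration; they are not taken from this disclosure and real detectors may differ in these details.

```python
import math


def roi_max_pool(feature_map, roi, out_h=7, out_w=7, spatial_scale=1.0):
    """Illustrative ROI max pooling (hypothetical helper, not a library API).

    feature_map: nested list indexed as [channel][row][col] (C x H x W).
    roi: (x1, y1, x2, y2) in image coordinates, inclusive corners (assumed).
    Returns a nested list of size C x out_h x out_w.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    # Project the proposal from image coordinates onto the feature map.
    x1, y1, x2, y2 = (int(math.floor(v * spatial_scale + 0.5)) for v in roi)

    # Force the projected region to cover at least one feature-map cell.
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)

    # Each output cell covers a (possibly fractional) bin of the region.
    bin_h = roi_h / out_h
    bin_w = roi_w / out_w

    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(channels)]
    for c in range(channels):
        for ph in range(out_h):
            # Row range of this bin, clipped to the feature map.
            hs = min(max(y1 + math.floor(ph * bin_h), 0), height)
            he = min(max(y1 + math.ceil((ph + 1) * bin_h), 0), height)
            for pw in range(out_w):
                # Column range of this bin, clipped to the feature map.
                ws = min(max(x1 + math.floor(pw * bin_w), 0), width)
                we = min(max(x1 + math.ceil((pw + 1) * bin_w), 0), width)
                if he <= hs or we <= ws:
                    continue  # empty bin keeps the default value 0.0
                out[c][ph][pw] = max(
                    feature_map[c][h][w]
                    for h in range(hs, he)
                    for w in range(ws, we)
                )
    return out


# Usage: one channel, 4x4 feature map, two proposals of different sizes.
fmap = [[[r * 4 + c for c in range(4)] for r in range(4)]]
pooled_full = roi_max_pool(fmap, (0, 0, 3, 3), out_h=2, out_w=2)
pooled_part = roi_max_pool(fmap, (1, 1, 2, 3), out_h=2, out_w=2)
```

Both proposals yield a 2×2 output per channel even though the regions differ in size. The nested loops over channels, output cells, and bin contents give a sense of why the operation is costly when it is repeated for every proposal.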
It would be desirable to implement a new region of interest pooling method for fast and energy-efficient object detection with a convolutional neural network.