Object detection becomes a hot topic in the computer vision field, and is aimed at recognizing and detecting each object instance of a category of interest in an image or a video. The object detection is an important part in various applications such as automatic driving, unmanned aerial vehicles and gesture-based interaction systems. A highly efficient camera, a real-time vision processing algorithm based on an embedded high power efficient processor and the stable performance all are critical to the practical application of object detection.
In many important application scenarios such as automatic driving, unmanned aerial vehicles, family assistances and gesture-based man-machine interaction, the object detection technologies play a core role. Conventional object detection methods use a variable component model and variants thereof as a mainstream. In such methods, by using image descriptors such as Histogram of Oriented Gradient (HOG), Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP) as features, the whole image is traversed by a sliding window to find a maximum response region of a certain category.
Recently, with the rapid development of the deep learning technology, the object detection technologies based on deep neural networks have become mainstream technical methods in this field due to their remarkable performance. At present, majority of the object detection technologies based on deep neural networks are established under the framework of a Faster Region Convolutional Neural Network (FRCNN): first, a serial of convolution operations are performed on an input image to obtain a feature map; then, according to the position and scale of a Region of Interest (ROI) in the feature map, a feature having a fixed length is dynamically pooled from the image feature map as the feature of this ROI; and finally, an object in the ROI is classified by using the feature of the ROI and a bounding box for this object is regressed.
Although such methods based on convolutional neural networks have excellent detection performance, the methods are generally run on a GPU only since a large amount of storage spaces and computing resources are required. Consequently, the requirements of applications of embedded electronic apparatuses cannot be satisfied.
In order to increase the speed of the detection algorithm, there are some more efficient network structures. In such methods, instead of depending on the ROI-based dynamic pooling, an object is directly classified by the feature of each point in an image feature map, and parameters for a bounding box for this object are regressed. Compared with the FRCNN detection model, such methods can increase the speed by 2.5 times while ensuring the accuracy, or increase the speed by 8.6 times while reducing the accuracy by about 10%. Despite this, there is still a gap of dozens of times from the requirements of the high-efficient embedded applications.
For the practical applications such as automatic driving, unmanned aerial vehicles, family assistances and gesture-based interaction systems, high power efficiency is the prerequisite for the extensive use of the object detection algorithm. However, although the detection methods based on convolutional neural networks have excellent detection performance, the methods are generally run on a GPU only since a large amount of storage spaces and computing resources are required. Consequently, the requirements of applications of embedded electronic apparatus cannot be satisfied.
A Dynamic Vision Sensor (DVS) camera has the characteristic of high power efficiency. However, since the existing object detection algorithms based on neural networks are all high in complexity, the power consumption of the whole vision detection system is still very high. As a result, the requirements of the practical applications cannot be satisfied.
Unlike the images generated by a conventional Complementary Metal Oxide Semiconductor (CMOS) or Charge-coupled Device (CCD) sensor, a DVS sensor generates events according to the change in illumination intensity in a scene, and the generated images have the characteristics of sparsity and binarization.