The present disclosure relates to object recognition using layer-based object detection with deep convolutional neural networks.
Today many computer systems and machines rely on person recognition techniques for a variety of applications. In some example applications, machines and computer systems need to know whether a human is present (or which human is present) at a particular location in order to turn on/off or activate a particular program. Person detection in particular is often a fundamental skill in human-robot interaction. In general, a robot needs to know where a person is in order to interact with them.
While some progress has been made at detecting people in public places (e.g., see P. F. Felzenszwalb, R. B. Girshick, D. McAllester and D. Ramanan, “Object Detection with Discriminatively Trained Part Based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, 2010; and T. Linder and K. O. Arras, “People Detection, Tracking and Visualization using ROS on a Mobile Service Robot,” in Robot Operating System (ROS): The Complete Reference, Springer, 2016), in other domains, such as a home environment, the challenges are particularly difficult.
One solution that has been developed to improve object recognition is to use a layer-based classification/object detection system to differentiate between classes of objects. The layer-based classification uses a segmented depth image to differentiate between two or more classes of people. However, one common error arises in layer-based classification using depth images, especially in a moving object detection system (such as a robot): when the system approaches a square object at an off angle (e.g., 45 degrees), that object appears curved in the depth image, making it difficult to distinguish from people. In moving object detection systems, a false positive classification that happens when a robot approaches an object at an unusual angle can result in the robot becoming stuck.
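The geometry behind this error can be illustrated with a short sketch. Assuming a simple pinhole depth camera (the focal length and distance values below are illustrative, not taken from any cited system), the depth recorded along one image row for a flat wall is dist / (cos θ + (u/f) sin θ), where θ is the angle between the wall's normal and the optical axis and u is the pixel offset. Viewed head-on (θ = 0) the profile is constant, but at 45 degrees it becomes a reciprocal-of-linear function, so the flat surface traces a curved depth profile much like a cylindrical object (or a person's torso):

```python
import math

def depth_profile(theta_deg, dist=2.0, focal=500.0, pixels=range(-100, 101, 10)):
    """Depth z(u) along one image row for a planar wall whose normal makes
    angle theta with the camera's optical axis (pinhole camera model).
    Derivation: plane n.p = dist with n = (sin t, 0, cos t); the ray through
    pixel u is p = z * (u/focal, 0, 1), giving z = dist / (cos t + (u/focal) sin t)."""
    th = math.radians(theta_deg)
    return [dist / (math.cos(th) + (u / focal) * math.sin(th)) for u in pixels]

def second_differences(zs):
    """Discrete curvature proxy: exactly zero for a constant or linear profile,
    nonzero when the profile bends."""
    return [zs[i - 1] - 2 * zs[i] + zs[i + 1] for i in range(1, len(zs) - 1)]
```

Running `second_differences(depth_profile(0.0))` yields all zeros (the wall reads as flat), while `second_differences(depth_profile(45.0))` is uniformly nonzero, i.e., the same wall reads as a curved surface, which is precisely the ambiguity described above.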
Another solution that has been developed to improve object recognition is to use a deep convolutional neural network to classify objects and/or images. The use of deep convolutional neural networks to classify objects and/or images is a relatively recent phenomenon. Although the algorithm itself is many decades old, there has been significant recent work in optimizing these algorithms for large data sets and improving their speed and precision. Most notably, work published by Krizhevsky, Sutskever and Hinton at the University of Toronto detailed a specific network architecture, referred to as “AlexNet,” that performed well on the large object recognition challenge, ImageNet. See A. Krizhevsky, I. Sutskever and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Neural Information Processing (NIPS), Lake Tahoe, Nev., USA, 2012.
The deep convolutional neural network often utilizes RGB images to classify objects and/or images. While recent improvements to deep convolutional neural networks have shown success at large-scale object image recognition, as well as at increasing the size of the training set and the tolerance of noise, the deep convolutional neural network suffers from a significant weakness: it is overly reliant on a single sensing modality (e.g., RGB image data). Not only is segmenting in RGB much more difficult and computationally expensive, but the classifier itself emphasizes learning a decision boundary based on edges and textures, features that may not be the only, or even the best, choice depending on the sensing modality and the object being recognized.
AlexNet, however, does not solve the segmentation problem: when restricted to color images, other algorithms, such as graph cuts, were used to extract object bounding boxes, which were then classified. See R. Girshick, J. Donahue, T. Darrell and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition, Columbus, Ohio, 2014. Alternatively, there has been a notable effort to move beyond the single-modality limitations by incorporating depth. See C. Couprie, C. Farabet, L. Najman and Y. LeCun, “Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos,” Journal of Machine Learning Research, vol. 15 (October), pp. 3489-3511, 2014.
Couprie et al. dramatically reduce the number of bounding boxes to evaluate by applying watershed cuts to depth images for image segmentation prior to RGB classification. Gupta et al. go a step further by including the depth data in the segmentation and classification steps. See S. Gupta, R. Girshick, P. Arbeláez and J. Malik, “Learning Rich Features from RGB-D Images for Object Detection and Segmentation,” in European Conference on Computer Vision, Zurich, Switzerland, 2014. Their work, however, requires knowledge of the camera orientation in order to estimate both height above ground and angle with gravity for every pixel in the image for use with AlexNet.
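The depth-first segmentation idea above can be sketched in a few lines. The following is a deliberately simplified stand-in for watershed cuts (plain connected-component labeling of foreground depth pixels; function names and the background threshold are illustrative): close pixels are grouped into regions, and each region yields one bounding box that a downstream RGB classifier would evaluate, instead of scanning every possible window.

```python
import numpy as np

def depth_segment_boxes(depth, background=4.0):
    """Label 4-connected foreground regions (pixels closer than `background`)
    and return one bounding box per region as (row_min, col_min, row_max,
    col_max), inclusive. A simplified proxy for depth-based segmentation."""
    fg = depth < background
    labels = np.zeros(depth.shape, dtype=int)
    boxes = []
    for r in range(depth.shape[0]):
        for c in range(depth.shape[1]):
            if fg[r, c] and labels[r, c] == 0:
                comp = len(boxes) + 1          # new component id
                stack = [(r, c)]
                labels[r, c] = comp
                rmin = rmax = r
                cmin = cmax = c
                while stack:                   # iterative flood fill
                    y, x = stack.pop()
                    rmin, rmax = min(rmin, y), max(rmax, y)
                    cmin, cmax = min(cmin, x), max(cmax, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < depth.shape[0] and 0 <= nx < depth.shape[1]
                                and fg[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = comp
                            stack.append((ny, nx))
                boxes.append((rmin, cmin, rmax, cmax))
    return boxes
```

On a synthetic depth image containing two near objects against a far background, this returns exactly two candidate boxes, illustrating why segmenting in depth first shrinks the classification workload so sharply.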
Within the domain of person detection there is also multimodal fusion work focused on improving specific classifiers using a combination of RGB and depth information. People can be detected in depth data alone, as demonstrated by previous work in layered person detection and contour estimation, and they can be detected in monocular camera data, either color or grayscale images. See E. Martinson, “Detecting Occluded People for Robotic Guidance,” in Robots and Human Interactive Communication (RO-MAN), Edinburgh, UK, 2014; L. Spinello, K. Arras, R. Triebel and R. Siewart, “A Layered Approach to People Detection in 3D Range Data,” in Proc. of the AAAI Conf. on Artificial Intelligence: Physically Grounded AI Track, Atlanta, Ga., 2010; and N. Kirchner, A. Alempijevic, A. Virgona, X. Dai, P. Ploger and R. Venkat, “A Robust People Detection, Tracking and Counting System,” in Australian Conf. on Robotics and Automation, Melbourne, Australia, 2014.
The advantage of using the two modalities, however, is that the failure points for depth-based recognition are not the same as the failure points for color-based recognition. Given a registered color and depth image, a number of systems have been developed to take advantage of the fusion of these two modalities.
The method described by Spinello and Arras (Univ. of Freiburg) fuses these two modalities by applying similar classifiers in each domain. See L. Spinello and K. Arras, “People Detection in RGB-D Data,” in Int. Conf. on Intelligent Robots and Systems (IROS), San Francisco, USA, 2011. The depth image is used to first identify regions of interest based on groups of neighboring pixels. Then the histogram of oriented gradients (HOG), originally developed for object recognition in RGB images and widely used in color-based person detection, is calculated for regions of interest in the color image. A second, related algorithm, the histogram of oriented depths (HOD), is then applied to the depth image objects, and the resulting combined vector is classified using a support vector machine. More recent work from Freiburg (see above) integrates other publicly available detectors, including one included with the point cloud library. See M. Munaro, F. Basso, E. Menegatti, “Tracking people within groups with RGB-D data,” in International Conference on Intelligent Robots and Systems (IROS) 2012, Villamoura, Portugal, 2012.
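The shared-feature idea behind HOG/HOD fusion can be sketched as follows. This is a minimal numpy illustration, not the published descriptors: a single unsigned gradient-orientation histogram (omitting the cell/block normalization of full HOG and HOD) is computed in each modality, and the two histograms are concatenated into the combined vector that, in the published system, would be classified by a support vector machine:

```python
import numpy as np

def orientation_histogram(img, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations,
    L2-normalized. A stripped-down stand-in for one HOG/HOD cell."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    hist = np.array([mag[idx == b].sum() for b in range(bins)])
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def hog_hod_vector(gray_patch, depth_patch, bins=9):
    """Concatenate the color-domain and depth-domain histograms into the
    combined feature vector (which an SVM would then classify)."""
    return np.concatenate([orientation_histogram(gray_patch, bins),
                           orientation_histogram(depth_patch, bins)])
```

Because the same descriptor is computed in both domains, both halves of the vector encode the same kind of edge-orientation evidence, which is exactly why the two classifiers end up learning essentially the same decision boundary.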
Another related RGB-D classification system was published by the University of Michigan, whereby additional modalities such as motion cues, skin color, and detected faces are also added to a combined classification system. See W. Choi, C. Pantofaru, S. Savarese, “Detecting and Tracking People using an RGB-D Camera via Multiple Detector Fusion,” in Workshop on Challenges and Opportunities in Robot Perception (in conjunction with ICCV-11), 2011. Although both of these methods make use of classifiers from both the RGB and depth domains, neither one takes advantage of the precision increase a convolutional neural network can enable. Because the first method uses two very similar classifiers (HOG vs. HOD) to handle the cross-domain fusion, the system learns essentially the same decision boundary in both domains and will fail whenever that boundary is difficult to identify. The second method, by contrast, employs a variety of different detectors across different domains; however, the majority (e.g., motion cues, skin color, and face detection) are very weak classifiers for the general detection problem, as compared with convolutional neural networks.
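A combined classification system of this kind can be abstracted as late fusion of per-detector confidences. The sketch below is illustrative only (the weights, bias, and logistic squashing are assumptions, not the cited method): each detector contributes a weighted vote, and weak cues such as motion or skin color carry small weights, so they can tip a borderline decision but cannot carry the detection on their own.

```python
import math

def fuse_detectors(scores, weights, bias=0.0):
    """Late fusion of per-detector confidence scores: a weighted sum
    squashed through a logistic function into a combined probability."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, with weights of 2.0, 1.5, and 0.5 for a depth detector, a color detector, and a weak motion cue, unanimous high scores produce a confident fused detection, while unanimous low scores suppress it; the weak cue alone moves the output only slightly.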
Therefore, a solution is needed that reduces errors in classifying depth images without incurring the increased processing time and computational cost of a convolutional neural network.