Some embodiments of the presently disclosed subject matter relate to automated driving of vehicles. In particular, some embodiments relate to a method of processing a three-dimensional (3D) point cloud derived using a Lidar sensor or the like. Some embodiments further relate to a method for classifying an object of interest within a 3D point cloud, based on such a processing method and using machine learning.
The methods of some embodiments are especially useful in the field of human-assisted or autonomous vehicles that use a depth sensor, such as a Lidar sensor, for obstacle detection and avoidance in order to navigate safely through their environments. A Lidar sensor measures distance by illuminating a target with laser light and timing the reflected return.
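As a minimal illustrative sketch (not part of any embodiment or related art described herein), the time-of-flight principle underlying Lidar distance measurement can be expressed as follows; the function name and pulse timing are assumptions chosen for illustration:

```python
# Illustrative sketch only: time-of-flight distance estimation as used
# by Lidar-type depth sensors.

C = 299_792_458.0  # speed of light in vacuum, m/s

def distance_from_round_trip(round_trip_s: float) -> float:
    """Distance to a target given the laser pulse's round-trip time.

    The pulse travels to the target and back, so the one-way distance
    is half the total path length covered at the speed of light.
    """
    return C * round_trip_s / 2.0

# Example: a pulse returning after 100 ns corresponds to roughly 15 m.
d = distance_from_round_trip(100e-9)
```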
A related art publication, “Convolutional-Recursive Deep Learning for 3D Object Classification”, Richard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, Andrew Y. Ng, NIPS 2012, describes a method for 3D object classification using convolutional-recursive deep learning. The input to the classification system requires RGB-D information, which is composed of RGB (red, green, blue) data from a camera and D (distance) data from a depth sensor (stereo, time-of-flight, Lidar, etc.). Such a method presents several drawbacks, among which are the use of several sensors, namely a camera and a depth sensor, and the need to integrate data from the camera (image) with data from the depth sensor (coordinates) before feeding the deep learning network.
Another related art publication, “Towards 3D object recognition via classification of arbitrary object tracks”, Alex Teichman, Jesse Levinson, and Sebastian Thrun (ICRA 2011), describes a method of object recognition. In the described method, each source of data is used to compute hand-crafted features in a sequential manner, followed by machine learning classification in a series of cascades. This processing flow is optimized to solve mostly generic, highly repetitive cases, but performs poorly in less generic situations due to limited flexibility in parameter tuning, which cannot be optimal for all situations at the same time. Finally, only intensity data is presented in the form of 2D image maps, which limits parallel access and processing to a single source of information.
Another related art European patent publication, EP 2 958 049 A2, describes a method of extracting feature regions from a point cloud. The described method uses a hand-crafted process for key point selection and descriptor computation for the corresponding voxel, which is later classified by a machine learning algorithm. Such a processing pipeline, where features are pre-selected, does not allow a Deep Neural Network (DNN) to realize its potential in automatically discovering features, since a significant part of the information is pre-filtered by the choice of hand-crafted methods. Such pre-filtering could be beneficial for certain types of situations, while penalizing performance in many other real-world situations. Further, the described method does not compensate for the low density of point cloud data, and therefore has lower recognition accuracy for objects at far distances.
Another related art publication, “Obstacle Classification and 3D Measurement in Unstructured Environments Based on ToF Cameras” by Yu et al., describes an obstacle detection and classification method based on the use of Time-of-Flight (ToF) cameras for robotic navigation in unstructured environments. While using a different kind of sensor (ToF rather than Lidar), intensity measurement is performed using an SR-3000 sensor controlled as a so-called 1-tap sensor. This means that, in order to obtain reliable distance information, four consecutive exposures have to be performed. Fast-moving targets in the scene may therefore cause errors in the distance calculation. According to the system parameters defined in the manual of the SR-3000 sensor (aiweb.techfak.uni-bielefeld.degiles/SR3000_manual_V1.03.pdf), the described method uses intensity up to 7.5 m at 850 nm wavelength. Further, the described method is based on a feature engineering approach, where all features are hand-crafted, so that the way data is combined remains unchangeable independently of the training data. It rarely, if ever, uses a combination of intensity and 3D information to form features for object recognition, but instead processes either 3D data or 2D data consecutively, despite having a direct correspondence between the intensity value and the 3D measurement for each pixel. Furthermore, the described method uses intensity to filter noise during segmentation of regions of interest, to improve 3D obstacle clustering. More particularly, the described method uses 4 separate channels I, X, Y, Z, which form several separate spaces and cannot be processed easily and efficiently by convolutional deep neural networks without an additional processing stage.
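The channel-layout issue noted above can be illustrated with a short sketch: four per-pixel quantities (I, X, Y, Z) kept as separate 2D maps can, in principle, be stacked along a channel axis into a single multi-channel tensor, which is the layout convolutional networks consume directly. The resolution, array shapes, and NumPy usage here are illustrative assumptions, not the method of any cited publication:

```python
import numpy as np

H, W = 4, 6  # assumed sensor resolution, for illustration only

# Four separate per-pixel maps, as produced by a sensor that keeps
# intensity and 3D coordinates in distinct channels.
I = np.random.rand(H, W)   # intensity per pixel
X = np.random.rand(H, W)   # 3D x-coordinate per pixel
Y = np.random.rand(H, W)   # 3D y-coordinate per pixel
Z = np.random.rand(H, W)   # 3D z-coordinate per pixel

# Stacking the maps along a trailing channel axis yields one
# (H, W, 4) tensor that a convolutional network can process in a
# single pass, rather than four disjoint 2D maps.
stacked = np.stack([I, X, Y, Z], axis=-1)
assert stacked.shape == (H, W, 4)
```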
These related art publications address the problem of object classification by combining data from multiple sensors and/or using hand-crafted feature selection processes, which gives acceptable results in common everyday situations. However, the combination of sensors practically limits the conditions of use to situations where all of the sensors can efficiently capture a signal (e.g. daytime, good weather conditions, etc.), and fails to deliver accurate results in situations where one of the sensors cannot resolve the captured signal (e.g. a camera at night, rainy conditions, etc.). Further, a hand-crafted feature selection process can only be tuned to achieve maximum performance in very few typical situations (e.g. the most common ones), while in other (rare or unique) cases it cannot achieve the same performance, due to the absence of the hand-crafted features needed for robust classification.