Deep learning (DL) has gained significant attention over the last few years due to its use in a wide variety of applications. Deep neural networks (DNNs) require numerous training samples to converge and produce meaningful output. These training samples must be annotated and must be of very high quality for such DNNs to be implemented successfully. One challenge is producing large numbers of training samples with high-quality annotations.
One particular application of DNNs is in the training and testing of driver assist (DA) and autonomous driving (AD) systems in the automotive and active safety fields. These systems require large amounts of annotated image data to train and test their various functionalities, such as object detection, drivable surface recognition, semantic segmentation, and object tracking using camera and video images. The challenge in using manually annotated image data for these tasks is twofold. First, manually annotated image data is expensive in terms of the time and expertise required to segment regions of interest (ROIs). Thus, there is a need to obtain fast and accurate semantic proposals from minimally supervised and/or active learning algorithms that reduce manual annotation time and cost. Second, algorithms often do not scale across datasets. A proposal generation algorithm that works with one dataset may not provide the same performance with another dataset. Thus, there is a need for a generalizable proposal algorithm with low computational complexity.
Typically, DL algorithms are capable of extracting high-level features from images and videos by exploiting local and global spatial characteristics. However, such DL algorithms require a large number of training samples to adequately learn and perform. As an alternative, echo state networks (ESNs) are capable of high-level feature abstraction from a small number of image frames. ESNs have been studied quite extensively and have been applied in numerous fields for other purposes. The primary assumption underlying existing ESNs for semantic image segmentation is that all images in the dataset under consideration have similar spatial orientations and segmentation objectives. This assumption leads to the inherent property of ESNs that, at the end of a training batch of images, the reservoir nodes achieve a steady state, regardless of the initial conditions. However, semantic segmentation tasks cannot always assure similar spatial and/or intensity orientations, and the segmentation objectives can often vary. For instance, images acquired by mobile phones with centralized objects of interest for foreground segmentation tasks must be treated separately from wide-angle scenery images acquired from vehicle cameras with the objectives of object and drivable surface segmentation.
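By way of illustration only, the steady-state behavior described above can be sketched with a small reservoir driven from two different initial states. This is a minimal sketch and not the method of any cited work: the function names (make_reservoir, step, state_gap), the uniform random weights, and the row normalization are all assumptions made for this example. The recurrent weights are scaled so that each row has an absolute sum below 1, which makes the update a contraction and guarantees that the initial state is forgotten; ESN implementations more commonly scale by spectral radius instead.

```python
import math
import random


def make_reservoir(n_nodes, n_inputs, seed=0, row_norm=0.9):
    """Build random input and recurrent weights (hypothetical helper)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_nodes)]
    w = [[rng.uniform(-1, 1) for _ in range(n_nodes)] for _ in range(n_nodes)]
    # Assumption: rescale each row so its absolute sum equals row_norm (< 1).
    # With tanh being 1-Lipschitz, the state update is then a contraction in
    # the max-norm, so two trajectories under the same input must converge.
    for row in w:
        total = sum(abs(v) for v in row)
        for j in range(n_nodes):
            row[j] *= row_norm / total
    return w_in, w


def step(w_in, w, state, u):
    """One reservoir update: x <- tanh(W_in u + W x)."""
    return [
        math.tanh(
            sum(a * b for a, b in zip(w_in[i], u))
            + sum(a * b for a, b in zip(w[i], state))
        )
        for i in range(len(state))
    ]


def run(w_in, w, state, inputs):
    """Drive the reservoir with a sequence of input vectors."""
    for u in inputs:
        state = step(w_in, w, state, u)
    return state


def state_gap(n_steps=200, n_nodes=20, seed=0):
    """Largest per-node difference between two runs that share the same
    inputs but start from opposite initial states."""
    w_in, w = make_reservoir(n_nodes, 1, seed)
    rng = random.Random(seed + 1)
    inputs = [[rng.uniform(0, 1)] for _ in range(n_steps)]
    a = run(w_in, w, [1.0] * n_nodes, inputs)
    b = run(w_in, w, [-1.0] * n_nodes, inputs)
    return max(abs(x - y) for x, y in zip(a, b))
```

Calling state_gap() with the default 200 steps yields a gap close to zero, whereas state_gap(n_steps=0) returns the full initial gap of 2.0. The convergence depends on both runs receiving the same input sequence, which mirrors the assumption above that the images in a batch are similar.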
More generally, various works over the years have performed image segmentation using graph-cut techniques. These works segment each image into several super-pixel regions and then identify each region as a positive or negative bag region. All regions that are identified as positive bags are combined to generate the desired ROI. The disadvantage of this process is that it is slow, due to the super-pixel implementation, and lacks scalability. Other works utilize probabilistic measures and textural features in image sub-regions to decide whether to include each sub-region in the ROI. This process utilizes pixel intensity and texture along with a graph-based approach for image segmentation. The process, although minimally supervised, relies heavily on handcrafted intensity and textural features and fails to extract high-level features from multiple image planes.
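The region-based process described above (partition into regions, label each region as a positive or negative bag, and combine the positive regions into the ROI) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of any cited work: fixed grid cells stand in for super-pixels, a mean-intensity threshold stands in for the positive/negative bag classifier, and the names grid_regions and roi_from_positive_regions are hypothetical.

```python
def grid_regions(height, width, cell):
    """Yield (row0, row1, col0, col1) blocks.

    Assumption: fixed grid cells are a stand-in for super-pixels."""
    for r0 in range(0, height, cell):
        for c0 in range(0, width, cell):
            yield r0, min(r0 + cell, height), c0, min(c0 + cell, width)


def roi_from_positive_regions(image, cell=2, threshold=0.5):
    """Return a binary mask that is the union of all positive regions.

    Assumption: a region is a positive bag when its mean intensity exceeds
    the threshold; a real system would use a learned classifier here."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for r0, r1, c0, c1 in grid_regions(height, width, cell):
        pixels = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        if sum(pixels) / len(pixels) > threshold:
            # Positive bag: add every pixel of this region to the ROI.
            for r in range(r0, r1):
                for c in range(c0, c1):
                    mask[r][c] = 1
    return mask
```

The per-region loop also shows where the cost arises: each region is visited and scored separately, so the runtime grows with the number of regions before any ROI is produced.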
Other works propose the use of super-pixel segmentation followed by the implementation of a density-based spatial clustering of applications with noise (DBSCAN) algorithm and some spatial priors to segment an ROI. This process is again slow due to the super-pixel segmentation process and lacks scalability. Other works introduce a two-round active learning framework to learn from video sequences and enable object tracking using a dataset. This framework performs offline learning using wavelets and a classifier and applies particle filtering to track multiple objects. This process lacks scalability across datasets and requires large sets of training data (e.g., several thousand samples of objects) to update the system.
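For reference, the DBSCAN clustering step mentioned above can be sketched in a few lines. This is a minimal, unoptimized sketch of the general DBSCAN procedure and not the implementation of any cited work: the function names are hypothetical, the points are assumed to be 2D coordinates (for example, super-pixel centroids), and the spatial priors are omitted.

```python
def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (inclusive)."""
    px, py = points[idx]
    return [
        j
        for j, (qx, qy) in enumerate(points)
        if (px - qx) ** 2 + (py - qy) ** 2 <= eps * eps
    ]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None means not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
        cluster += 1
    return labels
```

For two well-separated groups of four points and one isolated point, dbscan(points, eps=1.5, min_pts=3) assigns the groups to clusters 0 and 1 and marks the isolated point as noise (-1). The brute-force neighbor query used here is quadratic in the number of points.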
Other works first isolate object bounding boxes using a bird's-eye-view (BEV) plane, followed by feature extraction and a classifier to eliminate false detections. This method also requires a large number of training samples for classification and lacks generalizability. Several benchmarking datasets have been made publicly available for object segmentation tasks, such as the Weizmann 1 and 2 object segmentation datasets, the ADE20K dataset from MIT, and the LISA vehicle detection dataset. Some prior works have focused on utilizing ESNs for image segmentation, using the reservoir states as features for classification and readout. However, all of these methods lack scalability and fail to utilize spatial neighborhood-based features for high-level feature abstraction.