Self-driving vehicles are beginning to be test-driven on public roads, but their records have been marred by (so far) minor incidents. One impediment to their widespread adoption is their occasional inability to recognize the objects surrounding them as they move. At the heart of the issue is the efficacy of the machine vision that the vehicles employ to recognize those objects.
Machine vision is carried out using machine learning models, which require training on large datasets of images featuring a particular “target” object of interest. For training to be effective, the datasets must be large enough to include sufficient examples of variations of the target object, for example in shape, size, color, perspective, and orientation. In addition, the example images must be annotated in a way that distinguishes the target object from the background and from other objects in the scene.
In the automotive field, training an object detector (e.g., a vehicle or pedestrian detector) requires tens of thousands of examples of the target object. The difficulty in obtaining such a dataset lies in the large number of factors that must be covered when gathering the images. These factors include variations in the type of environment (urban, suburban, or rural), weather conditions, lighting conditions, and perspectives of the target object. Gathering such a large dataset has conventionally required equipping a vehicle with one or more image-capturing devices (e.g., cameras), recording equipment, and data storage.
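To illustrate why the number of factors makes data gathering difficult, the following minimal sketch enumerates every combination of conditions a diverse dataset would need to cover. Only the environment types (urban, suburban, rural) come from the text; the weather, lighting, and perspective values, and the names `condition_grid` and `missing_conditions`, are hypothetical choices made for illustration.

```python
from itertools import product

# Environment types are named in the text; the other value lists are
# assumptions chosen only to make the example concrete.
ENVIRONMENTS = ("urban", "suburban", "rural")
WEATHER = ("clear", "rain", "snow", "fog")   # assumed values
LIGHTING = ("day", "dusk", "night")          # assumed values
PERSPECTIVES = ("front", "rear", "side")     # assumed values


def condition_grid():
    """Return every (environment, weather, lighting, perspective) combination."""
    return list(product(ENVIRONMENTS, WEATHER, LIGHTING, PERSPECTIVES))


def missing_conditions(captured):
    """Return the combinations for which no image has been captured yet."""
    seen = set(captured)
    return [combo for combo in condition_grid() if combo not in seen]


if __name__ == "__main__":
    captured = [("urban", "clear", "day", "front")]
    print(len(condition_grid()), "combinations in total")
    print(len(missing_conditions(captured)), "combinations still uncovered")
```

Even with these few assumed values the grid contains 3 x 4 x 3 x 3 = 108 combinations, each of which would need many example images, which suggests why conventional data gathering by driving is so laborious.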
Furthermore, for the gathered dataset to be useful for training, it must be fully annotated. A “ground truth” selection of the target object must be created for each image, which guides the machine learning model in recognizing the object. Ground truth data includes various attributes of an object in a given scene such as, but not limited to, its position, size, occlusion level, presence within a group of other objects, and orientation.
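The ground truth attributes listed above could be represented as a simple record per annotated object. The following is a minimal sketch under stated assumptions, not a prescribed format: the class name `GroundTruth`, the field names, the pixel and degree units, the 0-to-1 occlusion scale, and the helper `usable_for_training` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruth:
    """One annotated target object in one image (hypothetical schema)."""
    label: str              # e.g. "vehicle" or "pedestrian"
    x: float                # position: left edge of bounding box, in pixels (assumed unit)
    y: float                # position: top edge of bounding box, in pixels (assumed unit)
    width: float            # size: bounding box width, in pixels
    height: float           # size: bounding box height, in pixels
    occlusion: float        # occlusion level, assumed scale 0.0 (visible) to 1.0 (hidden)
    in_group: bool          # whether the object is part of a group of other objects
    orientation_deg: float  # orientation, assumed to be in degrees

    def __post_init__(self):
        # Reject records that cannot describe a real bounding box.
        if self.width <= 0 or self.height <= 0:
            raise ValueError("width and height must be positive")
        if not 0.0 <= self.occlusion <= 1.0:
            raise ValueError("occlusion must be between 0.0 and 1.0")

    def area(self):
        """Bounding box area in square pixels."""
        return self.width * self.height


def usable_for_training(gt, max_occlusion=0.5):
    """Keep only objects that are not too heavily occluded (assumed threshold)."""
    return gt.occlusion <= max_occlusion
```

A record such as `GroundTruth("pedestrian", 10, 20, 30, 60, 0.25, False, 90.0)` would then carry the attributes an annotator records for a single object.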
All currently known solutions require driving an equipped vehicle through the various environmental, weather, lighting, and perspective conditions necessary to obtain a diverse dataset. The resulting images are then manually annotated with ground truth data for each image in which the target object is present.