A vehicle (such as a fully- or partially-autonomous or self-driving vehicle) typically includes one or more sensors to detect and sense an environment in which the vehicle is located and through which the vehicle may be moving. The sensed data (along with other data, in some cases) may be utilized to control vehicle operations and maneuvers. The sensors may be any type or types of sensors which are capable of sensing various objects and/or conditions within the vehicle's environment, such as lidar, radar, cameras, and/or other types of sensors. The vehicle may also include other sensor devices, such as inertial measurement units (IMUs), and/or include other types of devices that provide information on the current position of the vehicle (e.g., a GPS unit).
The data generated by the sensors (and possibly other data) may be processed by a perception component of the autonomous vehicle, which outputs signals indicative of the current state of the vehicle's environment. The output signals generated by the perception component of the vehicle may be utilized to control various driving operations and maneuvers of the vehicle (e.g., steering direction, speed, braking force, etc.). In an example implementation, the perception component may identify (and possibly classify and/or track) objects within the vehicle's environment. As a more specific example implementation, the perception component may include (1) a segmentation module that partitions images obtained via the various sensors into regions that correspond to probable objects, (2) a classification module that determines labels/classes for the segmented objects, and (3) a tracking module that tracks segmented and/or classified objects over time (e.g., across image frames). For example, based on data provided by one or more of the vehicle sensors, the perception component may discern, identify, classify, and/or track the presence and positions of objects or particular types thereof within the vehicle's environment, and/or may track the configuration of the road (and any objects thereon) ahead of the vehicle. As such, one or more autonomous or self-driving behaviors of the vehicle may be controlled based on the objects that are segmented, classified, and/or tracked by the perception component of the vehicle over time.
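The three-module arrangement described above can be illustrated with a minimal sketch. It is not the implementation of any particular vehicle: the names `PerceivedObject`, `NearestCenterTracker`, and `PerceptionPipeline` are hypothetical, the segmentation and classification stages are supplied by the caller as plain functions, and the tracker uses a simple nearest-center match between consecutive frames as a stand-in for whatever tracking technique a real perception component would use.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical 2-D region format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class PerceivedObject:
    """One probable object as it moves through the three stages."""
    region: Box
    label: Optional[str] = None      # filled in by the classification stage
    track_id: Optional[int] = None   # filled in by the tracking stage


def box_center(box: Box) -> Tuple[float, float]:
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


class NearestCenterTracker:
    """Toy tracker (an assumption for illustration): an object keeps the ID
    of the closest unmatched object from the previous frame, provided the
    centers are within max_distance; otherwise it receives a new ID."""

    def __init__(self, max_distance: float = 50.0) -> None:
        self.max_distance = max_distance
        self._next_id = 0
        self._previous: Dict[int, Tuple[float, float]] = {}

    def update(self, objects: List[PerceivedObject]) -> List[PerceivedObject]:
        unmatched = dict(self._previous)
        current: Dict[int, Tuple[float, float]] = {}
        for obj in objects:
            cx, cy = box_center(obj.region)
            best_id, best_dist = None, self.max_distance
            for track_id, (px, py) in unmatched.items():
                dist = math.hypot(cx - px, cy - py)
                if dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                best_id = self._next_id
                self._next_id += 1
            else:
                del unmatched[best_id]
            obj.track_id = best_id
            current[best_id] = (cx, cy)
        self._previous = current
        return objects


class PerceptionPipeline:
    """Chains segmentation, classification, and tracking for each frame."""

    def __init__(
        self,
        segment: Callable[[Any], List[Box]],
        classify: Callable[[Any, Box], str],
        tracker: NearestCenterTracker,
    ) -> None:
        self.segment = segment
        self.classify = classify
        self.tracker = tracker

    def process_frame(self, frame: Any) -> List[PerceivedObject]:
        # (1) segmentation: regions that correspond to probable objects
        objects = [PerceivedObject(region=r) for r in self.segment(frame)]
        # (2) classification: a label/class for each segmented object
        for obj in objects:
            obj.label = self.classify(frame, obj.region)
        # (3) tracking: associate objects across image frames
        return self.tracker.update(objects)
```

In this sketch, a caller would construct `PerceptionPipeline` with its own segmentation and classification functions (which, in practice, might wrap trained models) and call `process_frame` once per image frame; the returned objects carry a region, a label, and a track identifier that persists across frames while the object stays matched.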
In some embodiments, one or more machine-learning-based models are trained and utilized by the autonomous vehicle to control the perception component to identify, classify, and/or track objects within the vehicle's environment, and as such, are referred to herein as one or more “perception models.” The perception models may be trained using any of various suitable types of learning, such as supervised learning, and may be trained using real-world image data and/or image data generated in a simulated environment that have been labeled according to “correct” outputs of one or more perception functions (e.g., segmentation, classification, and/or tracking). In some configurations, different models are utilized by each of the segmentation module, the classification module, and the tracking module. In some configurations, the segmentation, classification, and/or tracking modules utilize one or more common models.
Currently known techniques for identifying and labeling objects, for the purpose of generating training data for autonomous vehicle control models, require a human to use a computer tool to indicate and label objects within conventional, two-dimensional (“2-D”) visual images of vehicle environments. Such images are generated by a passive imaging device or system, for example, an optical system that uses a lens and a diaphragm and/or filter or other sensors to passively sense, detect, and capture the colors, intensities, etc. of incoming rays of light that are visible to the human eye and that have reflected off of objects within the vehicle environments. For example, conventional, 2-D visual images may be stored in data file formats such as JPEG, Exif, PNG, etc., and conventional dynamic 2-D visual images or videos may be stored in data file formats such as AVI, QuickTime, GIF, etc. Typically, to indicate and label objects that are depicted in conventional, 2-D visual images, a frame of a 2-D image is presented on a user interface. A human may utilize controls provided by the user interface to place a box around an object within the image (e.g., to “bound” the object, or to “place a bounding box around” the object), thereby distinguishing the object from other objects within the image, and provide a respective label for the bounded object (e.g., “car,” “person,” “bicycle,” etc.). A conventional, 2-D visual image may be manipulated by the user in two dimensions, such as by zooming in, zooming out, or translating, to aid the user in bounding the object.
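As an illustration of the kind of record that such a conventional labeling tool might produce, the following sketch represents one bounding box and its label for a single 2-D image frame. The class name `BoundingBoxLabel`, the helper `box_from_drag`, the field names, and the pixel-coordinate convention are all assumptions made for illustration and are not drawn from any specific tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class BoundingBoxLabel:
    """One human-drawn box around one object in one 2-D image frame.

    Coordinates are assumed to be pixels, with the origin at the top-left
    corner of the image (a common but not universal convention).
    """
    frame_id: str
    label: str      # e.g. "car", "person", "bicycle"
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def __post_init__(self) -> None:
        if self.x_max <= self.x_min or self.y_max <= self.y_min:
            raise ValueError("bounding box must have positive width and height")

    @property
    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def box_from_drag(
    frame_id: str,
    label: str,
    start: Tuple[float, float],
    end: Tuple[float, float],
) -> BoundingBoxLabel:
    """Build a label from the two corner points of a user's drag gesture.

    The corners may be given in any order; they are normalized so that
    (x_min, y_min) is the top-left corner and (x_max, y_max) the bottom-right.
    """
    (x0, y0), (x1, y1) = start, end
    return BoundingBoxLabel(
        frame_id=frame_id,
        label=label,
        x_min=min(x0, x1),
        y_min=min(y0, y1),
        x_max=max(x0, x1),
        y_max=max(y0, y1),
    )
```

Under these assumptions, each object a human bounds in each frame yields one such record, which is why the number of records, and the manual effort behind them, grows with both the number of frames and the number of objects per frame.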
As is commonly known, thousands, if not millions, of labeled image data frames are needed to sufficiently train a vehicle's perception component to be able to identify, classify, and/or track objects within the vehicle's environment with enough accuracy and within a short enough time window to allow for safe control and operation of the vehicle during a variety of driving conditions. Thus, each object depicted within these thousands or millions of training image data frames must be bounded and labeled, one bounding box at a time, by a human using a conventional labeling tool, which is not only time consuming and inefficient, but may also suffer from human errors, inaccuracies, and inconsistencies. Further, two-dimensional images are limited in their accuracy in portraying three-dimensional objects in their respective locations in space. As such, techniques are needed to decrease the time that is needed to label training image data (e.g., over multiple frames and multiple images) as well as to increase the efficiency and accuracy of the labeling itself, thereby increasing both the amount and quality of labeled data used to train the perception component, and ultimately increasing the safety of autonomous operation of a vehicle whose operations and maneuvers are controlled by the trained perception component.