In a variety of applications, images are analyzed to detect and identify specific types of road signs. For example, the perception system of an autonomous vehicle receives a data stream from a camera onboard the vehicle. A sign detection module of the perception system analyzes the images contained within the data stream and identifies the regions within the images that depict road signs. Not only is a road sign detected in an image, but the particular type of road sign, such as a speed limit sign, a railroad crossing sign, a school crossing sign or the like, is also identified. In addition to detecting the road signs within the images, the locations of the road signs may be determined. In this regard, the location at which an image is captured and, as such, the location of a road sign identified within the image may be determined in various manners including, for example, from position information associated with the image, such as position data provided by a global positioning system (GPS). A confidence measure may also be associated with each road sign identified from the images, based upon the degree of certainty with which the particular road sign has been identified. Based upon the detection of the road signs, the autonomous vehicle may be controlled so as to obey the various road signs.
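The output described above (a sign type, an image region, a location taken from the image's GPS fix, and a confidence measure) can be pictured as a simple record. The sketch below is illustrative only: the `SignDetection` schema, the helper names and the 0.8 confidence threshold are hypothetical assumptions, not part of any particular perception system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

@dataclass(frozen=True)
class SignDetection:
    """One road sign found in one camera image (hypothetical schema)."""
    sign_type: str                    # e.g. "speed_limit", "railroad_crossing"
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
    latitude: float                   # approximated by the image's GPS position
    longitude: float
    confidence: float                 # detector certainty in [0, 1]

def detections_for_image(raw: Iterable[Tuple[str, Tuple[int, int, int, int], float]],
                         image_lat: float, image_lon: float) -> List[SignDetection]:
    """Attach the capture position of the image to each (type, bbox, score) tuple."""
    return [SignDetection(t, b, image_lat, image_lon, s) for t, b, s in raw]

def actionable(detections: List[SignDetection],
               min_confidence: float = 0.8) -> List[SignDetection]:
    """Keep detections confident enough to act on (0.8 is an assumed threshold)."""
    return [d for d in detections if d.confidence >= min_confidence]
```

For example, `actionable(detections_for_image([("speed_limit", (10, 10, 50, 50), 0.93)], 48.137, 11.575))` keeps the single speed limit detection, tagged with the image's latitude and longitude.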
As another example, a mapping platform may be configured to review a plurality of images of various roadways and to identify road signs that appear within the images. The mapping platform may identify road signs in order to confirm the location and the identity of road signs that have previously been included within the map, thereby increasing the confidence values associated with those road signs. Additionally or alternatively, the mapping platform may be configured to identify road signs within the images that are not included within the existing map and may then supplement the map to include the newly identified road signs, thereby updating the map, for example, in an instance in which new road signs have recently been installed.
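A minimal sketch of such a map update follows, under stated assumptions: a detection confirms an existing map sign when it has the same sign type and lies within an assumed 15 m radius, confirmation raises the stored confidence by a noisy-OR combination (an assumed fusion rule chosen for illustration), and any unmatched detection is added as a new sign. The `MapSign` type and `update_map` function are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

@dataclass
class MapSign:
    sign_type: str
    lat: float
    lon: float
    confidence: float  # in [0, 1]

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def update_map(map_signs: List[MapSign], detections: List[MapSign],
               radius_m: float = 15.0) -> List[MapSign]:
    """Confirm known signs or add new ones (hypothetical matching rule)."""
    for det in detections:
        match = next((s for s in map_signs
                      if s.sign_type == det.sign_type
                      and haversine_m(s.lat, s.lon, det.lat, det.lon) <= radius_m), None)
        if match is not None:
            # Confirmation: noisy-OR fusion, so confidence can only increase.
            match.confidence = 1 - (1 - match.confidence) * (1 - det.confidence)
        else:
            # Sign not yet in the map, e.g. a newly installed sign.
            map_signs.append(MapSign(det.sign_type, det.lat, det.lon, det.confidence))
    return map_signs
```

With this rule, a stop sign stored at confidence 0.5 and re-detected about a metre away at confidence 0.5 ends up at 0.75, while a detection with no nearby sign of the same type is appended to the map.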
The performance of the sign detection module of a perception system or a mapping platform, as well as the confidence measures associated with the road signs that are identified, typically depends upon the type of machine learning algorithm that is utilized to train the sign detection module to identify road signs and, more particularly, to identify specific types of road signs. In instances in which the sign detection module has been trained on large volumes of data, however, the performance of the sign detection module may depend not only upon the nature of the machine learning algorithm, but also upon the quality of the training data.
With respect to the training data, a training data set may be obtained by manually annotating all road signs of interest within a plurality of images and then training the sign detection module to recognize the particular types of road signs within the images of the training data set. Even in instances in which the road signs are accurately annotated, the performance of the sign detection module may be less than desired for some types of road signs, such as road signs that occur relatively infrequently within the images of the training data set. For instance, stop signs may be frequently included in the images of a training data set since a stop sign may be placed at nearly every road intersection. However, other types of road signs, such as a road sign indicating that there is to be no through traffic, may be relatively uncommon. This class imbalance leads to a large skew in the data set, with one occurrence of a rare type of road sign for every thousand or more occurrences of the more common types of road signs. This imbalance in the presence of different types of road signs within the images of a training data set may cause the sign detection module to overfit to the more common types of road signs and to underfit, or to completely miss, the types of road signs that are much less common.
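The degree of skew can be measured directly from the annotations. The sketch below assumes each annotation is reduced to its sign-type label; the helper names and the 0.1% share used to flag a class as rare are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List

def class_counts(annotations: Iterable[str]) -> Counter:
    """Number of annotated occurrences per sign type."""
    return Counter(annotations)

def imbalance_ratio(counts: Counter) -> float:
    """Occurrences of the most common sign type per occurrence of the rarest one."""
    return max(counts.values()) / min(counts.values())

def rare_classes(counts: Counter, max_share: float = 0.001) -> List[str]:
    """Sign types making up at most `max_share` of all annotations (assumed cutoff)."""
    total = sum(counts.values())
    return sorted(c for c, n in counts.items() if n / total <= max_share)
```

For instance, 3,000 stop sign annotations, 2,000 speed limit annotations and 3 no-through-traffic annotations give an imbalance ratio of 1000.0, and only `"no_through_traffic"` falls below the assumed 0.1% share.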
In order to address this imbalance, training data sets may be augmented in an effort to include more instances, or a greater percentage, of the types of road signs that are otherwise relatively uncommon. Techniques for augmenting a training data set may include data folding or data augmentation techniques. As an example of a data folding technique for rebalancing the different types of road signs included within the images of a training data set, consider a training data set that includes three images of stop signs for every one image of a road sign indicating that there is to be no through traffic. In order to better train the sign detection module, the images of stop signs may be divided into three partitions. For each partition, all of the images of the road sign indicating that there is to be no through traffic may be utilized, such that the different types of road signs are balanced within each partition, and a classifier employed by the sign detection module may be trained on that balanced partition. Since a different classifier of the sign detection module is trained for each of the different partitions, the majority vote of the three classifiers may be utilized in this example to determine the type of road sign identified within an image of the training data set.
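The folding procedure above can be sketched as follows. This is a toy illustration under stated assumptions: each sample is a hypothetical (feature vector, sign type) pair, and a nearest-centroid classifier stands in for the deep network that an actual sign detection module would train on each partition.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]   # (hypothetical feature vector, sign type)
Classifier = Callable[[Sequence[float]], str]

def fold_data(samples: List[Sample], majority_label: str, k: int,
              seed: int = 0) -> List[List[Sample]]:
    """Split the majority class into k partitions; pair each with all minority samples."""
    majority = [s for s in samples if s[1] == majority_label]
    minority = [s for s in samples if s[1] != majority_label]
    random.Random(seed).shuffle(majority)
    # Taking every k-th sample keeps partition sizes within one of each other.
    return [majority[i::k] + minority for i in range(k)]

def train_centroid_classifier(samples: List[Sample]) -> Classifier:
    """Stand-in learner: predict the label whose feature centroid is closest."""
    sums: Dict[str, List[float]] = {}
    counts: Counter = Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x: Sequence[float]) -> str:
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return predict

def majority_vote(classifiers: List[Classifier], x: Sequence[float]) -> Tuple[str, float]:
    """Winning label and the fraction of classifiers that voted for it."""
    label, n = Counter(clf(x) for clf in classifiers).most_common(1)[0]
    return label, n / len(classifiers)
```

With six stop sign samples and two no-through-traffic samples (the 3:1 ratio of the example) and `k=3`, each partition holds two samples of each type. Note that the returned vote fraction can only take the values 1/3, 2/3 or 1 with three classifiers, which is a coarse ordering rather than a calibrated confidence value.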
Unfortunately, the data folding technique does not permit classifier confidence values to be predicted, as a result of its reliance upon a majority vote amongst the classifiers. Instead, the individual confidence values can only be interpreted as an ordering within each classifier and do not represent a global confidence value. Moreover, a sign detection module that utilizes a deep network may take over a week to train on a few million images, and this training time may increase appreciably as a result of data folding. In this regard, training a plurality of deep networks simultaneously on the rebalanced data set, as required by the data folding technique, may take many weeks, which may render the data folding technique infeasible.
With respect to the data augmentation technique, random color jittering, weak affine transformations and left-right image flipping are applied to the images of the training data set to synthetically increase its size. In some instances, data augmentation yields a three-fold increase in the training data. However, the data augmentation technique generates the same appearance content as in the original data set. In this regard, the pixel-level jittering introduces local dissimilarities between images, but the overall content remains largely the same. Thus, the data augmentation technique is generally equivalent to adding noise to the underlying data distribution in an effort to synthetically prevent overfitting.
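The three operations named above can be sketched in plain Python, assuming an image is a list of rows of (r, g, b) values in 0..255. The jitter range (±10% per-channel gain), the affine limits (±5 degrees of rotation, ±5% scale), nearest-neighbour sampling and a 50% flip probability are illustrative assumptions; `copies=2` reproduces the three-fold increase mentioned in the text.

```python
import math
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (r, g, b) values in 0..255

def flip_left_right(img: Image) -> Image:
    """Mirror the image horizontally."""
    return [row[::-1] for row in img]

def color_jitter(img: Image, rng: random.Random, max_delta: float = 0.1) -> Image:
    """Scale each color channel by a random gain in [1 - max_delta, 1 + max_delta]."""
    gains = [1.0 + rng.uniform(-max_delta, max_delta) for _ in range(3)]
    return [[tuple(min(255, max(0, round(c * g))) for c, g in zip(px, gains))
             for px in row] for row in img]

def weak_affine(img: Image, rng: random.Random, max_angle_deg: float = 5.0,
                max_scale: float = 0.05, fill: Pixel = (0, 0, 0)) -> Image:
    """Small random rotation and scaling about the image centre."""
    h, w = len(img), len(img[0])
    theta = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cy, cx = (h - 1) / 2, (w - 1) / 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Inverse-map each output pixel into the source image (nearest neighbour).
            dx, dy = x - cx, y - cy
            ix = round((cos_t * dx + sin_t * dy) / s + cx)
            iy = round((-sin_t * dx + cos_t * dy) / s + cy)
            row.append(img[iy][ix] if 0 <= ix < w and 0 <= iy < h else fill)
        out.append(row)
    return out

def augment_dataset(images: List[Image], copies: int = 2, seed: int = 0) -> List[Image]:
    """Return the originals plus `copies` jittered, warped, maybe-flipped versions of each."""
    rng = random.Random(seed)
    out = list(images)
    for img in images:
        for _ in range(copies):
            aug = weak_affine(color_jitter(img, rng), rng)
            if rng.random() < 0.5:
                aug = flip_left_right(aug)
            out.append(aug)
    return out
```

Every augmented copy is derived from an existing image, so the set grows in size while the depicted road signs, and hence the class balance, stay the same, which is the limitation the text describes.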