As the number of vehicles on roadways increases each year, transportation agencies are turning to lane management to promote car-pooling and to reduce traffic congestion and air pollution. High Occupancy Vehicle (HOV) lanes are car-pool lanes that typically require two or more occupants per vehicle. Similarly, High Occupancy Tolling (HOT) lanes allow single-occupant vehicles to use an HOV lane upon payment of a toll, so that the full capacity of the lane is utilized. When the regulations are strictly enforced, HOV/HOT lanes are typically less congested because of the occupancy constraints. However, enforcement of the rules of these lanes is currently performed by roadside enforcement officers through visual observation, which is known to be inefficient, costly, and potentially dangerous.
HOV and HOT lanes are commonly used both to reduce traffic congestion and to promote car-pooling. With the prevalence of video cameras in transportation imaging applications, camera-based methods have recently been proposed as a cost-efficient, safe, and effective HOV/HOT lane enforcement strategy. An important step in automated lane enforcement systems is the classification of localized window/windshield images, distinguishing passenger from no-passenger vehicles in order to identify violators.
While existing imaging techniques focus on vehicle occupancy detection through face, empty-seat, or skin detection, more recent techniques use image classification approaches to account for occlusion, typically for images captured from the side view.
Following the localization of windshields/side windows in captured images, these methods classify the localized regions using local aggregation-based image features (e.g., Fisher vectors) to distinguish passenger from no-passenger vehicles and identify violators. While Fisher vector-based classification accuracy is generally greater than about 95% for front-view images, it decreases significantly to about 90% for side-view images, due to the larger within-class variation of side-view images compared to front-view images.
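As a concrete illustration of the local-aggregation step mentioned above, the sketch below computes a simplified Fisher vector (gradients with respect to the GMM means only) from a set of local descriptors. All shapes and parameters here are hypothetical stand-ins; a production encoder would also include gradients with respect to the mixture variances and would use a GMM vocabulary trained on real descriptors:

```python
import numpy as np

def fisher_vector(descriptors, means, sigmas, weights):
    """Simplified Fisher vector: gradients w.r.t. GMM means only.

    descriptors: (N, D) local features from a localized window image
    means, sigmas: (K, D) diagonal-GMM parameters; weights: (K,)
    Returns a (K*D,) normalized encoding. Minimal illustrative sketch.
    """
    # Per-descriptor, per-component Gaussian log-likelihoods
    diff = descriptors[:, None, :] - means[None, :, :]            # (N, K, D)
    log_p = -0.5 * np.sum((diff / sigmas) ** 2
                          + np.log(2 * np.pi * sigmas ** 2), axis=2)
    log_p += np.log(weights)
    # Soft assignments (posteriors) via a numerically stable softmax
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                     # (N, K)
    # Accumulated gradient with respect to the component means
    grad_mu = (gamma[:, :, None] * diff / sigmas).sum(axis=0)     # (K, D)
    grad_mu /= descriptors.shape[0] * np.sqrt(weights)[:, None]
    fv = grad_mu.ravel()
    # Power and L2 normalization, as is standard for Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The resulting fixed-length vector can then be fed to a linear classifier (e.g., an SVM) to separate passenger from no-passenger images.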
The use of deep convolutional neural networks (CNNs) has been shown to significantly outperform hand-crafted features in several classification tasks. However, training and/or fine-tuning such CNNs requires a set of passenger/no-passenger images manually labeled by an operator, which demands substantial time and effort and can result in excessive operational cost and overhead.
It is well known that the first one or more layers of many deep CNNs learn features similar to Gabor filters and color blobs that appear not to be specific to a particular dataset or task, but are instead applicable to many datasets and tasks. Features eventually transition from general to specific (i.e., task- or domain-specific) as the layers get deeper in the network. As such, the transferability of a network to a new domain is negatively affected by the specificity of higher-layer neurons to their original task, at the expense of performance on the target task.
The GoogLeNet proposed in [1] achieved the best results for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). In comparison to other convolutional neural network architectures (e.g., AlexNet, VGG-M, VGG-D, VGG-F, etc.), GoogLeNet utilizes a deep architecture, wherein the word “deep” refers both to the level of organization and, in the more direct sense, to increased network depth. However, one disadvantage of a GoogLeNet-based system is that features are extracted only at the deep layers of the network, thereby providing only high-level features.
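The shortcoming noted here motivates fusing features from multiple depths: giving the classifier both low-level (shallow-layer) and high-level (deep-layer) information instead of deep-layer activations alone. The numpy sketch below illustrates the idea with hypothetical stand-ins — simple gradient statistics play the role of shallow-layer responses, and ReLU responses of filters on a pooled image stand in for deep-layer activations:

```python
import numpy as np

def shallow_features(img):
    """Low-level, Gabor-like responses: mean horizontal/vertical gradients."""
    gx = np.abs(np.diff(img, axis=1)).mean()
    gy = np.abs(np.diff(img, axis=0)).mean()
    return np.array([gx, gy])

def deep_features(img, filters):
    """Stand-in for a deep layer: nonlinear filter responses on a
    spatially pooled (coarser, more abstract) version of the input."""
    h, w = img.shape
    pooled = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # 2x2 pooling
    return np.maximum(filters @ pooled.ravel(), 0.0)              # ReLU

def fused_descriptor(img, filters):
    """Fusion: concatenate shallow and deep features so a downstream
    classifier sees both low-level and high-level information."""
    return np.concatenate([shallow_features(img), deep_features(img, filters)])
```

In an actual network the same fusion would concatenate (suitably pooled) activations tapped from an early layer and a late layer before the final classifier.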
Most image classification methods require large collections of manually annotated training examples to learn accurate object classification models. The time-consuming human labeling effort effectively limits such approaches in many real-world applications, which require classification models developed in one situation to be quickly deployed to new environments. For example, for vehicle occupancy detection, a classifier (e.g., a support vector machine, a convolutional neural network (CNN), etc.) often must be retrained with images collected at each site to achieve the desired, consistent performance across different sites. However, such retraining typically requires thousands of images. The novel deep CNN fusion architecture of the present embodiments is designed to overcome this and other shortcomings, and to provide techniques for domain adaptation.
Domain adaptation of statistical classifiers addresses the problem that arises when the data distribution in the test (target) domain differs from the data distribution in the training (source) domain.
There is a need for enhanced CNN image classification systems that enjoy expedited training and tuning across a plurality of distinct domains and environments.
There is a need for a deep CNN fusion architecture that can be used for domain adaptation when a small set of labeled data is available in the target domain.