The following relates to camera-based survey systems to monitor road usage or the like, quality control inspection systems, and other computer vision tasks; and to image classification, processing and archiving arts, and related arts.
Cameras, both still and video, are increasingly being deployed for tasks such as: assessing road usage by vehicle type or by in-state versus out-of-state traffic; machine vision assembly line tasks such as quality control inspection/defect detection; document classification; and the like. For example, in road monitoring, vehicles are imaged and classified by vehicle type (e.g. commercial truck, passenger car, or so forth), by license plate type (e.g. in-state plate versus out-of-state plate), or so forth. In quality control inspection, a part moving along an assembly line may be imaged and classified as defective or not defective based on the image. In document classification, an incoming filled-out form may be imaged using a document scanner and the resulting page image classified as to the type of form.
In each of these applications, the image is classified using an empirically trained image classifier. This task is challenging because the image being classified may differ from the training images used to train the classifier due to differences in lighting, subject position or orientation, subject-to-camera distance, or so forth. Some of these differences can be compensated for prior to input to the image classifier, for example by applying a spatial registration algorithm or rescaling the image, but such operations can also undesirably distort the image.
A known image classification architecture is the bag-of-patches (BoP) pipeline. In this approach, local descriptors are extracted from an image and then encoded and aggregated to form an image feature vector. The feature vector is input to a kernel classifier such as a Support Vector Machine (SVM) classifier to generate the image classification. Encoding the features by computing higher-order statistics, such as the Fisher Vector (FV) encoding, has been found to provide good image classification performance in conjunction with a linear kernel classifier. The training phase is computationally efficient because the feature extraction and encoding are unsupervised. Only the kernel classifier is trained on a set of labeled training images, and this can be formulated as a convex optimization that is insensitive to the parameter initialization.
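By way of non-limiting illustration, the three stages of a BoP pipeline (local descriptor extraction, encoding and aggregation, and linear classification) may be sketched as follows. This is a deliberately simplified sketch and not the Fisher Vector encoding itself: raw flattened patches stand in for local descriptors, and per-dimension mean and variance stand in for the higher-order statistics. The function names, patch size, stride, weights and bias are all hypothetical; in practice the weights and bias would be learned by training a classifier such as an SVM on labeled images.

```python
from typing import List, Sequence

Image = List[List[float]]


def extract_patches(image: Image, size: int = 2, stride: int = 2) -> List[List[float]]:
    """Flatten each size x size patch into one local descriptor (assumed descriptor type)."""
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height - size + 1, stride):
        for c in range(0, width - size + 1, stride):
            patches.append(
                [image[r + i][c + j] for i in range(size) for j in range(size)]
            )
    return patches


def encode_and_aggregate(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate descriptors into per-dimension means followed by variances.

    A simplified stand-in for a higher-order encoding; no training is needed here.
    """
    n = len(descriptors)
    dim = len(descriptors[0])
    means = [sum(d[k] for d in descriptors) / n for k in range(dim)]
    variances = [
        sum((d[k] - means[k]) ** 2 for d in descriptors) / n for k in range(dim)
    ]
    return means + variances


def linear_score(feature: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear classifier decision value; its sign gives the predicted class."""
    return sum(f * w for f, w in zip(feature, weights)) + bias


if __name__ == "__main__":
    # Toy 4x4 image: dark left half, bright right half.
    toy_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    feature = encode_and_aggregate(extract_patches(toy_image))
    # Hypothetical hand-set weights; a trained classifier would supply these.
    print(linear_score(feature, [1.0] * len(feature), -1.0))
```

In this sketch only `linear_score` depends on learned parameters, which mirrors the observation above that the extraction and encoding stages are unsupervised.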
More recently, convolutional neural network (CNN) architectures have been shown to outperform BoP pipelines for image classification tasks. In the image classification context, these CNN image classifiers operate directly on the image, rather than on a feature vector extracted from the image. The neurons of the CNN are arranged to operate on overlapping spatial regions of the image, i.e. on overlapping CNN spatial receptive fields. CNNs are feed-forward architectures involving multiple computational layers that alternate linear operations such as convolutions or average-pooling and non-linear operations such as max-pooling and sigmoid activations. Advances in graphics processing unit (GPU) systems, developed primarily for use in video gaming, have helped drive the development of CNN image classifiers.
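By way of non-limiting illustration, the alternation of linear and non-linear operations in a CNN layer may be sketched as follows. The sketch assumes a single-channel image, a single hand-set kernel, stride-1 convolution without padding, and non-overlapping 2x2 max-pooling; the function names and the kernel values are hypothetical. A practical CNN would stack many such layers, with many kernels per layer, and would learn the kernel values from labeled training images.

```python
import math
from typing import List

Grid = List[List[float]]


def conv2d(image: Grid, kernel: Grid) -> Grid:
    """Linear step: stride-1 convolution without padding.

    Implemented as cross-correlation, as is customary in CNNs. Adjacent output
    values are computed from overlapping image regions (receptive fields).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def sigmoid_map(fmap: Grid) -> Grid:
    """Non-linear step: element-wise sigmoid activation."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in fmap]


def max_pool(fmap: Grid, size: int = 2) -> Grid:
    """Non-linear step: maximum over each non-overlapping size x size block."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def forward(image: Grid, kernel: Grid) -> Grid:
    """One feed-forward layer: convolution, then sigmoid, then max-pooling."""
    return max_pool(sigmoid_map(conv2d(image, kernel)))


if __name__ == "__main__":
    blank = [[0.0] * 5 for _ in range(5)]
    edge_kernel = [[1.0, -1.0], [1.0, -1.0]]  # hypothetical vertical-edge kernel
    print(forward(blank, edge_kernel))  # 5x5 input -> 4x4 conv output -> 2x2 pooled map
```

Unlike the BoP pipeline, every stage here operates on the image grid itself, so no separate feature vector is extracted before classification.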