For image-related services, client-server architectures are popularly used. For example, an image service can is hosted on a server to provide a client (e.g., a mobile device) access to a collection of images maintained by the server. The client can search, download, and use images available from the server.
Generally, existing image services extract features from images, where these features support various operations related to the images, some of which are made accessible to clients. Many of the existing image services implement a feature-learning framework for learning the features. Feature-learning involves using a feature-embedding function to generate compact and representative feature vectors. A feature vector includes values representing different visual content from an image. For example, in an image of a tree, a feature vector include numerical values representing a trunk, branch, or leaf of the tree. The feature vector is subsequently used for operations such as, for example, to classify the image. Hence, the image is classified in a “tree” category.
A commonly used feature-learning framework implements a convolutional neural network. The convolutional neural network includes multiple layers for processing images and, accordingly, generating feature vectors. The accuracy of learning features depends on how well the convolutional neural network is trained. Generally, training is driven by a training dataset and a training task. The training dataset is labeled. The training task is defined by the labels and by a type of cost function (e.g., a regression loss, a classification loss, or other loss-based functions).
Current solutions for training a convolutional network for a specific task involve a single training dataset having labels specific to that training dataset. Many reasons exist for this training approach. In one example, layers of the convolutional neural network are set to learn features according to the training task. In another example, the computational burden associated with the training may require this approach. Generally, the more advanced the training (e.g., the larger the training dataset), the higher the computational burden becomes.
Once a convolutional neural network is trained, the trained network can be used to learn features from an image. The feature learning can be highly accurate if the image falls within the boundary of the training (e.g., the labels used for the training can be properly applied to the image). Otherwise, the robustness of the convolutional neural network suffers. Specifically, in many real-world scenarios, a single training dataset and training task can be insufficient for learning robust and generalizable feature representation due to, for example, label noise, imbalance of label distribution, and shift in data distribution within the training dataset. The impact to the robustness and, thus, to the accuracy of learning feature can reduce the quality of service provided to clients.
To illustrate this impact to the robustness and accuracy, consider an example of image classification based on gender. In this illustrative example, known gender images are used to train a convolutional neural network for gender classification. Thus, this network can accurately generate features for classifying people based on gender. With respect to gender-based classification, an image service that relies on the convolutional neural network to classify images can perform well. However, using the convolutional neural network with features for other classifications (e.g., for age-based classification) may provide less accurate results. The quality of service can similarly degrade for the other classifications.