Machine learning is a subfield of computer science that gives computers the ability to recognize certain patterns in data without being explicitly programmed to do so. Machine learning algorithms typically operate by building a computational model that recognizes patterns in data, training the model with a set of example patterns collected from real-world efforts. In certain machine learning applications, however, it can be difficult or infeasible to collect a sufficient number and/or variety of high-quality real-world training examples to adequately train a machine learning system. In these situations, machine learning systems can be trained with synthetic, computer-generated examples. Synthetic examples are often inadequate, however: there is almost always a significant difference between an actual real-world data example and a synthetic computer-generated example, and that difference can be important for training a machine learning system.
Object recognition, as described in this invention, is the act of recognizing given objects in an image. Part of the object recognition task is classifying objects as particular types. Therefore, “classification,” as it relates to machine learning, is described in more detail below.
“Classification” is a task in the field of machine learning in which unknown or unclassified objects, or images of objects, are grouped into collections of known or classified objects or images of objects. The known collections are called “classes.” Each class is denoted by the term cn, where cn⊆C (C is the set of all classes cn), and each cn has a set of features fm, where fm⊆F. Given C, there should be enough distinction between c0, c1, . . . cn that a set of lines exists that can divide c0, c1, . . . cn from each other. This quality of distinction is called linear separability. Classes that are linearly separable are those that can be separated by straight lines. Classes that are not linearly separable are those that cannot be separated by straight lines; instead, such classes in C can be separated only by non-linear (e.g., curved) boundaries. FIG. 17 and FIG. 18 illustrate the difference between classes that are linearly separable (FIG. 17) and classes that are not linearly separable (FIG. 18). Both FIG. 17 and FIG. 18 are further described below.
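The distinction between linearly separable and non-separable classes can be made concrete with a small illustrative sketch (not part of the invention; all names here are hypothetical). A perceptron, a single-layer linear classifier, can only find straight-line boundaries, so it converges on a linearly separable problem such as logical AND but never on a non-separable one such as XOR:

```python
import numpy as np

def perceptron_separable(X, y, epochs=100):
    """Return True if a perceptron finds a separating line within `epochs`."""
    w = np.zeros(X.shape[1] + 1)              # weights plus a bias term
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append constant-1 bias input
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            pred = 1 if w @ xi > 0 else 0
            if pred != yi:
                w += (yi - pred) * xi          # classic perceptron update rule
                errors += 1
        if errors == 0:                        # every point classified: separable
            return True
    return False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])                 # AND: linearly separable
y_xor = np.array([0, 1, 1, 0])                 # XOR: not linearly separable

print(perceptron_separable(X, y_and))          # True
print(perceptron_separable(X, y_xor))          # False
```

The XOR classes correspond to the situation depicted in FIG. 18: no single straight line divides them, so only a non-linear boundary can.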
A common way to identify or classify data is to use a machine learning technique. In machine learning, computers are used to classify, predict, or infer new facts based on an internal representation. This representation is often built by training a machine learning algorithm on existing data that bears some similarity to the data that is unknown.
Training a classifier (e.g., a machine learning system that will classify or recognize objects) entails defining an appropriate machine learning algorithm, defining that algorithm's parameters, and establishing a data set that will best represent the space of objects to be classified. For example, if a goal is to classify types of pets, the training data should contain a sufficient number of examples of the pets that will need to be classified. A classifier that is missing examples of fish but has examples of dogs and cats may not be able to sufficiently classify fish.
Given a data set of images to be used for a task such as classification, the quality of the classification is a function of the quantity of each type of object in the image set used to train the image classifier.
The quality of the data is also important. For example, if the goal of the classifier is to classify images of pets, and the set of images contains samples for a particular type that are not clean, then the classifier may not accurately classify unseen pets of that type. An unclean image might include characteristics such as the following:
1. Noisy Background: lots of other information in the image that clutters the image;
2. Obfuscation: the actual object is obfuscated in some way (e.g., it is in a shadow);
3. Distortion: the actual object is distorted in some way; or
4. Bad Perspective: the perspective of the actual object is not consistent with the samples to be classified.
Therefore, given a data set of images to be used for a task such as classification, the quality of the classification is a function of both the quantity of each type of object and the overall quality of the image set used to train the image classifier.
Given a data set that is not sufficiently sized, or that contains training samples that underrepresent the actual objects that need to be classified, different measures may be taken to overcome these problems:
1. De-noise or declutter the images.
2. Extract just the main object from an image and create new images containing just that object, for example, a dog.
3. Create duplicates of images in the training data set that are considered ‘good’ representatives of the types of objects that will be classified.
4. Take these ‘good’ representatives and change them just enough to make them different from the originals.
5. Use data from other data sources to supplement the training data set.
6. Take images from a different data set that are perhaps similar in some way and make them look like the images in the training data set.
Exploring the use of ‘good’ candidate images and creating duplicates that are transformed enough to call them different can produce transformed images that can be used to supplement the data set. Transformations can include applying simple mathematical operations to the images, histogram modifications, interpolation, rotations, background removal, and more.
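As an illustrative sketch (not part of the invention), a handful of such transformations can be applied to a grayscale image array to produce transformed duplicates; the specific operations and parameter values below are hypothetical examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Yield simple transformed duplicates of a (H, W) grayscale image in [0, 1]."""
    yield np.rot90(image)                        # 90-degree rotation
    yield np.fliplr(image)                       # horizontal mirror
    yield np.clip(image * 1.2, 0.0, 1.0)         # brightness scaling
    noisy = image + rng.normal(0, 0.05, image.shape)
    yield np.clip(noisy, 0.0, 1.0)               # additive Gaussian noise

image = rng.random((8, 8))                       # stand-in for a training image
variants = list(augment(image))
print(len(variants))                             # 4 transformed duplicates
```

Each variant differs from the original just enough to count as a new training sample while preserving the depicted object.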
Such transformations can improve classification under certain circumstances but performance gains are often minimal.
More importantly, acquiring more data typically implies more cost, as both data acquisition and data preparation can be expensive. Synthetic approaches and data transformations also result in higher cost, usually with a lower payoff, as these methods are inferior to true data samples. Adapting other data sets to mimic the data required to train the machine learning method again implies high cost, as this process involves some mechanism for creating the synthetic examples.
A key ingredient for deep learning is the data. Deep learning algorithms tend to require more data than the average machine learning algorithm in order to learn data representations and features.
Machine Translation
Machine translation (“MT”) [see endnote 2] is part of computational linguistics, a subset of natural language processing, and involves a machine assisting with the translation of either text or speech from one language to another. MT can involve simple word substitution and phrase substitution. Typically, statistical methods are used to perform MT. In certain embodiments, we apply machine translation to images by means of deep model translation.
Autoencoders
An autoencoder [see endnote 3] is an unsupervised neural network that closely resembles a feedforward non-recurrent neural network with its output layer having the same number of nodes as the input layer. Within the autoencoder, the dimensionality of the data is reduced to a size much smaller than the original dimensions. This reduction is often called a “bottleneck.” The encoder flattens or compresses the data into this smaller bottleneck representation. The decoder then tries to recreate the original input from this compressed representation, producing a representation that is equal to the size of the input and similar to the original input. The better the performance of the autoencoder, the closer the recreated output is to the original input.
Formally, within an autoencoder, a function maps input data to a hidden representation using a non-linear activation function. This is known as the encoding:

z = ƒ(x) = sƒ(Wx + bz),

where the function ƒ maps input x to a hidden representation z, sƒ is a non-linear activation function, and W and bz represent the weights and bias of the encoder.
A second function may be used to map the hidden representation to a reconstruction of the expected output. This is known as the decoding:

y = g(z) = sg(W′z + by),

where g maps hidden representation z to a reconstruction y, sg is a non-linear activation function, and W′ and by represent the weights and bias of the decoder.
In order for the network to improve over time, it minimizes an objective function based on the negative log-likelihood:

AE(θ) = Σx∈Dn L(x, g(ƒ(x))),

where L is the negative log-likelihood, Dn is the training data set, and x is the input.
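The encoding, decoding, and objective above can be sketched directly in code. The following is a minimal illustration, not part of the invention: the weights are randomly initialized rather than trained, the sigmoid is used for both sƒ and sg, and L is the binary cross-entropy form of the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 16, 4                        # bottleneck: 16 -> 4 -> 16

# Encoder parameters (W, bz) and decoder parameters (W', by)
W  = rng.normal(0, 0.1, (n_hidden, n_in))
bz = np.zeros(n_hidden)
Wp = rng.normal(0, 0.1, (n_in, n_hidden))     # W' in the text
by = np.zeros(n_in)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))  # non-linear activation sf = sg

def f(x):                                     # encoding: z = sf(W x + bz)
    return sigmoid(W @ x + bz)

def g(z):                                     # decoding: y = sg(W' z + by)
    return sigmoid(Wp @ z + by)

def nll(x, y):                                # negative log-likelihood L(x, y)
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

Dn = rng.random((5, n_in))                    # toy data set of 5 inputs in [0, 1)
objective = sum(nll(x, g(f(x))) for x in Dn)  # AE objective before any training
print(objective > 0)                          # True
```

Training would repeatedly adjust W, bz, W′, and by (e.g., by gradient descent) to drive this objective down, pushing each reconstruction g(ƒ(x)) closer to its input x.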
There are different variants of autoencoders, from fully connected to convolutional. With fully connected autoencoders, each neuron in a particular layer is connected to every neuron in the previous layer. (A “neuron” in an artificial neural network is a mathematical approximation of a biological neuron: it receives a vector of inputs, performs a transformation on them, and outputs a single scalar value.) With convolutional layers, the connectivity of each neuron is localized to a few nearby neurons in the previous layer. For image-based tasks, convolutional autoencoders are the standard. In embodiments of this invention, when autoencoders are referenced, it is implied that the convolutional variant may be used.
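The practical difference between the two connectivity patterns can be shown with a simple parameter count. The figures below are illustrative assumptions (a 28×28 layer and a 3×3 convolutional kernel), not values from the invention:

```python
# One layer mapping a 28x28 input to a 28x28 output.
n = 28 * 28

# Fully connected: every output neuron connects to every input neuron,
# so the layer needs n * n independent weights.
fc_weights = n * n

# Convolutional: every output neuron connects only to a local 3x3 patch,
# and those 9 weights are shared across all spatial positions.
conv_weights = 3 * 3

print(fc_weights)    # 614656
print(conv_weights)  # 9
```

This locality and weight sharing is why convolutional variants scale to image-based tasks where fully connected layers become impractical.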
Generative Adversarial Networks (“GANs”)
A generative adversarial network (“GAN”) [see endnote 1] is a network made of two deep networks. The two networks can be fully connected where each neuron in layer l is connected to every neuron in layer l−1, or can include convolutional layers, where each neuron in layer l is connected to a few neurons in layer l−1. The GANs used in embodiments of the invention encompass a combination of fully connected layers and convolutional layers. One of the networks is typically called the discriminative network and the other is typically called the generative network. The discriminative network has knowledge of the training examples. The generative network does not, and tries to ‘generate new samples,’ typically beginning from noise. The generated samples are fed to the discriminative network for evaluation. The discriminative network provides an error measure to the generative network to convey how ‘good’ or ‘bad’ the generated samples are, as they relate to the data distribution generated from the training set.
Formally, a generative adversarial network defines a generative model G and a discriminative model D. Model G takes random noise, defined by z, as input and produces a generated sample ĥ. Model D distinguishes between generated samples ĥ from G and samples h from its own data distribution: the input received by D can be either h or ĥ, and D produces a probability indicating whether the sample fits into the data distribution or not.
Variations of the following objective function are used to train both types of networks:
minG maxD 𝔼h∼pData(h)[log D(h)] + 𝔼z∼pNoise(z) log(1 − D(G(z)))
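Evaluating this objective for fixed networks can be sketched as follows. This is a minimal illustration, not part of the invention: D and G are hypothetical untrained single-layer models, and the sample distributions are toy stand-ins for pData and pNoise.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy one-layer discriminator D and generator G (illustrative, untrained)
wd = rng.normal(0, 0.5, 3)                  # discriminator weights
wg = rng.normal(0, 0.5, (3, 2))             # generator weights

def D(h):
    """Probability that sample h comes from the data distribution."""
    return sigmoid(wd @ h)

def G(z):
    """Map 2-d noise z to a 3-d generated sample."""
    return np.tanh(wg @ z)

real  = rng.normal(1.0, 0.2, (64, 3))       # samples h ~ pData(h)
noise = rng.normal(0.0, 1.0, (64, 2))       # samples z ~ pNoise(z)

# V(D, G) = E[log D(h)] + E[log(1 - D(G(z)))], estimated over the samples
value = (np.mean([np.log(D(h)) for h in real])
         + np.mean([np.log(1 - D(G(z))) for z in noise]))
print(np.isfinite(value))                   # True
```

Training alternates between the two models: D is updated to maximize this value (scoring real samples high and generated ones low), while G is updated to minimize it (making its samples harder for D to reject).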