This specification relates generally to machine learning and more specifically to systems and methods for training mappings, such as neural networks, via optimization of an indirect encoding to minimize an error metric or maximize an objective function.
Many problems in machine learning, statistics, data science, pattern recognition, and artificial intelligence involve the representation and learning of mappings. Examples of such mappings include: pixel to pixel mappings for image denoising; image to object label mappings for image recognition; language to language word and/or sentence mappings for language translation; mappings of camera inputs to steering direction for self-driving vehicles; and mappings of states of a game to actions required to win the game.
A mapping may be expressed as follows, using the notation x for exemplary inputs to a mapping, y for the outputs of the mapping, and f for the mapping itself: y = f(x)
Mappings can be deterministic or probabilistic, where the latter means that for any given inputs, a mapping may map to several possible output values with corresponding probabilities. A special case of deterministic mappings is functions, which map from single input values to single output values.
Mappings employed in fields such as machine learning, statistics, data science, pattern recognition and artificial intelligence are often defined in terms of a collection of mapping parameters or weights w: y = f(x, w).
The relationship between a mapping and its given parameters is often (but not always) smooth, such that small changes to the parameters result in a correspondingly small change in the mapping. As a simple example, if the mapping is linear then the parameters may correspond to the slope and intercept parameters of the line. Such parametric mappings of inputs to outputs are especially useful for unknown mappings, which must be learned from example data.
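As a minimal sketch of the linear example above, the following Python shows a parametric mapping y = f(x, w) whose two parameters are a slope and an intercept, and checks that a small change in the parameters produces a small change in the output. The function name `linear_mapping` and all numeric values are illustrative assumptions and do not appear in this specification.

```python
def linear_mapping(x, w):
    """Parametric mapping y = f(x, w) with w = (slope, intercept)."""
    slope, intercept = w
    return slope * x + intercept


# Hypothetical parameter values chosen only for illustration.
w = (2.0, 1.0)
w_perturbed = (2.0 + 1e-3, 1.0)   # slope nudged by a small amount

y = linear_mapping(3.0, w)
y_perturbed = linear_mapping(3.0, w_perturbed)

# Smoothness: a small parameter change produces a small output change.
change = abs(y_perturbed - y)
```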
Neural networks are often employed in industry and academia for the representation and learning of mappings from input data. See, e.g., Rumelhart et al., Learning representations by back-propagating errors, Nature, 323:533-536, 1986 (incorporated by reference herein in its entirety). For example, neural networks are widely used in areas such as: data classification, pattern recognition, segmentation of images to objects, image denoising, motion planning, self-driving vehicles and automated game play, to name a few.
An exemplary neural network mapping is illustrated in FIG. 1. The exemplary neural network 100 may be employed for learning and representing a mapping of a two-dimensional (“2D”) digitally-encoded array of inputs 110 having P×P dimension to a 2D digitally-encoded array of outputs 150 having dimensions Q×Q. The input 110 may be, for example, a pixelated image with P×P pixels, and the output 150 may similarly be a pixelated image with Q×Q pixels. Accordingly, such a neural network may map images to images, and may be useful for image segmentation, detection, classification, de-noising, super-resolution, normalization, and other tasks. Although a 2D input and output array are shown, it will be recognized that neural networks may be employed to map higher dimensional arrays (e.g., voxels in a 3D scan) or lower dimensional arrays (e.g., a 1D array representing possible labels of an image or possible words from a vocabulary of words).
As shown, the mapping 100 comprises three fully-connected layers, including an input layer 110, a hidden layer 130 and an output layer 150. As shown, the input layer 110 is formed by a plurality of input nodes or units 111. For example, the input layer may comprise an array storing certain properties or characteristics of underlying data (e.g., pixel information of a digital image).
The hidden layer 130 is also formed by a plurality of processing units 131. The hidden layer 130 comprises weighted connections (i.e., mapping parameters) 121 from the input layer. For example, mapping parameter 121 connects input layer unit 111 to hidden layer unit 131. When processing the weighted input information from the input layer 110, each unit in the hidden layer 130 processes the data it receives and presents the result to each of the units in the next layer (here, the output layer 150).
The neural network 100 further comprises an output layer 150 that is similarly formed by one or more output nodes 151. The output layer 150 is connected to the hidden layer 130 via a number of mapping parameters 141 from the hidden layer. For example, mapping parameter 141 connects hidden layer unit 131 to output layer unit 151. Typically, this type of architecture will have a separate, independently-tunable mapping parameter (e.g., 121, 141) connecting each unit in one layer to each unit in the subsequent layer.
The mapping of a neural network having a number of layers L can generally be expressed as:
\[ y = f(x) = f_L(f_{L-1}(\dots(f_1(x)))) \]
where f_l denotes the mapping computed by a given layer l.
The L layer network may be expressed as follows, where the input x to the mapping f is a vector of inputs, with x_i denoting a given element i of the vector; h^(l) represents hidden vectors in a given layer l of the mapping; the output y is a vector of outputs of the mapping, with y_j representing a given element j of the vector of outputs; and σ_1, . . . , σ_l, . . . , σ_L represent the transformations in each layer.
\[ h_r^{(1)} = \sigma_1\Big(\sum_i w_{ri}^{(1)} x_i\Big) \]
\[ h_k^{(l)} = \sigma_l\Big(\sum_r w_{kr}^{(l)} h_r^{(l-1)}\Big) \]
\[ y_j = \sigma_L\Big(\sum_k w_{jk}^{(L)} h_k^{(L-1)}\Big) \]
There are many known transformation or activation functions, such as sigmoid logistic, hyperbolic tangent (tanh), rectified linear (ReLU), and other linear or nonlinear functions.
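The layer-by-layer computation above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the choice of ReLU and sigmoid activations, and the small 2-3-1 network with its weight values are assumptions made for the example rather than details taken from this specification.

```python
import math


def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))


def relu(a):
    """Rectified linear activation."""
    return max(0.0, a)


def layer(weights, inputs, activation):
    """One fully-connected layer: out_k = activation(sum_r weights[k][r] * inputs[r])."""
    return [activation(sum(w_kr * h_r for w_kr, h_r in zip(row, inputs)))
            for row in weights]


def forward(x, weight_matrices, activations):
    """Compose the layers: y = f_L(f_{L-1}(...f_1(x)))."""
    h = x
    for w_l, sigma_l in zip(weight_matrices, activations):
        h = layer(w_l, h, sigma_l)
    return h


# Hypothetical network with 2 inputs, 3 hidden units and 1 output (L = 2).
w1 = [[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]]   # w^(1): 3 x 2
w2 = [[1.0, -1.0, 2.0]]                       # w^(2): 1 x 3
y = forward([1.0, 2.0], [w1, w2], [relu, sigmoid])
```

In this toy network the third hidden unit has all-zero incoming weights, so the inputs have no effect on it, and the first hidden unit is switched off by the ReLU; only the second hidden unit contributes to the output.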
The mapping parameters w correspond to the collection of mapping parameters {w^(1), . . . , w^(L)} defining the neural network, each being the matrix of mapping parameters for one layer. The parameters are usually represented by numbers specifying the strength of connections between the units in the network. For example, a parameter w_ij having a value of zero represents that the unit j has no effect on unit i, while a large positive or negative parameter value may represent large effects of units on other units.
The “learning” or “training” of a neural network refers to altering or changing the parameters in the network, typically with the goal of improving the overall performance of the network. The problem of learning a neural network (i.e., determining the specific parameters to be used) is an example of the more general problem of learning a mapping from data. Given a training data set D comprising a number N of examples of pairs of input and corresponding output observations (i.e., D = {(x_1, y_1), . . . , (x_N, y_N)}), the goal may be to learn a mapping that approximates the mapping on the training set and, importantly, also generalizes and/or extrapolates well to unseen test data D_test drawn from the same probability distribution as the pairs in the training data set D.
To learn such a mapping, an error function E is typically defined, where the error function may measure the positive utility (in the case of an objective function) or the negative utility (in the case of a loss function) of a mapping that provides an output y′ from input x when the desired output is y: l(y, y′)
When the error function E is a loss function, the error on a given training data set may be defined for a mapping f(x,w) as the sum of the losses (i.e., empirical loss), as shown in Equation (1) below.
\[ E(w, D) = \sum_{n=1}^{N} l(y_n, f(x_n, w)) \qquad (1) \]
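Equation (1) can be sketched in a few lines of Python. The squared-difference loss, the linear mapping, and the three-point data set below are illustrative assumptions; the specification does not prescribe a particular loss l or mapping f.

```python
def squared_loss(y_true, y_pred):
    """Example loss l(y, y'): squared difference (one common choice, assumed here)."""
    return (y_true - y_pred) ** 2


def empirical_loss(w, data, f, loss=squared_loss):
    """E(w, D) = sum over n of l(y_n, f(x_n, w)), as in Equation (1)."""
    return sum(loss(y_n, f(x_n, w)) for x_n, y_n in data)


def linear(x, w):
    """Hypothetical mapping f(x, w) with w = (slope, intercept)."""
    return w[0] * x + w[1]


# Toy training set D = {(x_n, y_n)} lying exactly on the line y = 2x + 1.
D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

E_exact = empirical_loss((2.0, 1.0), D, linear)  # parameters that fit D exactly
E_off = empirical_loss((1.0, 1.0), D, linear)    # wrong slope gives a larger error
```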
Such an error function can be minimized, for example, by starting from some initial parameter values w_0 and then taking partial derivatives of E(w,D) with respect to the parameters w and adjusting w in the direction given by these derivatives (e.g., according to the steepest descent optimization algorithm shown in Equation (2), below).
\[ w_t \leftarrow w_{t-1} - \eta_t \left. \frac{\partial E(w, D)}{\partial w} \right|_{w_{t-1}} \qquad (2) \]
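The update in Equation (2) can be illustrated with a small Python sketch that fits a line to three points. The squared-error loss, the constant step size (η_t = `eta` for every t), the number of steps, and the toy data are all assumptions chosen so that the example converges; they are not values given in this specification.

```python
def gradient(w, data):
    """Partial derivatives of E(w, D) = sum_n (y_n - (w[0] * x_n + w[1]))**2."""
    g_slope = sum(-2.0 * x * (y - (w[0] * x + w[1])) for x, y in data)
    g_intercept = sum(-2.0 * (y - (w[0] * x + w[1])) for x, y in data)
    return (g_slope, g_intercept)


def steepest_descent(w0, data, eta=0.05, steps=2000):
    """Repeat w_t <- w_{t-1} - eta * dE/dw, with the derivative taken at w_{t-1}."""
    w = w0
    for _ in range(steps):
        g = gradient(w, data)
        w = (w[0] - eta * g[0], w[1] - eta * g[1])
    return w


# Toy data lying on y = 2x + 1; the initial parameter values w_0 are (0, 0).
D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w_final = steepest_descent((0.0, 0.0), D)
```

With these settings `w_final` approaches the slope 2 and intercept 1 that generated the data. A step size that is too large for the problem would make the same loop diverge, which is one motivation for the adaptive step sizes mentioned in this section.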
Many variations on this error function E are possible, including versions that include regularization terms that prevent overfitting to the training data, versions of E derived from likelihoods or posteriors of probabilistic models, versions of E that are based on sub-sampling very large data sets (i.e., for applications where the number of data points N is large), or other approximations to the loss function of interest (so-called “surrogate loss functions”).
The common components of the above framework are as follows: the goal is to learn a mapping f; the mapping is parameterized by a number of mapping parameters w; learning is based on example data D; and learning proceeds by optimizing an error function E using an optimization algorithm, which usually employs information about how E changes as a function of w. Any number of optimization algorithms may be employed, including, for example, stochastic gradients, adaptive step sizes η_t, second-order derivatives, approximations thereof, and/or combinations thereof. It will be appreciated that the above framework may be employed to optimize error functions whether such optimization requires minimizing an error value corresponding to a loss function or maximizing an error value corresponding to an objective function. These problems are considered to be equivalent, since maximizing an objective function is the same as minimizing its negative.
One problem associated with training mappings is that the number of independently tunable parameters grows exponentially with the number of dimensions of the input and output arrays. For example, the number of parameters in a fully-connected mapping between a 2D input array and a 2D output array, each having K elements per dimension, grows as K^4, because each of the K^2 input elements is connected to each of the K^2 output elements. Unfortunately, mappings with even modest numbers of parameters quickly become impractical for pattern recognition and other problems due to exceedingly long training times, high memory requirements, and sample complexity.
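The growth described above can be checked with a short calculation. The sketch below counts the weights of a single fully-connected layer between two arrays of equal size; the function name and the sample sizes are illustrative assumptions.

```python
def dense_parameter_count(k, dims=2):
    """Weights in one fully-connected layer mapping a dims-dimensional input
    array to a dims-dimensional output array, each with k elements per dimension."""
    n_inputs = k ** dims
    n_outputs = k ** dims
    return n_inputs * n_outputs


# 2D arrays: the count is k**4.
counts_2d = {k: dense_parameter_count(k) for k in (8, 32, 256)}

# 3D arrays (e.g., voxels): the count is k**6.
count_3d = dense_parameter_count(32, dims=3)
```

Under these assumptions a 256×256 image mapped to a 256×256 image already requires more than four billion independent weights in a single layer.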
Convolutional networks (or convolutional layers) represent one partial solution to this problem. Generally, such networks comprise convolutional layers having a number of nodes that produce an activation by convolving received inputs in accordance with a set of parameters for each unit. Unlike the fully-connected layers of the exemplary neural network described above, each unit in a convolutional layer receives an input from only a portion of the units in the preceding layer. In addition, units in each layer are typically configured to share the same parameters. See, e.g., LeCun et al., Backpropagation Applied To Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989 (incorporated by reference herein in its entirety).
By tying mapping parameters together according to translationally invariant convolutional operators, convolutional neural networks offer some reduction in the number of independently-adjustable parameters. Unfortunately, convolutional layers are inflexible: they allow only one particular way of tying mapping parameters, namely one that represents invariance to translation. Convolutional layers are thus not an efficient way of encoding many learning problems.
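The parameter tying described above can be illustrated with a one-dimensional sketch in Python. The kernel values, the input signal, and the function name `conv1d` are illustrative assumptions, and the layer is written without an activation function to keep the weight sharing visible.

```python
def conv1d(inputs, kernel):
    """'Valid' 1D convolutional layer (cross-correlation form, no activation):
    every output unit applies the same kernel weights to a local window."""
    k = len(kernel)
    return [sum(kernel[j] * inputs[i + j] for j in range(k))
            for i in range(len(inputs) - k + 1)]


kernel = [1.0, 0.0, -1.0]                       # 3 shared parameters
signal = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 3.0]
out = conv1d(signal, kernel)

# Shifting the input by one position shifts the output by one position.
shifted = conv1d([0.0] + signal, kernel)

n_shared = len(kernel)                # parameters in the convolutional layer
n_dense = len(signal) * len(out)      # a fully-connected layer of the same shape
```

Here 3 shared weights replace the 35 independent weights of a fully-connected layer with the same input and output sizes, but the only structure the tying can express is that the same pattern is applied at every position.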
Another partial solution to this problem is the indirect encoding of a mapping. As discussed above, a mapping f generally processes some input data x according to y=f(x,w). In the case of indirect encodings, a second mapping g (i.e., an “indirect encoding”) is employed to determine the mapping parameters w of the first mapping.
If the mapping parameters w are themselves a function of some intrinsic dimensions of the mapping z, such parameters may be represented as w = g(z, v), where the indirect encoding g itself comprises one or more parameters v (i.e., “indirect encoding parameters”). The mapping parameters w are said to be indirectly encoded, because they are functions of the indirect encoding (i.e., they are not independently adjustable).
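A minimal Python sketch of such an indirect encoding is given below. It is hypothetical in every detail: the intrinsic dimensions z are taken to be the index pair (i, j) of the two units a weight connects, the encoding g is given a Gaussian form, and the two indirect encoding parameters are named `amplitude` and `width`. None of these choices comes from this specification; they only illustrate how a large set of mapping parameters can be generated from a small set of indirect encoding parameters.

```python
import math


def g(z, v):
    """Hypothetical indirect encoding w = g(z, v).

    z = (i, j) are the coordinates of the two units a weight connects and
    v = (amplitude, width) are the indirect encoding parameters. The Gaussian
    form is an illustrative assumption."""
    i, j = z
    amplitude, width = v
    return amplitude * math.exp(-((i - j) ** 2) / (2.0 * width ** 2))


def build_weights(n_out, n_in, v):
    """Generate the full weight matrix from the indirect encoding."""
    return [[g((i, j), v) for j in range(n_in)] for i in range(n_out)]


v = (1.0, 2.0)                      # 2 adjustable indirect encoding parameters
w = build_weights(16, 16, v)        # 256 mapping parameters, none independent
n_mapping_params = sum(len(row) for row in w)
```

Changing either entry of `v` changes all 256 weights at once, which is what it means for the mapping parameters not to be independently adjustable.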
As with the conventional mappings discussed above, an indirectly encoded mapping may be learned by defining an error representing the discrepancy between an output of the mapping for a given input and a correct or expected output (see Equation (1), above). The ideal mapping parameters w may then be learned by optimizing the error (e.g., according to Equation (2), above).
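The following worked equations sketch how Equations (1) and (2) could be rewritten in terms of the indirect encoding parameters rather than the mapping parameters. This is an illustrative derivation only: it assumes that f and g are differentiable, writes the indirect encoding parameters as v, and uses z_{ij} to denote the intrinsic coordinates associated with mapping parameter w_{ij}. These assumptions and symbols are introduced for the example and are not stated above.

```latex
% Error expressed through the indirect encoding w = g(z, v)
E(v, D) = \sum_{n=1}^{N} l\big(y_n, f(x_n, g(z, v))\big)

% Chain rule: each mapping parameter w_{ij} = g(z_{ij}, v) contributes a term
\frac{\partial E}{\partial v}
  = \sum_{i,j} \frac{\partial E}{\partial w_{ij}} \,
               \frac{\partial g(z_{ij}, v)}{\partial v}

% Steepest descent on v, in the form of Equation (2)
v_t \leftarrow v_{t-1} - \eta_t
  \left. \frac{\partial E}{\partial v} \right|_{v_{t-1}}
```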
One exemplary indirect encoding technique is Hypercube-based NeuroEvolution of Augmenting Topologies (“HyperNEAT”), an evolutionary algorithm for evolving large-scale neural networks using indirect encodings called Compositional Pattern Producing Networks (“CPPNs”). See, e.g., Stanley et al., Compositional Pattern Producing Networks: A Novel Abstraction of Development, Genetic Programming and Evolvable Machines, 8(2): 131-162, 2007 (incorporated by reference herein in its entirety); and Stanley et al., A Hypercube-Based Indirect Encoding For Evolving Large-Scale Neural Networks, Artificial Life, 15(2): 185-212, 2009 (incorporated by reference herein in its entirety). In such indirect encodings, the CPPN generally consists of a small number of different types of neurons encoding different patterns of weights in the neural network. The types and weights of the CPPN neurons are learned using evolutionary algorithms guided by a fitness function.
Although Stanley's evolutionary algorithms allow for a reduction in independently tunable mapping parameters, such algorithms have proved exceedingly inefficient to train. Accordingly, they are impractical for real world applications that involve large data sets and/or complex mappings.
Recently, an extension has been developed for the HyperNEAT neural network, where the neural network and the CPPN are separately optimized using backpropagation. See, e.g., Doolan, Using Indirect Encoding in Neural Networks to Solve Supervised Machine Learning Tasks, Master's thesis, University of Amsterdam, The Netherlands, 2015 (incorporated by reference herein in its entirety). Doolan's method alternates between learning the CPPN and the neural network, but fails to achieve any significant improvements over the state of the art. Indeed, such methods are unacceptable for real world applications, as the proposed alternating procedure for optimization of the HyperNEAT error function is not guaranteed to converge.
With the recent advent of large data sets (“big data”), more efficient learning techniques are required. It would be beneficial if indirect encodings could be efficiently trained to encode complex mappings, as such encodings may significantly reduce independently adjustable mapping parameters by taking advantage of, and accurately representing, any geometric structure and/or regularity of encoded mappings.