Contemporary machine learning paradigms, such as most deep learning architectures and their governing supervised learning algorithms, consist of a single system that includes a feature representation component with a classifier or regression layers that map features to a desired output space. A convolutional neural network (CNN), for example, hosts multiple layers of convolutional filters, pooling and non-linearities in its lower layers, on top of which a multi-layer perceptron is commonly appended, mapping the top layer features extracted by the convolutional layers to decisions (e.g. classification outputs). Training is achieved in a supervised learning manner, which involves feeding a training dataset consisting of labeled inputs through the network, observing its outputs, defining an error (by measuring the difference between the outputs and the label values), and using techniques such as deep gradient descent and backpropagation to tune the weights of the network across all its layers and nodes such that the error is minimized. In many applications, repeating this process across the many labeled inputs in the training dataset yields a network that can produce correct output when presented with inputs that are different than the ones present in the training dataset. In high-dimensional settings, such as large images, this generalization is achieved only if a sufficiently large and diverse training dataset is made available. In the absence of sufficient number of labeled examples in the training dataset, or lack of diverse examples, the network accuracy often degrades below a desired level of performance.
There are several key limitations to supervised learning, particularly when applied to large parametric models (such as contemporary deep learning architectures). The first is commonly referred to as catastrophic forgetting—a problem that affects artificial neural networks as well as many other supervised learning systems. When trained with novel patterns, neural networks tend to quickly unlearn (or forget) prior representations and their mapping to outputs. Traditionally, neural network training is structured in a way that helps mitigate this effect. When learning in stationary settings, training data is generally shuffled and presented in a random order. The data presentation is stationary in that samples are drawn in an independent and identically distributed manner. If the data is presented in a manner that is non-stationary, the network may not be able to capture and retain representations that are presented at different time intervals. Traditionally, dynamic environments (environments in which labels change and/or new labels are added) have been recognized as tasks that are challenging for neural networks. If the task or environment changes (e.g., new labels are added), a neural network tends to catastrophically forget the previously learned task or environment setting (how to classify inputs using the original labels), as it learns how to classify inputs pertaining to the new labels. In other words, the network weights (parameters) of a machine learning system tune to accommodate the recent statistics of the input data, thereby “forgetting” prior representations.
Catastrophic forgetting renders lifelong learning unachievable for conventional machine learning architectures. Accordingly, to date machine learning systems have been unable to progressively learn new representations while not unlearning prior ones in a principled scalable manner.