Broadly speaking, reinforcement learning differs from supervised learning in that correct input-output pairs are not presented; instead a machine (software agent) learns to take actions in some environment to maximise some form of reward or minimise a cost. Taking an action moves the environment/system from one state to another. In the particular case of Q-learning, the Quality of a state-action combination is calculated; this describes an action-value function which can be used to determine the expected utility of an action. The Q-learning algorithm is described in “Q-learning”, Machine Learning, vol 8, pages 279-292, 1992, Watkins, Christopher J C H and Dayan, Peter, and is conveniently summarised, for example, on Wikipedia™.
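By way of illustration only, the tabular form of the Q-learning update may be sketched as follows. The function name `q_learning_update`, the state and action labels, and the values chosen for the learning rate `alpha` and discount factor `gamma` are assumptions made for this example and do not come from the cited reference.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to estimated action values.
    alpha (learning rate) and gamma (discount factor) are illustrative values.
    """
    # Value of the best action available from the resulting state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Target: immediate reward plus discounted best future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action problem; unseen pairs default to a value of 0.0.
q = defaultdict(float)
actions = ["left", "right"]
first = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
second = q_learning_update(q, "s0", "right", 1.0, "s1", actions)
```

Here the quality of the pair ("s0", "right") moves from 0.0 to 0.5 and then to 0.75, approaching the reward of 1.0 as the same transition is observed repeatedly.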
Nonetheless, learning to control software agents directly from high-dimensional sensory inputs such as vision and speech is one of the long-standing challenges of reinforcement learning (RL). Most successful RL applications that operate in these domains have relied on hand-crafted features combined with linear policy functions, and the performance of such systems relies heavily on the quality of the feature representation. On the other hand, learning representations of sensory data has been the focus of deep learning methods, most of which have relied on large supervised training sets applied to deep convolutional neural networks.
Perhaps the best-known success of reinforcement learning using a neural network is TD-Gammon (“Temporal difference learning and TD-Gammon”, Communications of the ACM, vol 38(3), pages 58-68, Tesauro, Gerald). This was a backgammon-playing program which learnt by reinforcement learning and self-play and achieved a super-human level of play. However, this approach employed human-engineered features and a state value function independent of actions (a total score), rather than an action-value function. Moreover, it did not accept a visual input.
Early attempts to follow up on TD-Gammon were relatively unsuccessful: the method did not work well for chess, Go and checkers. This led to a widespread belief that TD-Gammon was a special case, and that the neural network could only approximate the value function in backgammon because that function is very smooth, due to stochasticity in the dice rolls.
It was also shown that combining model-free reinforcement learning algorithms such as Q-learning with non-linear function approximators such as neural networks could cause the Q-network to diverge. Thus subsequent work focussed on linear function approximators with better convergence guarantees. In addition to concerns about convergence, it is also unclear whether the training signal provided by reinforcement learning is sufficient for training large neural networks. Whereas many successful applications of convolutional neural networks benefit from using a large set of labelled training examples (supervised learning), the reward signal provided by RL is often delayed, sparse and noisy.
There has, nonetheless, been an attempt to use a multilayer perceptron to approximate a Q-value function, in “Neural fitted Q iteration—first experiences with a data efficient neural reinforcement learning method”, Machine Learning: ECML 2005, Springer 2005, pages 317-328, Riedmiller, Martin. The technique described there is based on the principle of storing and reusing transition experiences but has some significant practical disadvantages. Broadly speaking, a neural network is trained based on the stored experience, and when the experience is updated with a new (initial state, action, resulting state) triple the previous neural network is discarded and an entirely new neural network is trained on the updated experience. This is because otherwise the unsupervised training could easily result in divergent behaviour. However, a consequence is that there is a computational cost per update that is proportional to the size of the data set, which makes it impractical to scale this approach to large data sets. The same approach has been applied to visual input preprocessed by an autoencoder, but this suffers from substantially the same problem (“Deep Auto-Encoder Neural Networks in Reinforcement Learning”, Sascha Lange and Martin Riedmiller).
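The scaling problem of such a discard-and-retrain scheme may be illustrated by the following minimal sketch. It is not the method of the cited papers: a lookup table stands in for the multilayer perceptron, and the function name `fit_from_scratch`, the toy transitions and all parameter values are assumptions made for this example. Each stored transition here also carries a reward alongside the state, action and resulting state.

```python
def fit_from_scratch(transitions, actions, gamma=0.9, alpha=0.5, sweeps=1):
    """Build a brand-new Q estimate from every stored transition.

    A dict stands in for the neural network (an assumption for brevity).
    Returns the new estimate and the number of samples processed.
    """
    q = {}  # nothing is carried over from any previous fit
    samples_processed = 0
    for _ in range(sweeps):
        for state, action, reward, next_state in transitions:
            best_next = max(q.get((next_state, a), 0.0) for a in actions)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            samples_processed += 1
    return q, samples_processed


# Hypothetical experience: one new transition arrives per step, and the
# estimate is rebuilt over the whole stored data set each time.
actions = ["go"]
experience = []
cost_per_update = []
for t in range(4):
    experience.append((t, "go", 1.0, t + 1))
    q, samples = fit_from_scratch(experience, actions)
    cost_per_update.append(samples)
```

In this sketch `cost_per_update` is `[1, 2, 3, 4]`: the work done for each new transition grows with the amount of stored experience, which is the property that makes such a scheme difficult to scale to large data sets.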
There is therefore a need for improved techniques for reinforcement learning, in particular when neural networks are employed.