This specification relates to training a reinforcement learning system.
Reinforcement learning agents interact with an environment by receiving an observation that characterizes the current state of the environment, and in response, performing an action from a predetermined set of actions. Reinforcement learning agents generally receive rewards in response to performing the actions and select the action to be performed in response to receiving a given observation in accordance with an output of a value function. Some reinforcement learning agents use a neural network in place of a value function, e.g., to approximate the outcome of the value function by processing the observation using the neural network and selecting an action based on the output of the neural network.
Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.