This specification relates to reinforcement learning systems.
In a reinforcement learning system, an agent interacts with an environment by receiving an observation that either fully or partially characterizes the current state of the environment, and in response, performing an action selected from a predetermined set of actions. The reinforcement learning system receives rewards from the environment in response to the agent performing actions and selects the action to be performed by the agent in response to receiving a given observation in accordance with an output of a value function representation. The value function representation takes as an input an observation and an action and outputs a numerical value that is an estimate of the expected rewards resulting from the agent performing the action in response to the observation.
Some reinforcement learning systems use a neural network to represent the value function. That is, the system uses a neural network that is configured to receive an observation and an action and to process the observation and the action to generate a value function estimate.
Neural networks are machine learning models that employ one or more layers of nonlinear units to generate an output for a received input. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
Some other reinforcement learning systems use a tabular representation of the value function. That is, the system maintains a table or other data structure that maps combinations of observations and actions to value function estimates for the observation-action combinations.