The present disclosure relates to a data processing device, a data processing method, and a program, and particularly to a data processing device, a data processing method, and a program which enable an agent that can autonomously perform various actions (autonomous agent) to efficiently perform learning of an unknown environment.
For example, reinforcement learning is a learning method by which an agent capable of performing actions, such as a robot acting in the real world or a virtual character acting in a virtual world, learns rules of action stage by stage while acting in an unknown environment (Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore, “Reinforcement Learning: A Survey”, Journal of Artificial Intelligence Research 4 (1996) 237-285).
In reinforcement learning, the agent recognizes its current state based on an observation value observed from the outside (the environment or the like), and an action value is calculated (estimated) for each action U that the agent can perform in the current state in order to reach a targeted state (target state).
Once the action values for reaching the target state have been calculated, the agent can reach the target state by controlling its actions based on those action values.
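The calculation of action values and the control of actions based on them can be illustrated with a minimal sketch. The sketch below assumes tabular Q-learning, which is one common way of estimating action values and is not stated to be the method of the cited survey or of the present disclosure. The five-state line environment, the reward of 1.0 at the target state, the function names, and all parameter values are hypothetical and chosen only for illustration.

```python
import random

N_STATES = 5        # hypothetical states 0..4 on a line; state 4 is the target state
ACTIONS = (-1, +1)  # hypothetical actions U: step left, step right


def step(state, action):
    """Hypothetical environment: move along the line, clipped at both ends."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def learn_action_values(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Estimate an action value for every (state, action) pair by Q-learning."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES - 1)  # start anywhere except the target
        for _ in range(100):                 # cap the episode length
            # Epsilon-greedy: mostly follow current action values, sometimes explore.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            next_state, reward = step(state, ACTIONS[a])
            done = next_state == N_STATES - 1
            # One-step target: reward plus discounted best value of the next state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


def greedy_action(q, state):
    """Control the action based on the action values: pick the highest-valued one."""
    best = max(range(len(ACTIONS)), key=lambda i: q[state][i])
    return ACTIONS[best]


q_table = learn_action_values()
policy = [greedy_action(q_table, s) for s in range(N_STATES - 1)]
```

Here `q_table[s][i]` plays the role of the action value of action `ACTIONS[i]` in current state `s`, and `policy` lists the action selected in each non-target state. In this toy setting the selected actions should step toward the target state, since the action values grow as the target state gets closer.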