1. Field of the Invention
The present invention relates to actor-critic fuzzy reinforcement learning (ACFRL), and particularly to a system controlled by a convergent ACFRL methodology.
2. Discussion of the Related Art
Reinforcement learning techniques provide powerful methodologies for learning through interactions with the environment. Earlier, in ARIC (Berenji, 1992) and GARIC (Berenji and Khedkar, 1992), fuzzy set theory was used to generalize the experience obtained through reinforcement learning between similar states of the environment. In recent years, we have extended Fuzzy Reinforcement Learning (FRL) for use in a team of heterogeneous intelligent agents who collaborate with each other (Berenji and Vengerov, 1999, 2000). It is desired to have a fuzzy system that is tunable or capable of learning from experience, such that as it learns, its actions, which are based on the content of its tunable fuzzy rulebase, approach an optimal policy.
The use of policy gradient in reinforcement learning was first introduced by Williams (1992) in his actor-only REINFORCE algorithm. The algorithm finds an unbiased estimate of the gradient without the assistance of a learned value function. As a result, REINFORCE learns much more slowly than RL methods that rely on the value function, and it has received relatively little attention. Recently, Baxter and Bartlett (2000) extended the REINFORCE algorithm to partially observable Markov decision processes (POMDPs). However, learning a value function and using it to reduce the variance of the gradient estimate appears to be the key to successful practical applications of reinforcement learning.
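The variance argument above can be illustrated with a minimal sketch. It is not the method of this invention or of any cited work. It assumes a hypothetical single-state problem with two actions, a softmax policy, and Gaussian reward noise, and it uses the exactly computed state value as the baseline in place of a learned value function. All names (`reinforce_estimates`, `mean_rewards`, and so on) are illustrative.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(prefs, action):
    """Gradient of log pi(action) w.r.t. each preference: 1[i == action] - pi(i)."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_estimates(prefs, mean_rewards, n_samples, baseline, rng):
    """Per-sample score-function estimates (reward - baseline) * grad log pi(action)."""
    probs = softmax(prefs)
    estimates = []
    for _ in range(n_samples):
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 1.0)  # assumed noise model
        score = grad_log_softmax(prefs, action)
        estimates.append([(reward - baseline) * s for s in score])
    return estimates


def mean_and_variance(estimates, component=0):
    """Sample mean and variance of one component of the gradient estimates."""
    xs = [e[component] for e in estimates]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


rng = random.Random(0)
prefs = [0.0, 0.0]            # uniform policy over two actions
mean_rewards = [10.0, 11.0]   # hypothetical expected rewards

# Actor-only estimate: no value function, baseline of zero.
plain = reinforce_estimates(prefs, mean_rewards, 20000, 0.0, rng)

# Estimate using the state value as a baseline (computed exactly here,
# standing in for a learned value function).
value = sum(p * r for p, r in zip(softmax(prefs), mean_rewards))
with_baseline = reinforce_estimates(prefs, mean_rewards, 20000, value, rng)

mean_plain, var_plain = mean_and_variance(plain)
mean_base, var_base = mean_and_variance(with_baseline)
print("no baseline:   mean", mean_plain, "variance", var_plain)
print("with baseline: mean", mean_base, "variance", var_base)
```

Under these assumptions both estimators target the same gradient component, but subtracting the state value removes the large common reward offset, so far fewer samples are needed to obtain a usable estimate.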
The theoretical result closest to this invention is the one by Sutton et al. (2000). That work derives exactly the same expression for the policy gradient with function approximation as the one used by Konda and Tsitsiklis. However, the parameter updating algorithm that Sutton et al. propose based on this expression is not practical. At each iteration, it requires estimating the steady-state probabilities under the current policy and solving a nonlinear programming problem to determine the new values of the actor's parameters. Another similar result is the VAPS family of methods by Baird and Moore (1999). However, VAPS methods optimize a measure that combines policy performance with the accuracy of the value function approximation. As a result, VAPS methods converge to a locally optimal policy only when no weight is put on value function accuracy, in which case VAPS degenerates to actor-only methods.
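For reference, the policy gradient expression attributed above to Sutton et al. (2000) is commonly written as follows. The notation is that of the policy gradient literature and is an assumption here, not necessarily the notation used elsewhere in this document: \rho is the long-term performance of the policy, \theta is the vector of actor parameters, d^{\pi}(s) is the steady-state probability of state s under policy \pi, \pi(s,a) is the probability of taking action a in state s, and Q^{\pi}(s,a) is the action value.

```latex
\frac{\partial \rho}{\partial \theta}
  = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta} \, Q^{\pi}(s,a)
```

The appearance of d^{\pi}(s) in this expression is why an algorithm that evaluates it directly must estimate the steady-state probabilities for the policy at each iteration.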