The present invention relates generally to a general-purpose learning controller implemented as an algorithm. More particularly, the algorithm stores two functions and updates them on the basis of reinforcement received from the environment.
The invention has potential uses in aircraft (for example, in flight control), vehicles, robots, and manufacturing automation. One of the most general problems in control theory is that of creating an optimal controller for a nonlinear, stochastic, poorly modeled system.
The following United States patents are of interest.
U.S. Pat. No. 5,257,343--Kyuma et al.
U.S. Pat. No. 5,250,886--Yasuhara et al.
Neither of the above patents discloses an algorithm for reinforcement learning that requires only a constant amount of calculation per time step, independent of the number of possible actions, the number of possible outcomes of a given action, or the number of states. The patent to Kyuma et al. discloses an intelligence information system composed of an associative memory and a serial processing computer. The patent to Yasuhara et al. discloses a method of storing teaching points of a robot: when teaching points for a plurality of moving units are input, information identifying the moving units associated with those teaching points is also input, and the teaching points and the identification data are stored in a single area of a memory.
References
Baird, L. C. (1992). Function minimization for dynamic programming using connectionist networks. Proceedings of the IEEE Conference on Systems, Man, and Cybernetics (pp. 19-24). Chicago, Ill.
Baird, L. C., & Klopf, A. H. (1993a). A hierarchical network of provably optimal learning control systems: Extensions of the associative control process (ACP) network. Adaptive Behavior, 1(3), 321-352.
Baird, L. C., & Klopf, A. H. (1993b). Reinforcement Learning with High-Dimensional, Continuous Actions. To appear as a United States Air Force technical report.
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, N.J.: Prentice-Hall.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. Proceedings of the Fifth Conference on Neural Information Processing Systems (pp. 295-302). Morgan Kaufmann.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671-692.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1993). On the Convergence of Stochastic Iterative Dynamic Programming Algorithms (Tech. Rep. 9307). Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Mass.
Jacobson, D. H., & Mayne, D. Q. (1970). Differential Dynamic Programming. New York: American Elsevier Publishing Company.
Klopf, A. H., Morgan, J. S., & Weaver, S. E. (1993). A hierarchical network of control systems that learn: Modeling nervous system function during classical and instrumental conditioning. Adaptive Behavior, 1(3), 263-319.
Nguyen, D. H., & Widrow, B. (1990). Neural networks for self-learning control systems. IEEE Control Systems Magazine, (April), 18-23.
Ross, S. (1983). Introduction to Stochastic Dynamic Programming. New York: Academic Press.
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. Proceedings of the Tenth International Conference on Machine Learning (pp. 298-305). Amherst, Mass.
Sutton, R. S. (1990a). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning.
Sutton, R. S. (1990b). Talk on a new performance measure for reinforcement learning, presented at GTE Laboratories, Waltham, Mass., 11 September.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 257-277.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral thesis, Cambridge University, Cambridge, England.
Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8(3/4), 279-292.
White, D. A., & Sofge, D. A. (1990). Neural network based process optimization and control. Proceedings of the 29th Conference on Decision and Control (pp. 3270-3276). Honolulu, Hi.
White, D. A., & Sofge, D. A. (Eds.). (1992). Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. New York: Van Nostrand Reinhold.
Williams, R. J., & Baird, L. C. (1990). A mathematical analysis of actor-critic architectures for learning optimal control through incremental dynamic programming. Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems (pp. 96-101). New Haven, Conn.
Williams, R. J., & Baird, L. C. (1993). Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems (Tech. Rep. NU-CCS-93-11). Boston, Mass.: Northeastern University, College of Computer Science.