1. Field of the Invention
This invention generally pertains to systems having states, and in particular to methods for determining a sequence of actions for such systems.
2. Discussion of the Related Art
A generalized method and arrangement for determining a sequence of actions for a system having states, wherein a transition in state between two states is performed on the basis of an action, is discussed by Neuneier in “Enhancing Q-Learning for Optimal Asset Allocation”, appearing in the Proceedings of the Neural Information Processing Systems, NIPS 1997. Neuneier describes a financial market as an example of a system which has states. His system is described as a Markov Decision Problem (MDP).
The characteristics of a Markov Decision Problem are represented below by way of summary:
$X$: set of possible states of the system, e.g. $X = \mathbb{R}^m$,
$A(x_t)$: set of possible actions in the state $x_t$,
$p(x_{t+1} \mid x_t, a_t)$: transition probability,
$r(x_t, a_t, x_{t+1})$: gain with expectation $R(x_t, a_t)$.
Starting from observable variables, the variables denoted below as training data, the aim is to determine a strategy, that is to say a sequence of functions
$$\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}, \tag{3}$$
which at each instant $t$ map each state into an action rule, that is to say an action
$$\mu_t(x_t) = a_t. \tag{4}$$
Such a strategy is evaluated by an optimization function.
The optimization function specifies the expectation of the gains accumulated over time for a given strategy π and a start state x0.
The so-called Q-learning method is described by Neuneier as an example of a method of approximative dynamic programming.
An optimum evaluation function $V^*(x)$ is defined by
$$V^*(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X, \tag{5}$$
where
$$V^{\pi}(x) = E\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(x_t, \mu_t, x_{t+1}) \,\middle|\, x_0 = x\right], \tag{6}$$
$\gamma$ denoting a prescribable reduction factor which is formed in accordance with the following rule:
$$\gamma = \frac{1}{1+z}, \tag{7}$$
$$z \in \mathbb{R}^{+}. \tag{8}$$
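By way of illustration only, the reduction factor of rule (7) and a finite truncation of the discounted sum inside the expectation of rule (6) may be sketched as follows. The function names `discount_factor` and `discounted_return` and the sample gains are hypothetical and do not appear in the cited reference; the sketch evaluates a single sampled trajectory rather than the expectation itself.

```python
def discount_factor(z: float) -> float:
    """Reduction factor gamma = 1 / (1 + z) for z > 0, following rules (7) and (8)."""
    if z <= 0:
        raise ValueError("z must be positive")
    return 1.0 / (1.0 + z)


def discounted_return(gains, gamma: float) -> float:
    """Sum of gamma**t * r_t over one finite trajectory of gains (hypothetical helper)."""
    return sum(gamma ** t * r for t, r in enumerate(gains))


gamma = discount_factor(0.25)                      # 1 / 1.25
total = discounted_return([1.0, 1.0, 1.0], gamma)  # 1 + gamma + gamma**2
```

Because $z$ is positive, the factor always lies strictly between 0 and 1, so later gains contribute less to the sum than earlier ones.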
A Q-evaluation function Q*(xt,at) is formed within the Q-learning method for each pair (state xt, action at) in accordance with the following rule:
$$Q^*(x_t, a_t) := \sum_{x \in X} p(x_{t+1} \mid x_t, a_t) \cdot r_t + \gamma \cdot \sum_{x \in X} p(x \mid x_t, a_t) \cdot \max_{a \in A} \left(Q^*(x, a)\right) \tag{9}$$
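For a small system whose transition probabilities are known, rule (9) can be iterated directly until the Q-values settle. The following is a minimal sketch under that assumption. The nested lists `P[x][a][y]` and `R[x][a][y]`, the function name `q_value_iteration` and the two-state example are hypothetical and serve only to make the rule concrete.

```python
def q_value_iteration(P, R, gamma, sweeps=200):
    """Iterate rule (9) on a finite system with a known model.

    P[x][a][y]: probability of reaching state y from state x under action a
    R[x][a][y]: gain of that transition
    """
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        best = [max(row) for row in Q]  # max over actions of Q(y, a), per state y
        Q = [
            [
                sum(P[x][a][y] * (R[x][a][y] + gamma * best[y])
                    for y in range(n_states))
                for a in range(n_actions)
            ]
            for x in range(n_states)
        ]
    return Q


# Hypothetical example: action 0 stays in the current state, action 1 switches
# to the other state; arriving in state 1 yields a gain of 1, otherwise 0.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
R = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
Q = q_value_iteration(P, R, gamma=0.5)
```

In this example the Q-values converge towards 2 for the actions that lead into state 1 and towards 1 for the actions that lead into state 0.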
On the basis of each tuple $(x_t, x_{t+1}, a_t, r_t)$, the Q-values $Q^*(x, a)$ are adapted in the $(k+1)$-th iteration with a prescribed learning rate $\eta_k$ in accordance with the following learning rule:
$$Q_{k+1}(x_t, a_t) = (1 - \eta_k)\, Q_k(x_t, a_t) + \eta_k \left(r_t + \gamma \max_{a \in A} \left(Q_k(x_{t+1}, a)\right)\right). \tag{10}$$
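A single application of learning rule (10) to a table of Q-values may be sketched as follows. The dictionary-based table, the function name `q_learning_step`, and the state and action labels are hypothetical; only the update itself follows the rule.

```python
from collections import defaultdict


def q_learning_step(Q, x, a, r, x_next, actions, eta, gamma):
    """Apply rule (10) once, in place, for the tuple (x, x_next, a, r)."""
    best_next = max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] = (1.0 - eta) * Q[(x, a)] + eta * (r + gamma * best_next)
    return Q[(x, a)]


# Hypothetical table: unseen (state, action) pairs default to a Q-value of 0.
Q = defaultdict(float)
Q[("s1", "hold")] = 2.0
new_value = q_learning_step(
    Q, "s0", "buy", r=1.0, x_next="s1",
    actions=["buy", "hold"], eta=0.5, gamma=0.9,
)
# new_value = 0.5 * 0 + 0.5 * (1 + 0.9 * 2), i.e. about 1.4
```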
Usually, the Q-values $Q^*(x, a)$ are approximated separately for each action $a$ by a function approximator with a weighting vector $w^a$, which contains the weights of that function approximator.
A function approximator is, for example, a neural network, a polynomial classifier, or a combination of a neural network with a polynomial classifier.
It therefore holds that:
$$Q^*(x, a) \approx Q(x; w^a). \tag{11}$$
Changes in the weights in the weighting vector wa are based on a temporal difference dt which is formed in accordance with the following rule:
$$d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t}) \tag{12}$$
The following adaptation rule for the weights of the neural network, which are included in the weighting vector wa, follows for the Q-learning method with the use of a neural network:
$$w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t}). \tag{13}$$
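Rules (12) and (13) may be illustrated with a deliberately simple function approximator. The sketch below assumes a linear approximator $Q(x; w^a) = w^a \cdot x$ in place of the neural network, so that the gradient with respect to the weights is the state vector itself. The names `q_value` and `q_learning_weight_step` and the numerical values are hypothetical.

```python
def q_value(x, w):
    """Linear approximator Q(x; w) = w . x (an assumption made for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def q_learning_weight_step(weights, x, a, r, x_next, eta, gamma):
    """One step of rules (12) and (13); weights maps each action to its vector w^a."""
    # Temporal difference d_t of rule (12)
    d = (r
         + gamma * max(q_value(x_next, w) for w in weights.values())
         - q_value(x, weights[a]))
    # For the linear approximator the gradient of Q(x; w) with respect to w is x.
    weights[a] = [wi + eta * d * xi for wi, xi in zip(weights[a], x)]
    return d


weights = {"buy": [0.0, 0.0], "sell": [1.0, 0.0]}
d = q_learning_weight_step(
    weights, x=[1.0, 2.0], a="buy", r=1.0, x_next=[2.0, 0.0], eta=0.1, gamma=0.5,
)
```

Only the weighting vector of the executed action is changed; the vectors of the other actions enter solely through the maximum in the temporal difference.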
The neural network representing the system of a financial market as described by Neuneier is trained using the training data which describe information on changes in prices on a financial market as time series values.
A further method of approximative dynamic programming is the so-called TD(λ) learning method. This method is discussed by R.S. Sutton in “Learning To Predict By The Method Of Temporal Differences”, appearing in Machine Learning, Vol. 3, pages 9–44, 1988.
Furthermore, it is known from M. Heger, “Risk and Reinforcement Learning: Concepts and Dynamic Programming”, ZKW Bericht No. Aug. 1994, Zentrum für Kognitionswissenschaften [Center for Cognitive Sciences], Bremen University, December 1994, that risk is associated with a strategy π and an initial state xt. A method for risk avoidance is also discussed by Heger, cited above.
The following optimization function, which is also referred to as an expanded Q-function $\underline{Q}^{\pi}(x_t, a_t)$, is used in the Heger method:
$$\underline{Q}^{\pi}(x_t, a_t) := r(x_t, a_t, x_{t+1}) + \inf_{\substack{x_0, x_1, \ldots \\ p(x_0, x_1, \ldots) > 0}} \left\{\sum_{k=1}^{\infty} \gamma^{k}\, r(x_k, \pi(x_k), x_{k+1})\right\} \tag{14}$$
The expanded Q-function $\underline{Q}^{\pi}(x_t, a_t)$ describes the worst case if the action $a_t$ is executed in the state $x_t$ and the strategy π is followed thereupon.
The optimization function $\underline{Q}^{\pi}(x_t, a_t)$ for
$$\underline{Q}^{*}(x_t, a_t) := \max_{\pi \in \Pi} \underline{Q}^{\pi}(x_t, a_t) \tag{15}$$
is given by the following rule:
$$\underline{Q}^{*}(x_t, a_t) = \min_{\substack{x \in X \\ p(x_{t+1} \mid x_t, a_t) > 0}} \left(r(x_t, a_t, x) + \gamma \cdot \max_{a \in A} \underline{Q}^{*}(x, a)\right). \tag{16}$$
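The effect of rule (16) can be made concrete on a small system with a known model. The following sketch replaces the expectation over successor states by the minimum over all successors that have positive probability. The nested lists `P[x][a][y]` and `R[x][a][y]`, the function name `worst_case_q_iteration` and the three-state example are hypothetical.

```python
def worst_case_q_iteration(P, R, gamma, sweeps=200):
    """Iterate rule (16): minimum over all successors with positive probability."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        best = [max(row) for row in Q]  # max over actions of Q(y, a), per state y
        Q = [
            [
                min(R[x][a][y] + gamma * best[y]
                    for y in range(n_states) if P[x][a][y] > 0)
                for a in range(n_actions)
            ]
            for x in range(n_states)
        ]
    return Q


# Hypothetical example: in state 0, action 0 ("safe") always yields a gain of 1;
# action 1 ("risky") yields 10 with probability 0.9 and -5 with probability 0.1.
# States 1 and 2 are absorbing and yield no further gain.
P = [
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [
    [[0.0, 1.0, 0.0], [0.0, 10.0, -5.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
Q = worst_case_q_iteration(P, R, gamma=0.9)
```

In this example the risky action has the higher expected gain (8.5 against 1), yet its worst-case value is −5, so a strategy based on rule (16) selects the safe action regardless of how improbable the loss is.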
A substantial disadvantage of this mode of procedure is that only the worst case is taken into account when finding the strategy. However, this does not adequately reflect the requirements of a wide variety of technical systems.
In “Dynamic Programming and Optimal Control”, Athena Scientific, Belmont, Mass., 1995, D.P. Bertsekas formulates access control for a communications network and routing within the communications network as a problem of dynamic programming.
Therefore, the present invention is based on the problem of specifying a method and a system for determining a sequence of actions which achieve increased flexibility in determining the strategy needed.
In a method for computer-aided determination of a sequence of actions for a system which has states, a transition in state between two states being performed on the basis of an action, the determination of the sequence of actions is performed in such a way that a sequence of states resulting from the sequence of actions is optimized with regard to a prescribed optimization function, the optimization function including a variable parameter with the aid of which it is possible to set a risk which the resulting sequence of states has with respect to a prescribed state of the system.
A system for determining a sequence of actions for a system which has states, a transition in state between two states being performed on the basis of an action, has a processor which is set up in such a way that the determination of the sequence of actions can be performed in such a way that a sequence of states resulting from the sequence of actions is optimized with regard to a prescribed optimization function, the optimization function including a variable parameter with the aid of which it is possible to set a risk which the resulting sequence of states has with respect to a prescribed state of the system.
Thus, the present invention offers a method for determining a sequence of actions at a freely prescribable level of accuracy when finding a strategy for a possible closed-loop control or open-loop control of the system, in general for influencing it. Hence, the embodiments described below are valid both for the method and for the system.
Approximative dynamic programming is used for the purpose of determination, for example a method based on Q-learning or a method based on TD(λ)-learning.
Within Q-learning, the optimization function OFQ is preferably formed in accordance with the following rule:
$$OF_Q = Q(x; w^a),$$
$x$ denoting a state in a state space $X$,
$a$ denoting an action from an action space $A$, and
$w^a$ denoting the weights of a function approximator which belong to the action $a$.
The following adaptation step is executed during Q-learning in order to determine the optimum weights wa of the function approximator:
$$w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})$$
with the abbreviation
$$d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}),$$
$x_t$, $x_{t+1}$ respectively denoting a state in the state space $X$,
$a_t$ denoting an action from an action space $A$,
$\gamma$ denoting a prescribable reduction factor,
$w_t^{a_t}$ denoting the weighting vector associated with the action $a_t$ before the adaptation step,
$w_{t+1}^{a_t}$ denoting the weighting vector associated with the action $a_t$ after the adaptation step,
$\eta_t$ ($t = 1, \ldots$) denoting a prescribable step size sequence,
$\kappa \in [-1; 1]$ denoting a risk monitoring parameter,
$\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,
$\nabla Q(\cdot\,;\cdot)$ denoting the derivative of the function approximator with respect to its weights, and
$r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the subsequent state $x_{t+1}$.
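The adaptation step for Q-learning with the risk monitoring function $(1 - \kappa\,\mathrm{sign}(\xi))\,\xi$ may be sketched as follows. As an assumption made only for this illustration, the function approximator is linear, $Q(x; w^a) = w^a \cdot x$, so that its gradient with respect to the weights is the state vector. The names `risk_transform`, `q_value` and `risk_sensitive_q_step` and all numerical values are hypothetical.

```python
def risk_transform(xi, kappa):
    """Risk monitoring function (1 - kappa * sign(xi)) * xi for kappa in [-1, 1]."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi


def q_value(x, w):
    """Linear approximator Q(x; w) = w . x (an assumption made for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def risk_sensitive_q_step(weights, x, a, r, x_next, eta, gamma, kappa):
    """One adaptation step; weights maps each action to its weighting vector."""
    d = (r
         + gamma * max(q_value(x_next, w) for w in weights.values())
         - q_value(x, weights[a]))
    step = eta * risk_transform(d, kappa)
    # Gradient of the linear approximator with respect to its weights is x.
    weights[a] = [wi + step * xi for wi, xi in zip(weights[a], x)]
    return d


gain_side = risk_transform(2.0, kappa=0.5)    # (1 - 0.5) * 2
loss_side = risk_transform(-2.0, kappa=0.5)   # (1 + 0.5) * (-2)

weights = {"buy": [0.0, 0.0], "sell": [1.0, 0.0]}
d = risk_sensitive_q_step(
    weights, x=[1.0, 2.0], a="buy", r=1.0, x_next=[2.0, 0.0],
    eta=0.1, gamma=0.5, kappa=0.5,
)
```

With $\kappa = 0$ the function leaves the temporal difference unchanged and the step coincides with rule (13). A positive $\kappa$ scales down positive temporal differences and scales up negative ones, and a negative $\kappa$ does the reverse.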
The optimization function is preferably formed in accordance with the following rule within the TD(λ)-learning method:
$$OF_{TD} = J(x; w),$$
x denoting a state in a state space X,
a denoting an action from an action space A, and
w denoting the weights of a function approximator.
The following adaptation step is executed during TD(λ)-learning in order to determine the optimum weights $w$ of the function approximator:
$$w_{t+1} = w_t + \eta_t \cdot \aleph^{\kappa}(d_t) \cdot z_t$$
with the abbreviations
$$d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t),$$
$$z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t),$$
$$z_1 = 0,$$
$x_t$, $x_{t+1}$ respectively denoting a state in the state space $X$,
$a_t$ denoting an action from an action space $A$,
$\gamma$ denoting a prescribable reduction factor,
$w_t$ denoting the weighting vector before the adaptation step,
$w_{t+1}$ denoting the weighting vector after the adaptation step,
$\eta_t$ ($t = 1, \ldots$) denoting a prescribable step size sequence,
$\kappa \in [-1; 1]$ denoting a risk monitoring parameter,
$\aleph^{\kappa}$ denoting a risk monitoring function $\aleph^{\kappa}(\xi) = (1 - \kappa\,\mathrm{sign}(\xi))\,\xi$,
$\nabla J(\cdot\,;\cdot)$ denoting the derivative of the function approximator with respect to its weights, and
$r(x_t, a_t, x_{t+1})$ denoting a gain upon the transition of state from the state $x_t$ to the subsequent state $x_{t+1}$.
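The TD(λ) adaptation step may be sketched in the same manner. The sketch assumes a linear function approximator $J(x; w) = w \cdot x$, a constant step size, and an eligibility vector that is zero before the first transition is processed; these choices, the names `risk_transform`, `j_value` and `risk_sensitive_td_lambda`, and the sample transitions are hypothetical.

```python
def risk_transform(xi, kappa):
    """Risk monitoring function (1 - kappa * sign(xi)) * xi for kappa in [-1, 1]."""
    sign = (xi > 0) - (xi < 0)
    return (1.0 - kappa * sign) * xi


def j_value(x, w):
    """Linear approximator J(x; w) = w . x (an assumption made for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def risk_sensitive_td_lambda(transitions, w, eta, gamma, lam, kappa):
    """Process (x, r, x_next) transitions in order and return the adapted weights."""
    z = [0.0] * len(w)  # eligibility vector, assumed zero before the first step
    for x, r, x_next in transitions:
        d = r + gamma * j_value(x_next, w) - j_value(x, w)
        # Gradient of the linear approximator with respect to its weights is x.
        z = [lam * gamma * zi + xi for zi, xi in zip(z, x)]
        step = eta * risk_transform(d, kappa)
        w = [wi + step * zi for wi, zi in zip(w, z)]
    return w


transitions = [
    ([1.0, 0.0], 1.0, [0.0, 1.0]),   # gain of 1 on the first transition
    ([0.0, 1.0], -1.0, [0.0, 0.0]),  # loss of 1 on the second transition
]
w = risk_sensitive_td_lambda(
    transitions, w=[0.0, 0.0], eta=0.5, gamma=0.5, lam=0.5, kappa=0.5,
)
```

Because the eligibility vector still carries a share of the first state when the loss occurs, the second step also lowers the weight of that earlier state, and with a positive $\kappa$ it does so more strongly than a gain of equal size would have raised it.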