1. Field of the Invention
The present invention relates to an information processing apparatus, an information processing method, and a computer program. More specifically, the invention relates to an information processing apparatus, an information processing method, and a computer program that create a Partially Observable Markov Decision Process (POMDP) based on input data such as attribute data (attribute-value data) that associates attribute information with corresponding values.
2. Description of the Related Art
Processing that applies a Partially Observable Markov Decision Process (POMDP) is known as one method of state estimation and action determination. An outline of the POMDP is given below.
A POMDP is a method for performing state estimation and/or action determination by applying:
(a) state information (State space): S,
(b) action information (Action space): A,
(c) observation information (Observation space): O, and
(d) reward information (Reward space): R
where these information pieces vary with the passage of time t. State estimation and/or action determination is performed based on the obtainable information and a set of defined functions, namely a function of calculating a state transition probability, a function of calculating a reward, and a function of calculating the probability of occurrence of an observation state.
The functions to be defined and used may include:
a state transition probability calculation function $T(s_t, a_{t-1}, s_{t-1}) = P(s_t \mid a_{t-1}, s_{t-1})$ for calculating the probability of a transition from the state $S = s_{t-1}$ under the action $A = a_{t-1}$ at time $T = t-1$ to the state $S = s_t$ at the next time $T = t$,
a reward function $R(s_t, a_t)$ for calculating a reward from the state $S = s_t$ and action $A = a_t$ at time $T = t$, and
an observation state probability function $O(s_t, a_{t-1}, o_t) = P(o_t \mid a_{t-1}, s_t)$ for calculating the probability of occurrence of the observation $o_t$ at time $T = t$ from the action $A = a_{t-1}$ at time $T = t-1$ and the state $S = s_t$ at time $T = t$.
A POMDP performs state estimation and/or action determination by applying these information pieces and functions, and it is applicable to the determination of various actions. For example, a POMDP may be applied to determining an optimum action from only a small amount of obtainable information. More specifically, a POMDP is applicable to determining the actions of a robot, to computer simulations, to data processing, and to determining optimum actions for running an enterprise.
With reference to FIG. 1, state estimation and/or action determination processing by a POMDP that applies these information pieces will be described. FIG. 1 shows the state $s_{t-1}$, action $a_{t-1}$, reward $R_{t-1}$, and observation $o_{t-1}$ at time $T = t-1$, and the state $s_t$, action $a_t$, reward $R_t$, and observation $o_t$ at the subsequent time $T = t$. The arrows connecting the blocks indicate influences: the information or state at the origin (parent) of each arrow may influence the information or state at its destination (child).
For example, at time $T = t-1$, the reward $R_{t-1}$ is obtained from the state $s_{t-1}$ and action $a_{t-1}$ at that time by the reward function $R(s_{t-1}, a_{t-1})$ described above.
The observation information $o_{t-1}$ is, for example, observable information that varies with changes in the state $s_{t-1}$.
These relationships hold at every time $T = t-1, t, t+1$, and so on.
Across different times, the state $s_t$ at time $T = t$ is related to the state $s_{t-1}$ and action $a_{t-1}$ at time $T = t-1$ through the state transition probability calculation function $T(s_t, a_{t-1}, s_{t-1}) = P(s_t \mid a_{t-1}, s_{t-1})$. In other words, the probability of occurrence of the state $s_t$ at time $T = t$ is calculated from the state $s_{t-1}$ and action $a_{t-1}$ at the previous time $T = t-1$. This relationship typically holds between each pair of successive observation times.
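Chaining the transition function with the observation function gives the standard Bayes-filter (belief) update commonly used for POMDP state estimation. This is textbook POMDP material rather than something stated in the source, and the belief $b_{t-1}$, a probability distribution over states at time $T = t-1$, is a symbol introduced only for this sketch. The first line predicts the state distribution from the previous belief and action; the second corrects it with the observation $o_t$ and renormalizes.

```latex
\begin{aligned}
\bar{b}_t(s_t) &= \sum_{s_{t-1} \in S} T(s_t, a_{t-1}, s_{t-1})\, b_{t-1}(s_{t-1})
  && \text{(prediction)} \\
b_t(s_t) &= \frac{O(s_t, a_{t-1}, o_t)\, \bar{b}_t(s_t)}
  {\sum_{s' \in S} O(s', a_{t-1}, o_t)\, \bar{b}_t(s')}
  && \text{(correction)}
\end{aligned}
```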
In this way, a POMDP defines various information pieces (state, action, reward, and observation information) in a target domain that includes uncertainty and, based on the links among those information pieces, can estimate state transitions in that domain and/or determine its own action. In action determination processing, the action calculated to yield the highest reward is selected as the optimum action.
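The following Python sketch illustrates both uses described above on a hypothetical two-state model (all names and numbers are illustrative assumptions): state estimation by a Bayes-filter belief update that predicts with the transition probabilities and corrects with the observation probabilities, and action determination. The source says the action calculated to yield the highest reward is chosen; the sketch reads this, as an assumption, as the highest expected immediate reward under the current belief. Full POMDP planners usually maximize expected cumulative (often discounted) reward over a horizon, so this one-step greedy rule is a simplification.

```python
# Hypothetical two-state model; all names and values are illustrative assumptions.
STATES = ("ok", "faulty")
ACTIONS = ("run", "repair")

# TRANSITION[(s_prev, a_prev)][s_t] = P(s_t | a_{t-1}, s_{t-1})
TRANSITION = {
    ("ok", "run"): {"ok": 0.9, "faulty": 0.1},
    ("faulty", "run"): {"ok": 0.0, "faulty": 1.0},
    ("ok", "repair"): {"ok": 1.0, "faulty": 0.0},
    ("faulty", "repair"): {"ok": 0.8, "faulty": 0.2},
}
# OBSERVATION[(s_t, a_prev)][o_t] = P(o_t | a_{t-1}, s_t)
OBSERVATION = {
    ("ok", "run"): {"normal": 0.85, "alarm": 0.15},
    ("faulty", "run"): {"normal": 0.3, "alarm": 0.7},
    ("ok", "repair"): {"normal": 0.95, "alarm": 0.05},
    ("faulty", "repair"): {"normal": 0.5, "alarm": 0.5},
}
# REWARD[(s_t, a_t)] = R(s_t, a_t)
REWARD = {
    ("ok", "run"): 10.0,
    ("faulty", "run"): -5.0,
    ("ok", "repair"): -2.0,
    ("faulty", "repair"): -2.0,
}


def update_belief(belief, a_prev, o_t):
    """Bayes filter: predict with the transition model, then correct with the observation."""
    predicted = {
        s: sum(TRANSITION[(s_prev, a_prev)][s] * belief[s_prev] for s_prev in STATES)
        for s in STATES
    }
    unnormalized = {s: OBSERVATION[(s, a_prev)][o_t] * predicted[s] for s in STATES}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


def expected_reward(belief, action):
    """Expected immediate reward of an action under the current belief."""
    return sum(belief[s] * REWARD[(s, action)] for s in STATES)


def greedy_action(belief):
    """One-step greedy choice: the action with the highest expected immediate reward."""
    return max(ACTIONS, key=lambda a: expected_reward(belief, a))


b0 = {"ok": 0.5, "faulty": 0.5}
b1 = update_belief(b0, "run", "alarm")
chosen = greedy_action(b1)
```

Starting from an uninformed belief, an `alarm` observed after `run` shifts most of the probability mass to `faulty`, so the greedy rule then prefers `repair` over `run`.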
Notably, when constructing a POMDP, it is important to define the links among the information pieces (such as state, action, reward, and observation information) accurately, and a Bayesian Network is used for this purpose. A Bayesian Network is a network that includes multiple nodes and defines the links among them. Processing for creating and using a Bayesian Network is disclosed in US Patent Application Publications 2004/0220892 and 2002/0103793 (Patent Documents 1 and 2), which describe processing for creating a highly reliable Bayesian Network that defines the links among nodes accurately.
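The link structure of a Bayesian Network can be represented as a map from each node to its parent nodes, and that graph must be acyclic. The Python sketch below encodes one plausible reading of the links in FIG. 1, inferred from the transition, reward, and observation functions defined earlier; this reading and the node labels are assumptions, since the figure itself is not reproduced here. It uses the standard-library `graphlib` module (Python 3.9+) to check acyclicity and to produce an order in which every node follows all of its parents. Conditional probability tables are omitted; only the link structure is shown.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical parent sets for one time step, read off the function definitions:
#   s_t depends on (s_{t-1}, a_{t-1})  via T
#   o_t depends on (s_t, a_{t-1})      via O
#   R_t depends on (s_t, a_t)          via R
PARENTS = {
    "R_{t-1}": {"s_{t-1}", "a_{t-1}"},
    "s_t": {"s_{t-1}", "a_{t-1}"},
    "o_t": {"s_t", "a_{t-1}"},
    "R_t": {"s_t", "a_t"},
}


def evaluation_order(parents):
    """Return a node order in which every node comes after all of its parents.

    Raises ValueError if the links contain a cycle, which a Bayesian Network forbids.
    """
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        # The second element of CycleError.args holds the detected cycle.
        raise ValueError(f"links are not acyclic: {exc.args[1]}") from exc


order = evaluation_order(PARENTS)
```

An order of this kind is what allows each node's conditional probability to be evaluated only after the values of its parents are known.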
As described above, in the POMDP model explained with reference to FIG. 1, it is important to define various information pieces such as:
(a) state information (State space): S,
(b) action information (Action space): A,
(c) observation information (Observation space): O, and
(d) reward information (Reward space): R, and
a function of calculating a state transition probability, a function of calculating a reward, a function of calculating the probability of occurrence of an observation state, and so on. Specialized knowledge and experience are therefore important in constructing a POMDP model.