1. Field of Invention
The present invention generally relates to policy-based controllers and policy-based process servers.
2. Background: Discussion of Prior Art
This section puts the invention into its proper context. We provide a cursory background and define required terminology. Readers unfamiliar with stochastic control, reinforcement learning, or optimal process control may find the next several subsections helpful in defining the fundamental underlying technologies. Readers very familiar with these topics should at least skim these sections to review general terminology.
A. Scope of Applicability and Main Concepts
This invention is closely related to the technologies of Stochastic Control and Reinforcement Learning. Control systems technology is well developed and has numerous sub-areas. Because of this, the reader may be accustomed to different terminology for the concepts used here. The terminology we use is in line with the definitions employed in [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998], which provide background survey information, tutorial treatment, precise definitions of the technical concepts discussed here, and a clear explanation of the prior art.
Any concepts that are not standard fare in these references are defined here in order to provide a self-contained description. We try to introduce a bare minimum of technical jargon. Crucial technical definitions are formalized using mathematical notation in the sections titled "Formal Definition of Prior Art" and "Formal Definition of the Mixture of Policies Framework."
1. Separation of Policy and Execution
In the technical jargon of control theory, the mapping of a stimulus to a set of action tendencies is referred to as a "policy." Given a set of candidate actions and a stimulus, a policy is a function that recommends one or more actions in response to the given stimulus. Stochastic Control pertains to the technology of using a stochastic policy to control action selection processes. FIGS. 1A and 1B illustrate examples of policies. An action selection module then uses a policy to guide its selection of the action or actions from the permissible set of candidate actions. Some control mechanisms specified in the prior art do not separate policy from execution, but here we do. The essential concepts remain the same whether the execution mechanism is inextricably intertwined with the policy data structure or separated, as is the case here. The policy "recommends" actions; the action selection module "executes" one or more actions according to this recommendation. This execution mechanism can be straightforward, such as the greedy method of always selecting the highest ranked action. Or it can be more involved, for example when additional checks are made to determine whether an action will conflict with other ongoing actions before triggering it. (See the tutorial references [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998] for more discussion of how to convert policy information into action selection procedures.)
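The separation of policy from execution described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the stimulus labels, the action names, the weights, and the conflict table are all hypothetical and do not come from the cited references.

```python
from typing import Dict, Optional, Set


def example_policy(stimulus: str) -> Dict[str, float]:
    """Hypothetical policy: recommends actions by assigning each a weight."""
    if stimulus == "cold":
        return {"heat_on": 0.7, "heat_off": 0.2, "fan_on": 0.1}
    return {"heat_on": 0.1, "heat_off": 0.7, "fan_on": 0.2}


def select_greedy(policy, stimulus: str) -> str:
    """Execution step: always pick the highest ranked action."""
    weights = policy(stimulus)
    return max(weights, key=weights.get)


# Illustrative (assumed) table of actions that must not run together.
CONFLICTS: Dict[str, Set[str]] = {"heat_on": {"fan_on"}, "fan_on": {"heat_on"}}


def select_with_conflict_check(policy, stimulus: str, ongoing: Set[str]) -> Optional[str]:
    """Execution step with an extra check: drop actions that conflict with
    ongoing actions, then pick the highest ranked remaining action."""
    weights = policy(stimulus)
    allowed = {
        action: weight
        for action, weight in weights.items()
        if not (CONFLICTS.get(action, set()) & ongoing)
    }
    return max(allowed, key=allowed.get) if allowed else None


print(select_greedy(example_policy, "cold"))
print(select_with_conflict_check(example_policy, "cold", {"fan_on"}))
```

The same policy function feeds both selection procedures, which is the point of the separation: the recommendation does not change when the execution mechanism does.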
2. Controllers can Trigger "Actions" as well as "Procedures"
Although we speak about "actions" and "action selection," the controllers described in this document can also regulate procedures. Therefore, an "action selection module" as defined here can control (a) instantaneous actions and (b) ballistic (non-interruptible and non-modifiable) action sequences, but can also regulate (c) ongoing physical processes or (d) branching procedures.
Actions controlled or initiated by a policy can be
1. Momentary or instantaneous: e.g., flash a light bulb, flip a switch.
2. Continuous: e.g., gradually increment the temperature of a furnace over time.
3. Procedural: initiate a multiple step and possibly branching computer program.
Furthermore, actions can be
1. Discrete: e.g., a database containing a finite set of actions indexed by an integer record pointer. An example of this is a web-based ad server that displays a particular ad targeted at a website visitor.
2. Continuous: e.g., a possibly multidimensional control signal indexed by a point within a Euclidean vector space, such as an electronic control system. An example of this is an electronic vacuum pressure regulator inside an automobile.
We refer to actions for simplicity but without loss of generality because an action can mean triggering a procedure, parameterizing the initial state of a procedure, or modifying state information used by an ongoing procedure.
3. Compatible with Reinforcement Learning Technologies
Although this invention does not provide new technology for learning per se, all the policy and control mechanisms described here are compatible with the general framework of reinforcement learning theory. As is apparent from the prior art, the general approach used here (i.e., encapsulation and modularization of the data structures and mechanisms involved in formulating policy and executing policy) reduces the computational burden of obtaining policy information. Various statistical, computational, and programming technologies can be applied to obtain a policy. These technologies are well developed and include a wide variety of computational, statistical, and electronic methods. Methods for obtaining or refining policy include (a) explicit programming, (b) direct computation, (c) evolutionary design, (d) evolutionary programming, (e) computerized discovery over historical data stores, (f) computerized statistical inference over historical data stores, (g) computerized real-time direct search, and (h) real-time reinforcement learning. See [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998] for a review and additional references.
A policy can be
1. Probabilistic: actions are weighted by a probability distribution over the action database. In this case the action selection module picks one action at random drawn according to this distribution. See for instance FIG. 1A.
2. Deterministic: only a single action is recommended. See for instance FIG. 1B.
The field of Reinforcement Learning provides technologies for systematically learning, discovering, or evolving policies suitable for stochastic control. Reinforcement learning theory is a fairly mature technology. The field of Fuzzy Control modifies this functionality to allow the following:
3. Fuzzy Membership Assignment: a distribution (possibly non-probabilistic) is applied over the actions in the action database. See FIG. 1C.
Given a fuzzy policy, the action selection module simultaneously applies one or more of the actions. Therefore, fuzzy control as defined here allows multiple actions to be triggered in parallel. Moreover, the action selection mechanism may also utilize the weighting specified by the distribution to initialize parameters of each action. See for instance FIG. 1C.
The definition of Fuzzy Policy we use here may be inconsistent with definitions used in prior art, and is not included in the tutorial treatment explained in [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998], which concentrate exclusively on stochastic control. However, Fuzzy Policy as defined here is related to "fuzzy sets" in that they both specify "degree of membership" rather than "probability." Fuzzy Policy as defined here also allows more than one action to be selected in parallel by the action selection mechanism, whereas a stochastic policy expects only a single action to be selected at one moment in time.
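To make the three kinds of policy concrete, the following Python sketch shows one possible action selection routine for each. The action names and the membership threshold used for the fuzzy case are our own illustrative assumptions; the text above does not prescribe how a fuzzy policy decides which actions to trigger.

```python
import random
from typing import Dict


def select_probabilistic(distribution: Dict[str, float], rng: random.Random) -> str:
    """Probabilistic policy: draw exactly one action at random according to
    the probability distribution over the action database."""
    actions = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def select_deterministic(recommended_action: str) -> str:
    """Deterministic policy: only a single action is recommended, so
    selection simply passes it through."""
    return recommended_action


def select_fuzzy(memberships: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
    """Fuzzy policy: trigger every action whose degree of membership reaches
    an (assumed) threshold. The membership value is returned with each action
    so that it can be used to initialize that action's parameters."""
    return {a: m for a, m in memberships.items() if m >= threshold}


rng = random.Random(0)
print(select_probabilistic({"ad_1": 0.5, "ad_2": 0.3, "ad_3": 0.2}, rng))
print(select_deterministic("ad_2"))
# Memberships need not sum to 1, and several actions may fire in parallel.
print(select_fuzzy({"smile": 0.9, "wave": 0.6, "frown": 0.1}))
```

Note that the fuzzy routine returns a set of actions rather than a single action, which is the distinction drawn above between fuzzy and stochastic policies.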
4. A Policy is a Multi-valued "Recommendation," a Value Function is a "Ranking"
Closely related to the notion of "policy" is the "value function." Rather than a probabilistic distribution over the action database, a value function assigns a numerical weight to each action. A policy formulation mechanism then converts this value function into a policy. What we define as a "fuzzy policy" suffices for representing value functions. Therefore, we can manipulate value functions by treating them as Fuzzy Policies.
Technology for converting a single value function into a policy is standard fare in the prior art cited here. However, the prior art does not address the combination of multiple value functions (see FIG. 1G), the simultaneous collapse of multiple value functions into a single stochastic policy (see FIG. 1I), or the convergence of multiple stochastic policies in order to obtain a new value function (see FIG. 1H).
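As an illustration of converting a single value function into a policy, the sketch below shows two conversions that are standard in the reinforcement learning literature: greedy selection and softmax (Boltzmann) weighting. The choice of these two conversions, the temperature parameter, and the action names are our assumptions for the purpose of illustration; the text above does not commit to a particular conversion.

```python
import math
from typing import Dict


def greedy_policy(values: Dict[str, float]) -> Dict[str, float]:
    """Deterministic conversion: all probability goes to the top-ranked action."""
    best = max(values, key=values.get)
    return {a: 1.0 if a == best else 0.0 for a in values}


def softmax_policy(values: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Stochastic conversion: probabilities proportional to exp(value / temperature).
    The temperature must be positive. The maximum value is subtracted before
    exponentiating for numerical stability; this does not change the result."""
    top = max(values.values())
    exps = {a: math.exp((v - top) / temperature) for a, v in values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


values = {"recommend_book": 2.0, "recommend_film": 1.0, "show_banner": 0.0}
print(greedy_policy(values))
print(softmax_policy(values, temperature=1.0))
```

A lower temperature moves the softmax result toward the greedy policy, while a higher temperature moves it toward a uniform distribution over the actions.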
5. General Applicability and Specific Practical Advantages
This invention is generally compatible with the technologies of reinforcement learning, stochastic control, and fuzzy control. Therefore it has broad scope because of the broad scope of these technologies. These wide-ranging technologies can be used to leverage this invention in a wide variety of ways. Despite the wide-ranging theoretical applicability of these technologies, they have limits in certain practical applications. The next section homes in on those limitations that are relevant to this invention.
B. Brief Overview of Prior Art
For comprehensive survey or tutorial treatment see [Kaelbling Littman and Moore 1996] or [Sutton and Barto 1998]. We proceed directly to discussing the currently most advanced technology upon which this invention serves to improve.
One of the key constraints upon efficient execution of stochastic control is the computational complexity of the policy information. For background see especially the discussion on compact mappings in the tutorial references [Kaelbling Littman and Moore 1996] and [Sutton and Barto 1998]. However, compact mappings do not completely alleviate the computational cost of learning and executing complex policies. Although a compact map does provide size and speed advantages over a method relying upon less compact data structures, even this approach will rapidly be overwhelmed by the complexity of common practical tasks. Additional efficiencies can be gained by breaking down a policy into modular sub-components. The "gated policy" approach splits the policy into a set of sub-policies and uses a gating mechanism to select from among the sub-policies. This approach has numerous variations and encapsulates numerous complexities that are not exhaustively described here; however, a simple high-level illustration of the essential features of the general approach relevant to this invention is depicted in FIG. 1D.
Given a stimulus s, this "gated policy" mechanism selects the sub-policy appropriate for the stimulus at hand, passing that policy through to an action selection module, which then executes that sub-policy upon the given stimulus. "Stimulus" as defined here is quite general, encompassing external sensory stimuli as well as state space accessed within internal memory.
The gated policy approach can make executing or learning stochastic policy information more efficient. It streamlines the acquisition of policy, say, by computerized discovery, exhaustive search, reinforcement learning, or iterative evolution. This is because sub-policies may be more easily obtained individually than can a single monolithic policy. It also streamlines the subsequent refinement of a complex control policy by allowing "learning" to occur hierarchically at multiple levels of description. (Note that while FIG. 1D depicts a single level of sub-policies, the method can be applied to each of the sub-policies to generate an additional level in the hierarchy, and this decomposition can be applied repeatedly to obtain a hierarchy with multiple levels.) The modular policy approach also streamlines the execution of policy, because multiple simpler sub-policies can replace a complex monolithic policy. It also allows policies stored in different data structures to be combined (e.g., compact maps, database tables, decision trees, procedural code). Therefore, this general approach of "divide-and-conquer" has numerous valuable benefits. Methods that can make efficient use of modular policies have several practical advantages over methods that wield a monolithic policy.
C. Formal Definition of Prior Art
Here we formalize the concepts introduced above.
Current implementations of process controllers typically employ a single method for defining policy (e.g., rules-based, or statistical, but not both). Current technologies based upon a purely rules-based approach can require a large number of rules that take up much space and are costly to evaluate in real-time. Current applications of machine learning and datamining embedded in commercially available process controllers are good for operating on some types of data but limited on others. (E.g., a web-based personalization server based upon collaborative filtering is good for inferring preference based upon on-site browsing behavior but may be much less useful for deducing preference from an explicit profile provided via questionnaire.) Also, machine learning methods are great for learning from example, but are also largely limited to learning from example; users often need more direct control of the process controller, e.g., by encoding certain rules of behavior explicitly. Therefore, different tasks call for different control strategies, and different control strategies call for different data structures storing policy information, and different strategies for obtaining or refining that policy information.
Even though there are numerous types of data structures for encoding policy information, these types can be unified within a single general framework using concepts from reinforcement learning. The reinforcement learning terminology we employ here equates "agent" with "process controller" or "process server," so we will refer to an "agent" henceforth instead of "controller." The concept of "agent" is also more general than the term "controller," and is more appropriate for the computational server applications being emphasized here.
Consider an "agent" located in an environment. The agent's "environmental state," or "stimulus," is a (possibly highly processed) version of the environment "external" to the agent. Therefore, whereas (in typical usage of the term) a "controller" reacts to sensory information directly or subsequent to some numerical processing, an "agent" can react to highly processed information. The agent's external sensors and internal state memory define this stimulus state, which we model as a $d$-dimensional real-valued vector space:
$S \subset \mathbb{R}^d,\ d \in \mathbb{Z}^+$.
This state could, for example, be the onsite behavior of a website shopper, such as shopping basket contents or page view sequence. Or it could be based upon statistics inferred from historical memory of past purchases by that shopper. In this example the candidate actions each could select a single product recommendation from among a large set of available products, or sort a list of product recommendations in a particular way, or display a link to a particular page. Alternatively, this state could be the stimulus experienced by a robotic toy doll, and the candidate actions each select an appropriate facial expression and body pose in reaction to that stimulus.
For simplicity, we will take the set of available actions to be a discrete set of r actions for some integer r:
$A = \{a_1, a_2, \ldots, a_r\}$.
Each action $a \in A$ is a pointer into a database of $r$ procedural routines. $A(s) \subset A$ gives the actions available while in state $s \in S$.
Continuous action spaces are useful for some applications, but are not necessary to illustrate the main concepts being described here. For clarity we introduce the main concepts using discrete action spaces. It is straightforward to extend these concepts to continuous action spaces, and the mechanisms for doing so are rather obvious to the informed technologist by drawing upon references such as [Kaelbling Littman and Moore 1996] or [Sutton and Barto 1998] for guidance.
Consider a sequence of stimuli $s_1, s_2, s_3, \ldots$ For each $t = 1, 2, \ldots$, a "policy" $\pi$ applies a linear order to the set of actions available for responding to stimulus $s_t$. Above we briefly mentioned the distinction between a value function and a policy; the tutorial texts referenced above describe this distinction very clearly. Ultimately the value function must be converted to a policy when applied to action selection, and so controllers based upon the modular policy approach commonly apply the modularity within "policy space" rather than in "value function space." However, one embodiment of this invention (described in the specification and claims provided below) is suitable for combining policy information in "value function space." For clarity in explanation and simpler notation we confine our description to "policy space." Upon recognizing the drawbacks of prior art and the specific advantages of this invention, a reasonably capable expert can easily extend this method to apply to value functions without requiring any insights that are not obvious from reading this document or from the prior art cited here.
Intuitively, a policy can be said to model a set of "behaviors," or "action tendencies." A policy can be deterministic (say, choose the highest ranked action as indicated by a value function) or stochastic (i.e., select one of the actions probabilistically). A stochastic policy implements the mapping:
$\pi : S \times A \to [0,1]$,
for state $s_t \in S$ at time $t$, choosing action $a_t$ with probability
$\pi_t(s,a) = P[a_t = a \mid s_t = s]$.
A static stochastic policy is one where no adaptation occurs over time, such that $\pi_t(s,a) = \pi(s,a)$ for $t = 1, 2, \ldots$ First, we consider a policy that is not modified by learning over previous actions during the lifetime of the agent. For stimulus state $s \in S$ at time $t$ and action $a \in A$, static stochastic policy $\pi_t$ sets the probability with which action $a$ is chosen to
$P[a_t = a \mid s_t = s] = \pi_t(s,a)$.
Note that stochastic control subsumes deterministic control; therefore, this type of policy can implement deterministic behaviors (e.g., via simple rules or procedural script). A number of ways exist to compose an action selection rule from a policy which we omit here for brevity (case studies are provided in [Sutton and Barto, 1998] and [Kaelbling, Littman, and Moore, 1996]). Additional details for converting policy to action selection and for learning or evolving policy are omitted because the essentials of this patent are focused mainly within policy formulation and combination and these details are easily obtained from the references to prior art cited here. Intuitively, a policy ranks the list of candidate actions from which action selection thereby selects a single action function according to that ranking.
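A static stochastic policy over a discrete action set can be pictured as a lookup table. The Python sketch below is a minimal illustration, with hypothetical state and action names, of how such a table maps each (state, action) pair to a probability and how a deterministic rule appears as the special case that places probability 1 on a single action.

```python
import random
from typing import Dict

# Hypothetical static policy table: one row of action probabilities per state.
STATIC_POLICY: Dict[str, Dict[str, float]] = {
    "new_visitor": {"show_bestsellers": 0.6, "show_discount": 0.3, "show_survey": 0.1},
    # A deterministic rule is a row with probability 1 on one action.
    "checkout": {"show_bestsellers": 0.0, "show_discount": 1.0, "show_survey": 0.0},
}


def pi(s: str, a: str) -> float:
    """Probability of choosing action a in state s; identical for every time step."""
    return STATIC_POLICY[s][a]


def is_valid_policy(policy: Dict[str, Dict[str, float]], tol: float = 1e-9) -> bool:
    """Check that every entry lies in [0, 1] and every row sums to 1."""
    return all(
        all(0.0 <= p <= 1.0 for p in row.values())
        and abs(sum(row.values()) - 1.0) <= tol
        for row in policy.values()
    )


def sample_action(s: str, rng: random.Random) -> str:
    """One simple action selection rule: draw one action according to the row for s."""
    row = STATIC_POLICY[s]
    actions = list(row)
    return rng.choices(actions, weights=[row[a] for a in actions], k=1)[0]


rng = random.Random(1)
print(is_valid_policy(STATIC_POLICY))
print(sample_action("new_visitor", rng))
print(sample_action("checkout", rng))
```

Because the "checkout" row is deterministic, sampling from it always yields the same action, whereas the "new_visitor" row yields different actions on different draws.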
Fuzzy controllers as defined here can trigger multiple actions in parallel. Also, because a "fuzzy policy" as defined here is a non-probabilistic distribution, a fuzzy policy formally subsumes stochastic policy. But we describe the prior art involving the triggering of single actions under the stochastic framework for several reasons. (a) It is often quite straightforward to reduce the simultaneous triggering of multiple actions into the framework of single actions. (b) Stochastic control is more familiar to experts and practitioners of intelligent control technologies. (c) It is easier to describe the general mechanism by considering the special case of stochastic control than if we attempt to retain full generality throughout the entire discussion. Upon recognizing the drawbacks of prior art and the specific advantages of this invention, a reasonably capable expert can easily extend this method to apply to fuzzy policy without requiring any insights that are not obvious from reading this document or from the prior art cited here.
The notation used to denote policy thus far does not admit real-time learning. Reinforcement learning allows a policy to depend upon (i.e., be conditioned on) previous events experienced by the agent. Therefore, we have a dynamic stochastic policy $\pi^c_t$ that for state $s \in S$ chooses action $a$ with probability
$P[a_t = a \mid s_t = s] \equiv \pi^{c,k}_t(s, a, a_{t,k}, s_{t,k})$,
where now policy execution over state space (the current action ranking) is a function of the $k$ previous actions and stimuli:
$\pi^{c,k}_t(\cdot,\cdot) = f(a_{t,k}, s_{t,k})$,
where $a_{t,k}$ and $s_{t,k}$ are the historical sequences of the $k$ previous actions and states respectively, such that $a_{t,k} = a_{t-k}, a_{t-k+1}, \ldots, a_{t-1}$ and $s_{t,k} = s_{t-k}, s_{t-k+1}, \ldots, s_{t-1}$. For simplicity in what follows we'll let $k = t$ (indefinite memory), denote $\mathbf{a}_t = a_{t,t}$ and $\mathbf{s}_t = s_{t,t}$, and refer to $\pi^c_t$ instead of $\pi^{c,t}_t$. Where confusion will not arise we may abuse notation slightly and use $\pi^c_t(s, a)$ rather than $\pi^{c,t}_t(s, a, \mathbf{a}_t, \mathbf{s}_t)$, so long as it is clear that the computation of $\pi^c_t$ depends upon previous states and actions, whereas it does not for $\pi$ and $\pi_t$. Reinforcement learning and supervised learning theories each provide several mechanisms entirely suitable for computing $f$ (and thereby, $\pi^c_t$). For a survey of these mechanisms see [Sutton and Barto, 1998] and [Kaelbling, Littman, and Moore, 1996].
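To illustrate what it means for a policy to be a function of the previous actions and states, the sketch below builds a policy from a history using a deliberately simple rule: smoothed counts of how often each action was taken in each state. This rule is purely our own toy stand-in for the function f; it is not one of the learning mechanisms surveyed in the cited references, which would normally also make use of reinforcement signals.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence


def f(
    action_history: Sequence[str],
    state_history: Sequence[str],
    actions: List[str],
    k: Optional[int] = None,
) -> Callable[[str, str], float]:
    """Toy history-conditioned policy (assumed rule, for illustration only).

    Uses the k most recent (state, action) pairs, or the whole history when
    k is None (indefinite memory), and returns a function pi(s, a) giving
    add-one smoothed frequencies of action a in state s."""
    pairs = list(zip(state_history, action_history))
    if k is not None:
        pairs = pairs[-k:] if k > 0 else []
    counts = Counter(pairs)

    def pi(s: str, a: str) -> float:
        seen_in_s = sum(counts[(s, b)] for b in actions)
        return (counts[(s, a)] + 1) / (seen_in_s + len(actions))

    return pi


actions = ["x", "y"]
pi_full = f(["x", "x", "y"], ["s1", "s1", "s1"], actions)        # k = t
pi_short = f(["x", "x", "y"], ["s1", "s1", "s1"], actions, k=1)  # last step only
print(pi_full("s1", "x"), pi_short("s1", "x"))
```

The two policies disagree on the same state and action because they are conditioned on histories of different length, which is exactly the dependence that the static notation cannot express.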
Different ways of encoding policy are useful for different purposes. A static policy is useful for encoding simple rules (say, describing expert intuition). A dynamic policy acquired in real-time via statistical learning is good for tracking user behavior via passive observation. In theory, we can easily combine these into a single policy. But in practice, there are good reasons to keep each type of policy separate. One reason is computational efficiency. Simple rules can be efficiently coded as a look-up table. On the other hand, a functional form that is efficient for a simple policy $\pi_t$ (say, requiring only a small table of rules) will in general be inefficient for a complex policy $\pi^c_t$ (for which a compact map will be necessary in general to reduce space requirements). Another reason is modularity. Functional cohesiveness applied to policy improves ease of maintenance.
Conditioned policy obtained by reinforcement learning can be improved further. E.g., it does not yet permit the explicit modeling of particular types of conditioned response that localize certain types of conditioning to particular regions of stimulus space. Both of these issues benefit from a straightforward extension known as a gated policy, as shown in FIG. 1D. For a survey of such methods see [Kaelbling, Littman, and Moore, 1996]. A gating function decides which policy should be switched through and actually executed based on the stimulus state.
The "gated behaviors" approach includes a wide variety of methods, from single-level master-slave to hierarchical-level "feudal Q-learning" [Dayan and Hinton, 1993]. In Maes and Brooks [1990] the policies were fixed and the gating function was learned from reinforcement. [Mahadevan and Connell 1991] fixed the gating function and trained the policies by reinforcement. [Lin 1993], [Dorigo and Colombetti 1994], and [Dorigo 1995] trained the policies first and then trained the gating function. Dietterich and Flann explored hierarchical learning of policy [Dietterich 1997], [Dietterich and Flann 1997]. Whereas these prior art references concentrate on learning the modular sub-policy information, this invention provides a means for combining it in a better way, while still allowing these methods for learning the sub-policy information to be applicable.
Now we formalize the gated policy approach. This will be useful for clearly defining the novel features of this invention when we formalize its essential features in the specification of the main embodiment below. Let $\pi^c_t$ be a gated policy over a single level of $v$ sub-policies $(\pi^{c,1}_t, \pi^{c,2}_t, \ldots, \pi^{c,v}_t)$, with gating function $g : S \to \{1, 2, \ldots, v\}$, which chooses the policy appropriate for the given stimulus state. As with the policies previously defined above, this policy sets the probabilities associated with action tendencies:
$P[a_t = a \mid s_t = s] \equiv \pi^c_t(s,a),\ s \in S,\ a \in A$.
If $\pi^c_t$ is to be obtained by a gated selection from a (nonhierarchical) set of sub-policies, then
$\pi^c_t(s,a) = \sum_{1 \le i \le v} \left[ \pi^{c,i}_t(s,a)\, I_i(g(s)) \right]$,
where for any integer $j$, $I_i(j)$ is an indicator function that is equal to 1 when $j = i$, and 0 otherwise. Note that although this equation involves a summation, it essentially describes a "switch" that enables one and only one sub-policy. The indicator function $I_i(g(s))$ serves as the "switch." The corresponding action selection is then drawn accordingly, e.g., by random draw from the actions database according to the action probabilities specified by the selected sub-policy. This invention improves upon the gated approach by replacing the indicator function $I_i(g(s))$ with a weighting function.
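The gated policy formula can be written out directly in Python. The sketch below implements the summation over sub-policies with an indicator function acting as the switch. The two sub-policies, the action names, and the gating rule on a one-dimensional stimulus are hypothetical and serve only to exercise the formula.

```python
from typing import Callable, Sequence

SubPolicy = Callable[[float, str], float]  # pi_i(s, a) -> probability


def indicator(i: int, j: int) -> int:
    """I_i(j): equal to 1 when j == i, and 0 otherwise."""
    return 1 if j == i else 0


def make_gated_policy(
    sub_policies: Sequence[SubPolicy], gate: Callable[[float], int]
) -> Callable[[float, str], float]:
    """Build pi(s, a) = sum over i of pi_i(s, a) * I_i(g(s)).

    Sub-policies are numbered from 1 to v to match the gating function's range."""

    def pi(s: float, a: str) -> float:
        return sum(
            p(s, a) * indicator(i, gate(s))
            for i, p in enumerate(sub_policies, start=1)
        )

    return pi


def cautious(s: float, a: str) -> float:
    return {"wait": 0.9, "act": 0.1}[a]


def eager(s: float, a: str) -> float:
    return {"wait": 0.2, "act": 0.8}[a]


def gate(s: float) -> int:
    """Hypothetical gating function: sub-policy 1 for negative stimuli, else 2."""
    return 1 if s < 0.0 else 2


pi = make_gated_policy([cautious, eager], gate)
print(pi(-1.0, "wait"), pi(3.0, "wait"))
```

For every stimulus exactly one term of the sum is nonzero, so the result always coincides with the single sub-policy chosen by the gate. This makes the crisp, switch-like character of the gated approach explicit.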
To summarize, gated policy methods exemplify the prior art that is improved upon by this invention. Closely related methods are also referred to using terms such as "hierarchical learning," "layered control," and "modular policies." Gated policy methods can compartmentalize learning and response based upon the input state, and can also allow learning to occur at different levels of analysis. In principle, this could be achieved equally well by a monolithic (i.e., non-modular) system, albeit possibly at much greater computational cost in practical application. I.e., this type of modular policy reduces to a single policy, albeit one obtained by piecemeal composition of sub-policies over state space. Said again in different terms, the sub-policies do not overlap in input space. This constraint is enforced upon all gated policy methods, either explicitly (in that policies respond to mutually distinct portions of the input space) or implicitly (because of the effects of the gating mechanism, policies effectively respond to mutually distinct portions of the input space).
D. Drawbacks of Prior Art
The gated policy approach possesses inherent constraints that limit its use. The gated policy approach does not allow multiple overlapping policies to be combined in order to act upon the stimulus in concert. The gated policy approach instead selects a single sub-policy by a crisp selection. There exist practical applications for which overlapping sub-policies are very useful. Another drawback of the gated policy approach is that it can only select from among available policies; it cannot combine them to obtain a compositional policy that is better suited than any of the available policies are individually.
This invention allows multiple overlapping policies to be combined, and this is the central innovation of this patent. Rather than use a crisp selection, this invention employs a "soft" mixture of policies.
Another drawback of the prior art is that the gating mechanism cannot smoothly transition from one policy to another. The switching mechanism is crisp. If the mechanism switches from one policy to another that is markedly different, the resulting change in the behavior will in general be markedly different as well. There are many applications where it is highly desirable to switch from one control regime to another in a smooth fashion.
This invention allows a controller to effect a smooth transition from one policy to another over time.
E. Example Application Illustrating Drawbacks of Prior Art
Here is a description of a practical application intended to highlight specific drawbacks of the prior art.
An electronic commerce website currently utilizes several servers. Each server controls how resources are to be presented to the online shopper. Resources can include product descriptions, suggested product recommendations, or product pricing information. Each server wields a policy that dictates the probability of presentation over the same set of resources. An executive procedure uses this policy to guide how these resources are displayed. But each server uses a somewhat different type of information to formulate its policy. Several such servers are required because each one is especially well-suited for handling particular types of information. One server observes on-site behavior (e.g., pages viewed, browsing behavior). Another server is aware of the user's past purchase transaction history. Another server is able to make recommendations based upon an explicit user profile. Each server is a back-end process controller capable of controlling various front-end processes, such as displaying ads, selecting the presentation of content, or making product recommendations. Conceptually, there is really just a single source of information: i.e., the shopper's behavior. In the space defined by shopper behavior, the input spaces of these servers "overlap." But because different data structures are used to record shopper behavior, each server seems to operate on a different type of information. Therefore, at the most important level, that of serving the shopper, these servers are wielding overlapping policies. (This example is kept simple for clarity. However, it can be modified slightly to illustrate the practical reality that such servers will often overlap much more explicitly. For example, a shopper answering a questionnaire can result in new information being shunted to both the "on-site browsing behavior" and the "user questionnaire" data structures.)
To reiterate, this example has three servers, each one responding to a different type of information source:
1. on-site browsing behavior
2. explicit user profile or questionnaire
3. past purchase history
In this example, all three servers are necessary because no single server can do the entire job effectively. How can the operation of these servers be seamlessly integrated in order to leverage the best attributes of each one?
Suppose only one type of information is available for the visitor (say, there is on-site behavior, but neither explicit user profile nor past purchase history). In this case it is easy to solve the problem at hand: simply select the server that responds to on-site behavior. However, if two types of information are available (say, on-site behavior, and explicit user profile) then the situation is made more complex. Given the prior art the options become:
1. select one server or the other
2. obtain a new server that can utilize both sources of information
An additional option would be desirable. If the webmaster could combine the two existing servers together to utilize them in concert than the task would be handled more effectively. Conceptually this reduces to combining two (possibly overlapping) process control mechanisms.
One benefit of a seamless combination of the two existing servers would be to smoothly transition from one server to another. A first-time shopper will quickly generate on-site browsing behavior but won""t have past purchase history and may not wish to fill out an explicit user profile. This makes the first server appropriate, and the other two servers completely useless. However, once the shopper generates some purchases, the third server becomes useful. But rather than simply switching over to the third server in a radical fashion as soon as past purchase history becomes available, this invention provides a means to migrate smoothly from one server to the other. The gated policy mechanism is incapable of performing this smooth transition.
Furthermore, a policy obtained by combining the three servers can make best use of each server, using them in concert rather than relying on only one or the other. In some cases, the "on-site browsing behavior" server will provide the best information. In others, the "explicit user profile" server will be most effective. But in yet others, no one server will be most effective; rather, a combination of their policies will yield a recommendation that is better than that of any one server individually. While the gated policy mechanism is highly capable of making best use of its individual sub-policies, it is incapable of mixing multiple policies together.
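As an illustration only, the combination of sub-policies described above can be sketched as a weighted mixture of probability distributions over candidate actions. The dictionary representation, the function name mix_policies, the action names, and the weight values are all hypothetical and are not taken from this specification; the sketch assumes each sub-policy is a discrete distribution and that the relative contributions are supplied externally.

```python
from typing import Dict

Policy = Dict[str, float]  # hypothetical representation: action -> probability


def mix_policies(policies: Dict[str, Policy], weights: Dict[str, float]) -> Policy:
    """Convex combination of sub-policies.

    Weights are renormalized over the servers that actually supplied a
    policy, so a server with no usable information simply drops out.
    """
    total = sum(weights[name] for name in policies)
    if total <= 0:
        raise ValueError("at least one sub-policy needs a positive weight")
    mixed: Policy = {}
    for name, policy in policies.items():
        share = weights[name] / total
        for action, prob in policy.items():
            mixed[action] = mixed.get(action, 0.0) + share * prob
    return mixed


# Hypothetical sub-policies from two of the three servers; the visitor has
# no purchase history yet, so the third server contributes nothing.
browsing = {"sale_banner": 0.7, "new_arrivals": 0.3}
profile = {"new_arrivals": 0.5, "gift_guide": 0.5}

mixed = mix_policies(
    {"browsing": browsing, "profile": profile},
    {"browsing": 0.6, "profile": 0.4},
)
```

Under these assumed numbers the mixed policy spreads probability over all three actions, including one ("gift_guide") that the browsing server alone would never recommend.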
A. Brief Overview of Novel Features of the Invention
This invention allows process controllers to utilize overlapping policies. See FIG. 2 for a conceptual overview of the general mechanism. Overlapping policies occur when multiple policies
(a) can respond effectively to the same stimulus while mapping to the same or different policy space, or
(b) map to the same policy space while responding to stimuli that are different but which occur simultaneously, such as controllers that react to different sources of information.
There is good reason for using overlapping policies. Doing so allows a process controller to wield multiple utilities. Different utilities can be used under different circumstances, and the process controller can then wield a "mixture" of utilities. Intuitively, the process controller is able to smoothly apply a multitude of motivational tendencies upon action selection. An immediate consequence is that the process controller can combine controllers that operate on different sources of information. As pointed out by [Sutton 1992] and [Brafman, Tennenholtz, 1996], rational agents are either (a) maximizers of expected utility or (b) reinforcement learners. Process server tasks (e.g., website personalization) naturally admit multiple "utilities" (respectively, "types of preferences"). These utilities correspond to having multiple objective criteria to be optimized by the controller (respectively, multiple mental states of the user, e.g., attitude, mood, objective, task; or multiple resources being quantified by the server, e.g., dollars spent, units of product sold, number of page views browsed). Or they can (say) correspond to different ways of measuring a single criterion (e.g., "user preference" can be measured in multiple ways, such as by first-person subjective opinion via questionnaire, by passive observation of actual tendencies, or by comparison to other similar people via collaborative filtering).
The canonical gated policy approach defined above is lacking in several ways:
(1) It has no explicit representation of multiple sources of overlapping policy information.
(2) It has no capacity for smoothly integrating multiple policies.
(3) It has no means for smoothly shifting control from one policy to another.
These limitations are resolved by this invention.
This invention extends modular stochastic control to allow simultaneous application of more than one policy to any particular stimulus (i.e., "overlapping policies"). This exact framework is novel; however, it is similar in spirit and analogous in approach to the Mixtures of Controllers approach [Cacciatore and Nowlan 1994], which is an extension of the well-known Mixture of Experts approach [Nowlan 1990], [Jacobs et al 1991]. One embodiment of the mixture mechanism is a recurrent mechanism analogous to the mixture mechanism used in the mixture of controllers method, but with additional features that allow it to apply to a mixture of policies. These features handle additional complexities that arise when combining policy information and that are not an issue when combining either (a) single control signals or (b) single recommendations.
The Mixture of Controllers approach combines the control signals produced by multiple controllers that regulate the same control element. Each sub-controller submits a single control signal to the mixture mechanism, which combines these into a single control signal that is passed on to the controlled element. In that approach the combination is done on each individual control signal (which, in the terminology adopted here, corresponds to the control of an individual action), whereas this invention combines entire policies before the control signal (or alternatively, recommendation) is generated.
Recall that a policy corresponds to an entire set of actions. A mixture of policies is more useful for certain practical applications because it is directly applicable for stochastic selection from a database of discrete actions instead of regulating a continuous control signal. For example, this invention is more directly applicable to website personalization tasks than is the Mixture of Controllers approach. Also, this invention separates "policy" from its "execution," whereas the Mixture of Controllers approach does not.
From computer science in general and operating systems in particular it is well understood that this basic encapsulation principle has many advantages, analogous to the way the U.S. government separates the formulation of policy from its execution by separating the legislative branch from the executive branch. In addition, this invention provides an additional mechanism for encapsulating "conflict detection," analogous to the judicial branch of the U.S. government. This conflict detection mechanism preemptively detects when a policy will generate conflicts during execution, and also resolves those conflicts.
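The separation of policy from execution can be illustrated with a minimal sketch in which the policy is plain data and the executive is a separate, interchangeable function. The names greedy_executive and stochastic_executive, the action names, and the dictionary representation are hypothetical; the two executives shown are only two of many possible action selection procedures.

```python
import random
from typing import Dict

Policy = Dict[str, float]  # hypothetical representation: action -> probability


def greedy_executive(policy: Policy) -> str:
    """Always execute the highest-ranked action."""
    return max(policy, key=lambda action: policy[action])


def stochastic_executive(policy: Policy, rng: random.Random) -> str:
    """Sample one action in proportion to its recommended probability."""
    actions = list(policy)
    weights = [policy[action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# The same policy data can be handed to either executive unchanged.
policy = {"show_ad_a": 0.6, "show_ad_b": 0.3, "show_ad_c": 0.1}
greedy_choice = greedy_executive(policy)
sampled_choice = stochastic_executive(policy, random.Random(0))
```

Because the policy is only data, a mixing step can be placed between policy formulation and either executive without changing the executive itself.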
The Mixture of Experts approach is prior art that effectively combines multiple policies; however, the Mixture of Experts approach operates in "recommendation space." This broad class of methods includes (a) voting mechanisms, and (b) weighted averaging mechanisms, where several "experts" make a recommendation, and the several recommendations are consolidated (by voting or by weighted average, respectively). This invention differs in that the consolidation of expert "opinions" occurs in policy space rather than in the recommendation space.
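The distinction between the two spaces can be made concrete with a small sketch. Here "recommendation space" is modeled as a vote over each expert's single top-ranked action, and "policy space" as an equal-weight mixture of the experts' full distributions. Both models, the function names, and the numbers are illustrative assumptions rather than the mechanisms of the cited references or of this invention.

```python
from collections import Counter
from typing import Dict, List

Policy = Dict[str, float]  # hypothetical representation: action -> probability


def top_action(policy: Policy) -> str:
    """The single recommendation an expert would submit."""
    return max(policy, key=lambda action: policy[action])


def vote(policies: List[Policy]) -> str:
    """Recommendation space: each expert submits only its top action."""
    ballots = Counter(top_action(p) for p in policies)
    return ballots.most_common(1)[0][0]


def mix(policies: List[Policy], weights: List[float]) -> Policy:
    """Policy space: combine the full distributions before choosing."""
    mixed: Policy = {}
    for policy, weight in zip(policies, weights):
        for action, prob in policy.items():
            mixed[action] = mixed.get(action, 0.0) + weight * prob
    return mixed


expert_1 = {"shoes": 0.40, "socks": 0.35, "hats": 0.25}
expert_2 = {"shoes": 0.25, "socks": 0.35, "hats": 0.40}

voted = vote([expert_1, expert_2])           # a tie between "shoes" and "hats"
mixed = mix([expert_1, expert_2], [0.5, 0.5])
policy_space_choice = top_action(mixed)      # the shared second choice wins
```

In this example the vote can only return an action that some expert ranked first, whereas the mixture surfaces the action both experts ranked second, which neither would have submitted as a single recommendation.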
The ability to manipulate and combine fuzzy policies has additional advantages in that it allows multiple value functions to be manipulated and combined. Technology for converting a single value function into a policy is standard fare in prior art cited here. However, prior art does not address the combination of multiple value functions (see FIG. 1G) or the simultaneous collapse of multiple value functions into a single stochastic policy (see FIG. 1I), or the convergence of multiple stochastic policies in order to obtain a new value function (see FIG. 1H).
The mixing function also has a temporal component for regulating the speed of transition of policy over time. See FIG. 1J.
Although we describe the main embodiment of this invention with respect to computer-based server applications involving multiple process servers wielding discrete policies, another embodiment of the invention applies to combining multiple continuous policies such as those found in some electronic controllers.
B. Practical Advantages of the Invention
Here we highlight the practical benefits of the novel features. Although the illustrative examples described here focus on computer-based database server applications, this method is applicable to process control in general, including electronic process controllers.
Combining Policies in "Policy Space"
Combining multiple policies in "policy space" rather than in "recommendation space" delivers additional flexibility over the prior art mentioned above. For example, when mixing a probabilistic policy with a deterministic policy (having all probability concentrated on a single action), the mixture mechanism can let the deterministic policy always dominate the probabilistic policy (see FIG. 1E). In some applications this is the preferred result. This reduces to a crisp selection of the deterministic policy and can be performed adequately by the prior art cited here. The Mixture of Policies approach allows this effect, but it also allows the alternative option of letting the probabilistic policy "soften" the deterministic policy (see FIG. 1F). There are applications for which this is the preferred result. The prior art cited here does not allow this result.
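The two outcomes described above (dominance versus softening) can be sketched as two modes of one mixing routine. The function mix_two, its deterministic_dominates flag, the action names, and the choice of a convex combination for the "softened" case are assumptions made for illustration; the specification does not prescribe this particular form.

```python
from typing import Dict

Policy = Dict[str, float]  # hypothetical representation: action -> probability


def is_deterministic(policy: Policy) -> bool:
    """True when all probability is concentrated on a single action."""
    return any(prob == 1.0 for prob in policy.values())


def mix_two(p: Policy, q: Policy, weight_p: float,
            deterministic_dominates: bool) -> Policy:
    """Mix two policies, optionally letting a deterministic one win outright."""
    if deterministic_dominates:
        for candidate in (p, q):
            if is_deterministic(candidate):
                return dict(candidate)  # crisp selection
    actions = set(p) | set(q)
    return {
        a: weight_p * p.get(a, 0.0) + (1.0 - weight_p) * q.get(a, 0.0)
        for a in actions
    }


deterministic = {"checkout": 1.0, "browse": 0.0, "help": 0.0}
probabilistic = {"checkout": 0.2, "browse": 0.5, "help": 0.3}

dominated = mix_two(deterministic, probabilistic, 0.5, deterministic_dominates=True)
softened = mix_two(deterministic, probabilistic, 0.5, deterministic_dominates=False)
```

With the assumed equal weighting, the softened result still favors "checkout" but leaves some probability on the other two actions.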
Easier to Incorporate Conflict Detectors
Combining multiple policies also allows an additional level of separation of policy and execution that is extremely advantageous when combining multiple process servers. FIG. 1G illustrates the combination of two fuzzy policies. Note that as defined here a fuzzy policy can "recommend" that more than one action be triggered simultaneously. An agent that formulates a stochastic policy assumes that the executive will select only a single action. Therefore, conflicting actions can be recommended because the conflict is resolved by selecting only a single action. On the other hand, an agent that recommends a fuzzy policy (as defined here) expects more than one action to be selected (in general). Therefore, any mixture of multiple fuzzy policies must perform an additional check to ensure that no conflicts will arise when triggering multiple actions. This functionality is the responsibility of the mixture mechanism referred to here as the Mixing Function.
The result is a separation of "conflict detection and resolution" from policy formulation and policy execution. This adds another useful level of modularity to policy-based control.
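A minimal sketch of such a check follows. It assumes a fuzzy policy is a mapping from action to a degree in [0, 1], that actions at or above a trigger threshold fire together, that conflicts are supplied as a list of incompatible action pairs, and that a conflict is resolved by suppressing the weaker action. All of these choices, and every name used, are hypothetical and stand in for whatever conflict rules a given application requires.

```python
from typing import Dict, List, Set, Tuple

FuzzyPolicy = Dict[str, float]  # hypothetical: action -> degree in [0, 1]


def mix_fuzzy(p: FuzzyPolicy, q: FuzzyPolicy, weight_p: float) -> FuzzyPolicy:
    """Weighted combination of two fuzzy policies (not renormalized)."""
    actions = set(p) | set(q)
    return {
        a: weight_p * p.get(a, 0.0) + (1.0 - weight_p) * q.get(a, 0.0)
        for a in actions
    }


def resolve_conflicts(policy: FuzzyPolicy,
                      conflicts: List[Tuple[str, str]],
                      threshold: float) -> FuzzyPolicy:
    """Suppress the weaker action of each pair that would fire together."""
    resolved = dict(policy)
    for first, second in conflicts:
        both_fire = (resolved.get(first, 0.0) >= threshold
                     and resolved.get(second, 0.0) >= threshold)
        if both_fire:
            # On a tie the second action of the pair is suppressed.
            loser = second if resolved[first] >= resolved[second] else first
            resolved[loser] = 0.0
    return resolved


def triggered(policy: FuzzyPolicy, threshold: float) -> Set[str]:
    """Actions the executive would trigger in parallel."""
    return {a for a, degree in policy.items() if degree >= threshold}


server_1 = {"autoplay_video": 1.0, "show_banner": 0.6}
server_2 = {"autoplay_audio": 0.8, "autoplay_video": 0.4, "show_banner": 0.8}

mixed = mix_fuzzy(server_1, server_2, 0.5)
resolved = resolve_conflicts(mixed, [("autoplay_video", "autoplay_audio")], 0.3)
fired = triggered(resolved, 0.3)
```

Neither sub-server needs to know about the other's actions; the incompatibility between the two autoplay actions is handled entirely inside the mixing step.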
Combining Policies in Value-Function Space
A website content server may call upon multiple sub-servers that each recommend content for display. One way to combine these recommendations is to simply combine the policy information provided by each sub-server using the technique described above, which combines multiple policies in policy space. However, policy space is not always the best space in which to combine policies. For instance, consider a website that is a portal that "aggregates" content from many other sources. Those sources can consist of search engines, or of content servers located at other websites. A "children-friendly" version of the same content is desired that imposes a zero value upon pornographic content. In this case it is required that the probability of displaying pornographic content is not just negligible; it must be exactly zero. Revaluing all pornographic content to zero value can perform this function. Although prior art such as simple filtering mechanisms can perform this same function, this invention allows filtering mechanisms to be seamlessly incorporated with other process controllers, to be extended to allow "softer" forms of filtering, and to be switched on or off at will. Therefore, while the main practical advantage of this invention is its ability to combine policy-based servers in policy space, there are practical applications in which the combination is best performed in value function space; one embodiment of this invention performs the latter task.
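The filtering example can be sketched as follows, under several assumptions that are not part of this specification: value functions are non-negative mappings from content item to value, sub-server values are combined by a weighted sum, the filter is a multiplicative mask (0.0 blocks an item, 1.0 passes it, and intermediate values give a "softer" filter), and the policy is obtained by normalizing values proportionally. All names and numbers are hypothetical.

```python
from typing import Dict, List

ValueFunction = Dict[str, float]  # hypothetical: content item -> value >= 0


def combine_values(value_functions: List[ValueFunction],
                   weights: List[float]) -> ValueFunction:
    """Weighted sum of sub-server value functions."""
    combined: ValueFunction = {}
    for values, weight in zip(value_functions, weights):
        for item, value in values.items():
            combined[item] = combined.get(item, 0.0) + weight * value
    return combined


def apply_filter(values: ValueFunction, mask: Dict[str, float]) -> ValueFunction:
    """Revalue items; items absent from the mask pass through unchanged."""
    return {item: value * mask.get(item, 1.0) for item, value in values.items()}


def to_policy(values: ValueFunction) -> Dict[str, float]:
    """Proportional normalization, so a zero value maps to zero probability."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("no item has positive value")
    return {item: value / total for item, value in values.items()}


search_engine = {"news": 2.0, "sports": 1.0, "adult": 3.0}
partner_site = {"news": 1.0, "sports": 3.0, "adult": 2.0}

combined = combine_values([search_engine, partner_site], [1.0, 1.0])
child_friendly = apply_filter(combined, {"adult": 0.0})
policy = to_policy(child_friendly)
```

Proportional normalization is chosen here deliberately: an exponential (softmax-style) conversion would assign a small but nonzero probability even to a zero-valued item, which would not meet the "exactly zero" requirement.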
Therefore, because a fuzzy policy can be used to represent value functions, the ability to manipulate and combine fuzzy policies has practical advantages for manipulating value functions.
It allows multiple value functions to be combined and then handed off to an action selection mechanism (such as a process server) that requires its recommendations be provided as a single value function (see FIG. 1G).
It allows multiple value functions to be manipulated and combined in order to synthesize a single coherent policy that satisfies these multiple value-functions simultaneously to some degree (see FIG. 1I).
It allows multiple stochastic policies to be mapped back into value function space (see FIG. 1H) where they can be recombined more easily, more intuitively, or with better quality control (e.g., more safely with respect to ensuring that undesirable content will not be displayed).
Technology for converting a single value function into a policy is standard fare in prior art cited here. However, prior art does not address the combination of multiple value functions (see FIG. 1G) or the simultaneous collapse of multiple value functions into a single stochastic policy (see FIG. 1I), or the convergence of multiple stochastic policies in order to obtain a new value function (see FIG. 1H).
Smooth Transition of Policy Over Time
The policy mixture mechanism has a temporal component for enforcing smooth transition of policies over time. A website server controlling a graphical interface needs to enforce continuity in order to avoid confusing the user. Discontinuity is a definite disadvantage of the prior art for combining multiple process servers. This invention provides the means to ensure that transition from one policy to another is performed seamlessly and smoothly at a rate that can be precisely controlled. FIG. 1J provides a simple example illustrating the essential elements of this transition over time. Although the sub-policies which input to the system remain unchanged over time, the mixing function adjusts the relative contribution of each policy to achieve a smooth transition from one policy to the other. Of course, this illustration is a rudimentary depiction; the time units, time scale, and number and nature of policies encountered in practical application would differ greatly in general.
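One simple way to picture a rate-controlled transition is a mixing weight that moves toward its target by at most a fixed amount per time step while the sub-policies themselves stay fixed. The rate-limited update, the step size, the layout names, and the function names below are assumptions chosen for illustration; they are not the specific temporal mechanism of FIG. 1J.

```python
from typing import Dict, List, Tuple

Policy = Dict[str, float]  # hypothetical representation: action -> probability


def step_weight(current: float, target: float, max_step: float) -> float:
    """Move the mixing weight toward its target by at most max_step."""
    delta = max(-max_step, min(max_step, target - current))
    return current + delta


def blend(old: Policy, new: Policy, weight_new: float) -> Policy:
    """Convex combination; weight_new = 0 gives old, 1 gives new."""
    actions = set(old) | set(new)
    return {
        a: (1.0 - weight_new) * old.get(a, 0.0) + weight_new * new.get(a, 0.0)
        for a in actions
    }


old_policy = {"home_layout": 0.8, "promo_layout": 0.2}
new_policy = {"home_layout": 0.2, "promo_layout": 0.8}

weight = 0.0
trajectory: List[Tuple[float, Policy]] = []
for _ in range(5):
    # The sub-policies never change; only their relative contribution does.
    weight = step_weight(weight, 1.0, 0.25)
    trajectory.append((weight, blend(old_policy, new_policy, weight)))
```

Here max_step plays the role of the controllable transition rate: a smaller value stretches the handover across more time steps.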
Additional Objects and Advantages
Still further objects and advantages will become apparent from a consideration of the ensuing description and accompanying drawings.
The invention provides a method and apparatus for combining a plurality of overlapping policy-based process controllers via a mixture of policies mechanism. The invention is also useful for smoothly transitioning control from one controller to another. The invention is also useful for separating conflict detection and resolution from policy formulation and execution.
Many signal-processing applications used to control or regulate other systems can be treated as "policy-based controllers." In particular, the invention is applicable to policy-based process servers as well as electronic controllers. A "policy-based" controller admits a conceptual decomposition into "policy" and "executive." The policy formulated by a policy-based controller is provided to an executive mechanism that then uses that policy to guide how it executes actions, such as regulating control signals, triggering procedures, or regulating ongoing processes or procedures. The concept of "policy" is quite useful because the task of regulating a policy-based controller reduces to the task of regulating the associated policy and the associated action selection executive.
A "policy" can be used to exert probabilistic control but can also be used for deterministic control. It can also be used for parallel control of multiple control signals, or for triggering multiple processes in parallel. Because "policy-based controllers" can be effectively reduced to their associated policy information, one can combine the controllers by combining their respective policies.
Separating policy from execution facilitates the design and development of flexible controllers. Decomposing a complex policy into sub-policies facilitates the design and development of flexible policies. However, the prior art is limited in its methods for handling sub-policy information. The present invention combines the several policy-based "sub-servers" by combining the "sub-policies" associated with each sub-server into a single policy. The system combines multiple policy-based sub-servers by combining the associated distributional information according to a measure of relative contribution. The system allows (but does not require) temporal smoothing of the policy mixture mechanism. The system provides for detection and resolution of conflicts that will arise as a result of combining otherwise incompatible sub-policies. The preferred embodiment combines the sub-servers by combining the respective sub-policies, but another embodiment combines the sub-servers by combining the respective value functions associated with each sub-server.
A useful characteristic of policy-based controllers is the separation of policy formulation from policy execution. This invention allows another level of modularity by encapsulating the procedures required for detecting and resolving conflicts that arise as a result of combining otherwise incompatible sub-policies.
The invention is suitable for integrating multiple process servers on websites. Examples of website servers include content servers, ad servers, and recommendation engines. Examples of applications for such website servers include but are not limited to personalization systems, content servers for displaying targeted content, electronic commerce product recommendation systems, and ad servers for displaying targeted advertisements. The method and apparatus are also suitable for regulating reactive behaviors in social agents and virtual personality simulations, such as facial expressions, as well as displays of reactive affect in general, such as hand gestures and other nonverbal body language.
In another embodiment, the invention may be implemented to provide a method for combining multiple electronic controllers. Robotic toys and toy dolls exemplify the type of hardware platform that can benefit from the combination of multiple simple controllers, rather than the alternative of creating a more complex monolithic controller. The invention can be used to obtain complex controllers by combining multiple simpler controllers. Another embodiment of the invention can also be used to simplify the design and implementation of monolithic controllers by applying the engineering design discipline strategies of modularization and encapsulation. This allows the designer to more easily scale up to greater complexities. This invention provides methods for doing so which are more flexible than prior art.
Other applications are apparent to anyone familiar with the technology and with the benefit of this specification.