Machine learning may be applied to automatically generate a computer model that is improved through experience. Applications of machine learning range from data mining programs that discover general rules in large data sets, to information filtering systems that automatically learn users' interests.
Typically, a machine learning system collects data over a period of time. In order to preserve resources for online services, the system generates or updates the model offline based upon the collected data. The model may then be applied to generate decisions in various scenarios.
A machine learning system may apply a number of different machine learning algorithms. These algorithms include supervised learning, unsupervised learning, and Reinforcement Learning (RL).
The term Reinforcement Learning may refer to the family of learning mechanisms where an agent learns from the consequences of its actions. More specifically, an agent attempts to optimize a sequence of decisions to maximize the accumulated reward over time, where the reward corresponds to feedback pertaining to goal achievement. This broad definition of Reinforcement Learning encompasses techniques from several fields; standard texts include: “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto, MIT Press (1998), “Dynamic Programming and Optimal Control” by Dimitri P. Bertsekas, Athena Scientific (2007), Approximate Dynamic Programming: Solving the Curses of Dimensionality” by Warren B. Powell, Wiley, (2011) and “Markov Decision Processes: Discrete Stochastic Dynamic Programming” by Martin L. Puterman, Wiley-Blackwell (2005).
In RL, a model may be defined by a value function used to determine a value for a particular state. More particularly, the value of a given state may be defined by the expected future reward which can be accumulated by selecting actions from this particular state and the sequence of subsequent states. Actions may be selected according to a policy, which can also change. The goal of the RL agent is to select actions that maximize the expected cumulative reward of the agent over time.
RL methods can be employed to determine the optimal policy. More particularly, the optimal policy maximizes the total expected reward for all states.