Many computing problems can be modeled as sequential decision problems in which a policy must choose one action from a discrete set of actions at each time step. The reward obtained from the chosen action is random, and the reward statistics of the actions are unknown in advance. These sequential decision problems are called “multi-armed bandit problems,” and the actions are referred to as “arms” selected by “players,” terminology borrowed from slot machines.
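To make the setting described above concrete, the following is a minimal sketch of a bandit interaction loop. It rests on assumptions not taken from the text: rewards are Bernoulli (0 or 1), the player uses a simple epsilon-greedy rule, and the names `BernoulliBandit`, `run_epsilon_greedy`, `means`, `horizon`, and `epsilon` are hypothetical and chosen only for illustration.

```python
import random


class BernoulliBandit:
    """Hypothetical environment: arm a pays 1 with hidden probability means[a]."""

    def __init__(self, means, seed=0):
        self._means = list(means)        # reward statistics, hidden from the player
        self._rng = random.Random(seed)
        self.num_arms = len(self._means)

    def pull(self, arm):
        # The reward is random: 1 with probability means[arm], otherwise 0.
        return 1 if self._rng.random() < self._means[arm] else 0


def run_epsilon_greedy(bandit, horizon, epsilon=0.1, seed=1):
    """Illustrative player: explores with probability epsilon, else exploits."""
    rng = random.Random(seed)
    counts = [0] * bandit.num_arms       # number of pulls per arm
    estimates = [0.0] * bandit.num_arms  # running mean reward per arm
    total_reward = 0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(bandit.num_arms)  # explore: uniform random arm
        else:
            # Exploit: arm with the highest estimate (ties go to the lowest index).
            arm = max(range(bandit.num_arms), key=lambda a: estimates[a])
        reward = bandit.pull(arm)
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


if __name__ == "__main__":
    bandit = BernoulliBandit([0.2, 0.5, 0.8])
    counts, estimates, total_reward = run_epsilon_greedy(bandit, horizon=1000)
    print("pulls per arm:", counts)
    print("estimated means:", [round(e, 3) for e in estimates])
    print("total reward:", total_reward)
```

The player never reads the hidden means directly; it learns about them only through the rewards of the arms it pulls, which is the defining feature of the problem. Epsilon-greedy is used here only because it is short, not because it is the method studied in this work.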