In statistics, a “multi-armed bandit” problem (a reference to the “one-armed bandit,” a nickname for a slot machine) consists of determining which one of multiple “arms” or levers to select in each of a series of trials, where each lever provides a reward drawn from a probability distribution associated with that specific lever. The objective is generally to maximize the total reward earned over a sequence of lever pulls. Typically, one has no knowledge about the levers' reward distributions prior to the first trial. The decision of which lever to select at each trial therefore involves a tradeoff between “exploitation” of the lever that has the highest estimated expected reward based on previous trials, and “exploration” of other levers to get more information about the expected reward of each lever. While various strategies have been developed to provide approximate solutions to versions of the multi-armed bandit problem, these solutions often have limited applicability to specific real-world circumstances because they rely on particular constraints or assumptions regarding the underlying problem.
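The exploration/exploitation tradeoff described above can be illustrated with a minimal sketch of epsilon-greedy, one well-known heuristic among the many strategies for this problem. It is offered only as an illustration, not as the approach any particular system uses. The sketch assumes each lever pays a reward of 0 or 1 (a Bernoulli distribution) with a fixed success probability. The function name `epsilon_greedy`, its parameters, and the example probabilities are hypothetical and chosen for this example.

```python
import random


def epsilon_greedy(true_means, n_trials, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy player on levers with 0/1 rewards.

    true_means: assumed success probability of each lever (hidden from the player).
    Returns (total_reward, pull_counts, estimated_means).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of times each lever has been pulled
    estimates = [0.0] * n_arms   # running average reward observed per lever
    total_reward = 0

    for _ in range(n_trials):
        if rng.random() < epsilon:
            # Exploration: pick a lever uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pick the lever with the highest estimate so far
            # (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Draw a 0/1 reward from the chosen lever's assumed distribution.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Incrementally update the running average for that lever.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, counts, estimates


if __name__ == "__main__":
    total, counts, estimates = epsilon_greedy([0.2, 0.5, 0.8], n_trials=5000)
    print("total reward:", total)
    print("pulls per lever:", counts)
    print("estimated means:", [round(e, 3) for e in estimates])
```

A larger `epsilon` spends more trials gathering information about every lever, while a smaller one commits sooner to the lever that currently looks best. The fixed exploration rate and the stationary 0/1 rewards are exactly the kind of simplifying assumptions that can limit how well such a strategy transfers to real-world settings.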