In statistics, a “multi-armed bandit” problem (referencing the “one-armed bandit” term used for a slot machine) consists of determining which one of multiple “arms” or levers to select in each of a series of trials, where each lever provides a reward drawn from a distribution associated with that specific lever. The objective is generally to maximize the total reward earned through a sequence of pulls of the levers. Generally, one has no initial knowledge about the levers prior to the first trial. The decision of which lever to select at each trial involves a tradeoff between “exploitation” of the lever that has the highest expected reward based on previous trials, and “exploration” to get more information about the expected reward of each lever. While various strategies have been developed to provide approximate solutions to versions of the multi-armed bandit problem, these solutions often have limited applicability to specific real world circumstances due to their reliance on certain constraints or assumptions regarding the underlying problem.
Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. For example, a model may be implemented as a machine learning model. A machine learning algorithm may be used to learn a machine learning model from training data. The parameters of a machine learning model may be learned in a process referred to as training. For example, the parameters or weight values of a machine learning model may be learned using training data, such as historical data that includes input data and the correct or preferred output of the model for the corresponding input data. A machine learning model may be used to compute predictions based on historical data.