This specification relates to content recommendation systems.
Some existing systems treat content recommendation as a contextual bandits problem. In particular, these systems receive contextual information for a content recommendation and select an action, e.g., a piece of content to be recommended, based on the contextual information and on rewards received in response to previous content recommendations made by the system. The received rewards generally depend on how successful the content recommendation was, e.g., on whether a user clicked on a recommended advertisement, or on whether a user elected to view a recommended piece of media content.
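By way of illustration only, the select-then-reward loop described above can be sketched as follows. The sketch assumes an epsilon-greedy selection rule over discrete contexts and a running mean reward per (context, action) pair. This specification does not name that rule; it is used here only as one simple, well-known contextual bandits strategy. The class name `EpsilonGreedyRecommender`, its methods, and the example contexts and actions are hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyRecommender:
    """Illustrative contextual bandit with discrete contexts (assumed design)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)      # candidate pieces of content
        self.epsilon = epsilon            # probability of exploring at random
        self.rng = random.Random(seed)    # seeded for reproducibility
        self.counts = defaultdict(int)    # observations per (context, action)
        self.means = defaultdict(float)   # mean reward per (context, action)

    def select(self, context):
        """Select an action for the given context."""
        if self.rng.random() < self.epsilon:
            # Explore: recommend a uniformly random piece of content.
            return self.rng.choice(self.actions)
        # Exploit: recommend the content with the highest mean reward so far.
        # Ties resolve to the earliest action in the list.
        return max(self.actions, key=lambda a: self.means[(context, a)])

    def update(self, context, action, reward):
        """Fold a received reward into the running mean for (context, action)."""
        key = (context, action)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


# Hypothetical usage: a reward of 1.0 stands for a click, 0.0 for no click.
recommender = EpsilonGreedyRecommender(["ad_a", "ad_b"], epsilon=0.1)
chosen = recommender.select("sports_page")
recommender.update("sports_page", chosen, reward=1.0)
```

In this sketch the reward is supplied by the caller after the recommendation is shown, which mirrors the description above in which rewards depend on how the user responded to the recommended content.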