A typical prior art search system is configured to receive a search query from a client's computing device and to apply a ranking model that aggregates both pre-feedback features describing the content of web pages and prior-history features based on user behavior data stored in query logs to determine one or more web pages to be presented as responsive to the search query in the form of a Search Engine Results Page (SERP).
This leads to the following iterative process of interaction with users who repeatedly submit a particular query (albeit different users submitting it). At the first stage, when the query is relatively new to the system, the search engine ranks web resources by scores computed from their pre-feedback information only. At the second stage, it corrects this ranking using collected implicit feedback data. During this stabilizing phase, scores of top-ranked web resources that receive negative user feedback decrease, so these web resources are exchanged for other web resources with high pre-feedback-based scores. After the ranking algorithm has found enough web resources receiving mostly positive user feedback, the ranking stops changing, for two reasons: first, the ranking algorithm continues to receive only redundant confirmation of the top web resources' relatively high relevance; and, second, no web resource lacking prior-history features has a score higher than those of the web resources that were fortunate enough to acquire some.
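The two-stage behavior described above can be illustrated with a minimal sketch, assuming a hypothetical blend of a pre-feedback score and an observed click-through rate (all function and field names here are invented for illustration and are not taken from any cited reference):

```python
def rank(resources, feedback_clicks, feedback_shows, alpha=0.5):
    """Order web resources by a blend of pre-feedback score and CTR.

    Stage one: a resource with no recorded impressions is ranked by its
    pre-feedback (content-based) score alone. Stage two: once implicit
    feedback exists, the score is corrected toward the observed CTR.
    """
    def score(r):
        shows = feedback_shows.get(r["id"], 0)
        if shows == 0:
            # No user feedback yet: use pre-feedback information only.
            return r["pre_feedback_score"]
        ctr = feedback_clicks.get(r["id"], 0) / shows
        # Correct the ranking with collected implicit feedback.
        return (1 - alpha) * r["pre_feedback_score"] + alpha * ctr
    return sorted(resources, key=score, reverse=True)
```

In this sketch, a top-ranked resource that accumulates impressions without clicks sees its blended score drop below that of an unshown resource with a high pre-feedback score, reproducing the exchange of resources in the stabilizing phase.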
That being said, pre-feedback information cannot fully reflect all aspects of the web resources that potentially impact user satisfaction. Therefore, even though some web resources lacking user feedback may be more relevant than those ranked higher, such web resources are rarely displayed to the user performing a search.
Thus, the inaccurate display of search results can cause the user to repeat the search, consequently resulting in increased consumption of energy and bandwidth.
US2011/0196733 (Li et al.) discloses a system that divides a ranked group of online messages into a first list, a second list, and a promotion set. Each message in the first list has a performance score that is greater than each performance score of messages in the second list and the promotion set. The system moves a message within the promotion set to a third list as a function of a confidence value and moves a message from one of the third list and the second list to the first list based on an experiment event outcome. The system transmits top messages in the first list over a network for display at a recipient computer (abstract).
US2014/0280548 (Langlois et al.) discloses a method and system for exploring a list of user interests beyond the currently known user interests by defining a distance metric in the interest space. The method and system target, for exploration, items of interest that are close in proximity to the current set of user interests, thereby greatly improving the chance that one of the exploration items will be liked by the user.
WO2013189261 (Ioannidis et al.) discloses a method of selection that maximizes an expected reward in a contextual multi-armed bandit setting by gathering rewards from randomly selected items in a database of items, where the items correspond to arms of the bandit. Initially, an item is selected at random and is transmitted to a user device, which generates a reward. The items and resulting rewards are recorded. Subsequently, a context is generated by the user device, which causes a learning and selection engine to calculate an estimate for each arm in that context, the estimate being calculated using the recorded items and resulting rewards. Using the estimates, an item from the database is selected and transferred to the user device. The selected item is chosen to maximize the probability of a reward from the user device.
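The record-then-estimate step of such a contextual bandit can be sketched as follows, assuming a simple empirical per-arm reward estimate within a context (the history format and function name are hypothetical, not Ioannidis et al.'s actual formulation):

```python
def select_arm(history, context):
    """Select the arm with the highest empirical reward in a context.

    history: list of (context, arm, reward) tuples recorded from
    previously transmitted items and the rewards they generated.
    Returns None if no reward has been recorded for this context.
    """
    totals, counts = {}, {}
    for ctx, arm, reward in history:
        if ctx == context:
            totals[arm] = totals.get(arm, 0.0) + reward
            counts[arm] = counts.get(arm, 0) + 1
    if not totals:
        return None  # no recorded data for this context yet
    # Pick the arm maximizing the estimated reward probability.
    return max(totals, key=lambda a: totals[a] / counts[a])
```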
U.S. Pat. No. 7,707,131 (Chickering et al.) discloses a system and method for online reinforcement learning. In particular, a method for performing the explore-vs.-exploit tradeoff is provided. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model (e.g., Bayesian network model).
US20110264639 (Slivkins et al.) discloses a document selector that selects and ranks documents that are relevant to a query. The document selector executes an instance of a multi-armed bandits algorithm to select a document for each slot of a results page according to one or more strategies. The documents are selected in an order defined by the results page and documents selected for previous slots are used to guide the selection of a document for a current slot. If a document in a slot is subsequently selected, the strategy used to select the document is rewarded with positive feedback. When the uncertainty in an estimate of the utility of a strategy is less than the variation between documents associated with the strategy, the strategy is subdivided into multiple strategies. The document selector is able to “zoom in” on effective strategies and provide more relevant search results.
US20120016642 (Li et al.) discloses methods and apparatus for performing computer-implemented personalized recommendations. User information pertaining to a plurality of features of a plurality of users may be obtained. In addition, item information pertaining to a plurality of features of a plurality of items may be obtained. A plurality of sets of coefficients of a linear model may be obtained based at least in part on the user information and/or the item information such that each of the plurality of sets of coefficients corresponds to a different one of the plurality of items, where each of the plurality of sets of coefficients includes a plurality of coefficients, each of the plurality of coefficients corresponding to one of the plurality of features. In addition, at least one of the plurality of coefficients may be shared among the plurality of sets of coefficients for the plurality of items. Each of a plurality of scores for a user may be calculated using the linear model based at least in part upon a corresponding one of the plurality of sets of coefficients associated with a corresponding one of the plurality of items, where each of the plurality of scores indicates a level of interest in a corresponding one of the plurality of items. A plurality of confidence intervals may be ascertained, each of the plurality of confidence intervals indicating a range representing a level of confidence in a corresponding one of the plurality of scores associated with a corresponding one of the plurality of items. One of the plurality of items for which a sum of a corresponding one of the plurality of scores and a corresponding one of the plurality of confidence intervals is highest may be recommended.
U.S. Pat. No. 8,001,001 (Brady et al.) discloses an improved system and method for using sampling for allocating web page placements in online publishing of content. A multi-armed bandit engine may be provided for sampling content items by allocating web page placements of varying quality for content items and optimizing the payoff to maximize revenue. Publishers may provide content items to be published and report their valuation per click. Through a process of valuation discovery, the click-through rate for content items and the value of content items may be learned through sampling. As the process of valuation discovery progresses, the system may more closely approximate the click-through rates for content items in order to allocate web page placements to content items in a way that optimizes content layout by maximizing revenue. The system may accurately learn the CTR for new content items and support multiple web page placements of varying quality.
U.S. Pat. No. 8,923,621 (Slaney et al.) relates to software for initialized explore-exploit that creates a plurality of probability distributions. Each of these probability distributions is generated by inputting a quantitative description of one or more features associated with an image into a regression model that outputs a probability distribution for a measure of engagingness for the image. Each of the images is conceptually related to the other images. The software uses the plurality of probability distributions to initialize a multi-armed bandit model that outputs a serving scheme for each of the images. The software then serves a plurality of the images on a web page displaying search results, based at least in part on the serving scheme.
US20090043597 (Agarwal et al.) relates to an improved system and method for matching objects using a cluster-dependent multi-armed bandit. The matching may be performed by using a multi-armed bandit where the arms of the bandit may be dependent. In an embodiment, a set of objects segmented into a plurality of clusters of dependent objects may be received, and a two-step policy may then be employed by the multi-armed bandit: first running over the clusters of arms to select a cluster, and second picking a particular arm inside the selected cluster. The multi-armed bandit may exploit dependencies among the arms to efficiently support exploration of a large number of arms. Various embodiments may include policies for discounted rewards and policies for undiscounted rewards. These policies may consider each cluster in isolation during processing, and consequently may dramatically reduce the size of a large state space for finding a solution.
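The two-step policy described above (select a cluster, then an arm within it) can be sketched minimally with an epsilon-greedy rule at both levels; this is a hypothetical simplification, not Agarwal et al.'s actual policy, and all names are invented:

```python
import random

def two_step_pick(cluster_values, arm_values, eps=0.1, rng=random):
    """Two-step bandit policy: pick a cluster, then an arm inside it.

    cluster_values: estimated value per cluster.
    arm_values:     list of per-arm value lists, one list per cluster.
    With probability eps the policy explores (uniform random choice);
    otherwise it exploits the current best estimate.
    """
    def pick(values):
        if rng.random() < eps:
            return rng.randrange(len(values))  # explore
        return max(range(len(values)), key=lambda i: values[i])  # exploit
    c = pick(cluster_values)          # step 1: run over clusters
    a = pick(arm_values[c])           # step 2: pick an arm in that cluster
    return c, a
```

Because each cluster's arms are considered only after the cluster is selected, the state space handled at each step stays small, which mirrors the reduction the reference describes.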
US20100250523 (Jin et al.) relates to an improved system and method for learning a ranking model that optimizes a ranking evaluation metric for ranking search results of a search query. An optimized nDCG ranking model that optimizes an approximation of an average nDCG ranking evaluation metric may be generated from training data through an iterative boosting method for learning to more accurately rank a list of search results for a query. A combination of weak ranking classifiers may be iteratively learned that optimizes an approximation of an average nDCG ranking evaluation metric for the training data by training a weak ranking classifier at each iteration for each document in the training data with a computed weight and assigned class label, and then updating the optimized nDCG ranking model by adding the weak ranking classifier with a combination weight to the optimized nDCG ranking model.
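For reference, the nDCG metric that such a model optimizes can be computed as below. The gain/discount form shown (the common 2^rel − 1 gain with a log2 position discount) is an assumption, since the patent text does not fix a formula:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0
```

A perfectly ordered list scores 1.0, and any misordering scores strictly less, which is what makes the metric a ranking-quality objective.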
U.S. Pat. No. 8,473,486 (He et al.) relates to a supervised technique uses relevance judgments to train a dependency parser such that it approximately optimizes Normalized Discounted Cumulative Gain (NDCG) in information retrieval. A weighted tree edit distance between the parse tree for a query and the parse tree for a document is added to a ranking function, where the edit distance weights are parameters from the parser. Using parser parameters in the ranking function enables approximate optimization of the parser's parameters for NDCG by adding some constraints to the objective function.