In recent years, A/B testing has become the state-of-the-art technique for improving web services based on data-driven decisions. An A/B test compares two variants of a service at a time, usually its current version (a control version, or version A) and a new one (a test or treatment version to which an experimental treatment has been applied, or version B), by exposing them to two groups of users. A/B tests are used by many web service providers, including search engines, e-commerce sites and social networks, such as Amazon™, eBay™, Facebook™, Google™, LinkedIn™, Microsoft™, Netflix™, Yahoo™ and Yandex™. The largest web services have designed special experimental platforms that allow them to run A/B tests at large scale.
One aim of controlled A/B experiments is to detect the causal effect on user engagement of experimental treatments applied to the web service. A challenging problem is choosing an evaluation criterion (a metric) that is applicable in practice, since it has to meet two crucial requirements, which often conflict.
First, the criterion should provide a quantitative value that allows conclusions to be drawn about the change in the system's quality and, in particular, about the sign of that change. In other words, the value of the criterion must have a clear interpretation and be consistent with user preferences. This property is referred to as the directionality or interpretability of the metric. It is known in the art that many criteria may result in contradictory interpretations and that their use in practice may be misleading; therefore, choosing an appropriate criterion is a difficult task.
Second, when a treatment effect exists (e.g., an effect of modifications on user behavior), the criterion has to detect the difference between the two versions of the system at a high level of statistical significance, in order to distinguish the treatment effect from the noise observed when the effect does not exist. This property is referred to as the sensitivity of the metric. A common problem is low metric sensitivity when only a subtle modification is being tested or when only a small amount of user traffic is affected by the system change.
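By way of a non-limiting illustration only, the notion of sensitivity can be sketched with a two-sample z-test on a per-user metric. The sketch below is an assumption made for explanatory purposes and is not drawn from any of the cited references: the function names `two_sample_z_test` and `simulate`, the Gaussian per-user metric with mean 10 and standard deviation 3, and the use of a normal approximation (reasonable only for large groups) are all hypothetical choices.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_z_test(control, treatment):
    """Return (difference in means, z statistic, two-sided p-value).

    Uses unequal-variance standard errors and a normal approximation,
    which is only appropriate when both groups are large.
    """
    diff = mean(treatment) - mean(control)
    std_err = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = diff / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, z, p_value


def simulate(effect, n_users, seed=0):
    """Simulate a hypothetical per-user metric for versions A and B.

    The Gaussian metric (mean 10, standard deviation 3) is an
    illustrative assumption, not a model of real user behavior.
    """
    rng = random.Random(seed)
    control = [rng.gauss(10.0, 3.0) for _ in range(n_users)]
    treatment = [rng.gauss(10.0 + effect, 3.0) for _ in range(n_users)]
    return two_sample_z_test(control, treatment)


if __name__ == "__main__":
    # A subtle effect on few users versus the same effect on many users.
    for n_users in (200, 200_000):
        diff, z, p_value = simulate(effect=0.05, n_users=n_users)
        print(f"n={n_users}: diff={diff:.4f}, z={z:.2f}, p={p_value:.4f}")
```

In this sketch the standard error shrinks in proportion to the square root of the group size, which is why a subtle treatment effect, or one that touches only a small share of traffic, tends to be hard to separate from noise.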
U.S. Pat. No. 8,396,875 titled “Online stratified sampling for classifier evaluation” by Bennett et al. teaches determining whether a set of items belongs to a class of interest, where the set of items is binned into sub-populations based on a score, ranking, or trait associated with each item. The sub-populations may be created based on the score associated with each item, such as an equal score interval, or based on the distribution of the items within the overall population, such as a proportion interval. A determination is made of how many samples are needed from each sub-population in order to make an estimation regarding the entire set of items. The precision and variance are then calculated for each sub-population and combined to provide an overall precision and variance value for the overall population.
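For orientation only, the general idea of binning by score and combining per-bin estimates can be sketched with a textbook stratified sampling estimator. This sketch is not taken from the cited patent and does not purport to reproduce its method: the function names `bin_by_score` and `stratified_precision`, the assumption that scores lie in [0, 1], the binary labels, and the fixed number of samples per bin are all hypothetical.

```python
import random


def bin_by_score(items, n_bins):
    """Split (score, label) pairs into equal-score-interval bins.

    Assumes scores lie in [0, 1]; a score of exactly 1.0 goes to the last bin.
    """
    bins = [[] for _ in range(n_bins)]
    for score, label in items:
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))
    return bins


def stratified_precision(bins, samples_per_bin, rng):
    """Estimate the overall fraction of positive labels and its variance.

    Each non-empty bin is sampled without replacement; per-bin estimates
    are combined using the bin's share of the population as its weight.
    """
    total = sum(len(b) for b in bins)
    estimate = 0.0
    var = 0.0
    for b in bins:
        if not b:
            continue
        n = min(samples_per_bin, len(b))
        sample = rng.sample(b, n)
        p = sum(label for _, label in sample) / n
        weight = len(b) / total
        estimate += weight * p
        if n > 1:
            # Finite population correction: zero when the whole bin is sampled.
            fpc = 1.0 - n / len(b)
            var += weight ** 2 * fpc * p * (1.0 - p) / (n - 1)
        # With a single sample the bin's variance cannot be estimated,
        # so its contribution is omitted in this sketch.
    return estimate, var


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: higher-scored items are more likely to be positive.
    items = []
    for _ in range(10_000):
        score = rng.random()
        items.append((score, 1 if rng.random() < score else 0))
    estimate, var = stratified_precision(bin_by_score(items, 10), 50, rng)
    print(f"estimated precision={estimate:.3f}, variance={var:.6f}")
```

The sketch uses the same number of samples for every bin for simplicity; how many samples to draw from each sub-population is itself a design decision that the sketch does not address.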
U.S. Patent Publication No. 2016/253311 titled “Most impactful experiments” by Xu et al. teaches techniques for conducting A/B experimentation of online content. According to various embodiments, a user specification of a metric associated with operation of an online social networking service is received. A set of one or more A/B experiments of online content is then identified, each A/B experiment being targeted at a segment of members of the online social networking service. Thereafter, each of the A/B experiments is ranked, based on an inferred impact on the value of the metric in response to application of a treatment variant of each A/B experiment to the online social networking service. A list of one or more of the ranked A/B experiments is then displayed, via a user interface displayed on a client device.