In recent years, A/B testing has become the state-of-the-art technique for improving web services based on data-driven decisions. An A/B test compares two variants of a service at a time, usually its current version (a control version) and a new one (a test version having an experimental treatment applied thereto), by exposing each variant to a separate group of users. A/B tests are utilized by many web service providers, including, for example, search engine companies and social networks. The largest web services have designed special experimental platforms that allow them to run A/B tests at large scale.
One aim of controlled A/B experiments is to detect the causal effect on user engagement of experimental treatments applied to the web service. A challenging problem is to choose an evaluation criterion that is applicable in practice, since the criterion has to meet two crucial requirements, which often conflict.
First, the criterion should provide a quantitative value that allows conclusions to be drawn about the change in the system's quality, in particular about the sign and magnitude of that change. In other words, the value of the criterion must have a clear interpretation. It is known in the art that many criteria may result in contradictory interpretations and that their use in practice may be misleading; therefore, choosing an appropriate criterion is a difficult task.
Second, when a treatment effect exists (e.g., an effect of modifications on user behavior), the criterion has to detect the difference between the two versions of the system at a high level of statistical significance, in order to distinguish the treatment effect from the noise observed when no effect exists. This property is referred to as the sensitivity of the metric. A common problem is low metric sensitivity in cases where only a subtle modification is being tested or where only a small amount of user traffic is affected by the system change.
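The sensitivity problem can be illustrated with a minimal sketch. It assumes a two-sample z-test on the difference in means with a normal approximation, which is one common way to assess significance and is not named in the text above. The function names `two_sample_z_test` and `synthetic_groups`, the synthetic per-user values, the effect size of 0.1, and the 0.05 threshold are all hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z_test(control, treatment):
    """Return (z, two-sided p-value) for the difference in means.

    Uses a normal approximation, which is an assumption that is
    reasonable only for large samples.
    """
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return z, p_value


def synthetic_groups(n_users, effect=0.1):
    """Hypothetical per-user metric values: a fixed pattern plus a small shift."""
    control = [float(i % 10) for i in range(n_users)]
    treatment = [x + effect for x in control]
    return control, treatment


# The same subtle effect, measured on a small and on a large user group.
for n_users in (100, 40_000):
    z, p = two_sample_z_test(*synthetic_groups(n_users))
    print(f"n={n_users:>6}  z={z:.2f}  p={p:.3g}  significant={p < 0.05}")
```

In this sketch the same small shift is not significant at the 0.05 level with 100 users per group but is significant with 40,000 users per group, which is the sense in which a subtle modification or a small amount of affected traffic leads to low sensitivity.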
The state-of-the-art criteria for evaluating the performance of the two versions are generally based on mean values of the user behavior metrics. However, a disadvantage of these criteria is that the mean value of a user behavior metric, averaged over the experimental period, may not reflect the general trend of user engagement over that period.
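A small sketch can make this limitation concrete. It uses two hypothetical daily engagement series and a least-squares slope as one possible, purely illustrative, measure of trend; neither the data nor the helper `linear_trend` comes from the text above.

```python
from statistics import mean


def linear_trend(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical daily engagement values over a seven-day experiment.
rising = [1, 2, 3, 4, 5, 6, 7]
falling = [7, 6, 5, 4, 3, 2, 1]

for name, series in (("rising", rising), ("falling", falling)):
    print(f"{name}: mean={mean(series)}, slope={linear_trend(series):+.2f}")
```

Both series have the same mean over the period, yet one grows steadily and the other declines, so a criterion based only on the period mean cannot tell them apart.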