In recent years, A/B testing has become the state-of-the-art technique for improving web services through data-driven decisions. It is used by many web companies, including search engines such as Bing and Google and social networks such as Facebook. The largest web services have built dedicated experimental platforms that allow them to run A/B tests at large scale. An A/B test compares two variants of a service at a time, usually its current version (control) and a new one (treatment), by exposing them to two groups of users.
The aim of controlled experiments is to detect the causal effect of a system update on the system's performance, relying on a criterion that correlates with the quality of the system. Choosing an appropriate criterion that is applicable in practice is a challenging problem, since it has to meet two crucial requirements that often conflict.
First, the criterion should provide a quantitative value from which conclusions can be drawn about the change in the system's quality, in particular about the sign and magnitude of that change. In other words, the value of the criterion must have a clear interpretation. It is known in the art that many criteria may result in contradictory interpretations and that their use in practice may be misleading; therefore, choosing an appropriate criterion is a difficult task.
Second, when a treatment effect exists (e.g., an effect of modifications on user behavior), the criterion has to detect the difference between the two versions of the system at a high level of statistical significance, so that the treatment effect can be distinguished from the noise observed when no effect exists. This property is referred to as the sensitivity of the metric. Low metric sensitivity is a common problem when only a subtle modification is being tested or when only a small share of user traffic is affected by the system change.
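To make the sensitivity requirement concrete, the following minimal sketch estimates how many users each group needs before a given difference in means becomes detectable. It assumes a two-sided two-sample z-test with equal per-user variance in both groups, which is a textbook approximation and not a method taken from this text. The function name, the effect sizes, and the default significance level and power are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def required_users_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided two-sample z-test.

    delta: absolute difference in metric means to detect (hypothetical).
    sigma: per-user standard deviation, assumed equal in both groups.
    alpha: significance level; power: probability of detecting delta.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    # Standard approximation: n = 2 * ((z_alpha + z_beta) * sigma / delta)^2
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical effect sizes: a pronounced change and a subtle one.
large = required_users_per_group(delta=0.10, sigma=1.0)
subtle = required_users_per_group(delta=0.01, sigma=1.0)
print(large, subtle)
```

Under these assumptions the required sample grows with the inverse square of the effect size, so a tenfold smaller effect needs roughly a hundred times as many users. This is why subtle modifications are hard to detect.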
The state-of-the-art criteria for evaluating the performance of the two versions are generally based on mean values of the user behavior metrics. However, a fundamental disadvantage of these criteria is that the mean values of the user behavior metrics may not necessarily change, even if their distributions change significantly.
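The point can be illustrated with a toy example in which two samples share the same mean but have clearly different distributions. The data below are hypothetical, and the Kolmogorov-Smirnov-style statistic (the largest gap between the two empirical distribution functions) is used here only as one familiar way to quantify a distributional difference. It is not presented as the criterion proposed in this text.

```python
from statistics import mean


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical per-user values of a behavior metric (e.g. clicks per session).
control = [2, 2, 2, 2, 2, 2, 2, 2]    # every user behaves alike
treatment = [0, 0, 0, 0, 4, 4, 4, 4]  # users split into two extremes

mean_diff = mean(treatment) - mean(control)  # difference of means
ks = ks_statistic(control, treatment)        # difference of distributions
print(mean_diff, ks)
```

Here the difference of means is zero, so a mean-based criterion reports no change, while the empirical distribution functions differ by as much as 0.5.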
In the case of a search engine, there is a variety of components, and modifying them may affect the distribution of a user behavior metric in different ways. The most important components of a Search Engine Results Page (SERP) are those that present data from different sources: organic search results, advertising results, vertical results, and others. If an update affects only the advertising results, it is difficult to assess its effect on the overall quality of the SERP, because advertising accounts for less than 10% of search engine traffic. Therefore, the problem of low sensitivity of the chosen criterion becomes particularly acute in this case.
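The dilution described above can be sketched with a simple calculation. It assumes that the change shifts the metric only for the affected share of users, that the per-user standard deviation is the same in both groups and is not altered by the treatment, and that the criterion is a difference of means over all traffic. The function name and all numbers are hypothetical; the 10% share merely mirrors the upper bound mentioned in the text.

```python
from math import sqrt


def diluted_z_score(local_effect, affected_share, sigma, users_per_group):
    """Expected z-score of the mean difference measured over all traffic.

    local_effect: mean change among affected users (hypothetical).
    affected_share: fraction of traffic that sees the changed component.
    sigma: per-user standard deviation, assumed equal in both groups.
    users_per_group: number of users in each of control and treatment.
    """
    overall_effect = affected_share * local_effect
    standard_error = sigma * sqrt(2 / users_per_group)
    return overall_effect / standard_error


# Same local effect, once touching all traffic and once only 10% of it.
z_all = diluted_z_score(0.05, 1.0, 1.0, 10_000)
z_ads = diluted_z_score(0.05, 0.1, 1.0, 10_000)
print(z_all, z_ads)
```

Under these assumptions the expected z-score shrinks in proportion to the affected share, so an effect that would be clearly significant if it touched all traffic falls far below the usual 1.96 threshold when it touches only a tenth of it.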