In digital medium environments, service providers strive to provide digital content that is of interest to users. An example of this is digital content used in a marketing context in order to increase a likelihood of conversion. Examples of conversion include interaction of a user with the content (e.g., a “click-through”), purchase of a product or service that pertains to the digital content, and so forth. A user, for instance, may navigate through webpages of a website of a service provider. During this navigation, the user is exposed to an advertisement relating to a product or service. If the advertisement is of interest to the user, the user may select the advertisement to navigate to webpages that contain more information about the product or service that is a subject of the advertisement, functionality usable to purchase the product or service, and so forth. Each of these selections thus involves conversion of interaction of the user with respective digital content into other interactions with other digital content and/or even purchase of the product or service. Thus, configuration of the advertisements in a manner that is likely to be of interest to the users increases the likelihood of conversion of the users regarding the product or service.
In another example of digital content and conversion, users may agree to receive emails or other electronic messages relating to products or services provided by the service provider. The user, for instance, may opt in to receive emails of marketing campaigns corresponding to a particular brand of product or service. Likewise, success in conversion of the users towards the product or service that is a subject of the emails directly depends on interaction of the users with the emails. Since this interaction is closely tied to a level of interest the user has in the emails, configuration of the emails in a manner that is likely to be of interest to the users also increases the likelihood of conversion of the users regarding the product or service.
Testing techniques have been developed in order to determine which items of digital content are likely to be of interest to users. An example of this is A/B testing in which different items of digital content are provided to different sets of users. An effect of the different items of the digital content on conversion by the different sets is then compared to determine which of the items has a greater likelihood of being of interest to users, e.g., resulting in conversion.
A/B testing involves comparison of two or more options, e.g., a baseline digital content option “A” and an alternative digital content option “B.” In a marketing scenario, the two options include different digital marketing content such as advertisements having different offers, e.g., digital content option “A” may specify 20% off this weekend and digital content option “B” may specify buy one/get one free today.
Digital content options “A” and “B” are then provided to different sets of users, e.g., using advertisements on a webpage, emails, and so on. Testing may then be performed through use of a hypothesis. Hypothesis testing involves testing validity of a claim (i.e., a null hypothesis) that is made about a population in order to reject or fail to reject the claim. For example, a null hypothesis “H0” may be defined in which a conversion rate of the baseline is equal to a conversion rate of the alternative, i.e., “H0: A=B”. An alternative hypothesis “H1” is also defined in which the conversion rate of the baseline is not equal to the conversion rate of the alternative, i.e., “H1: A≠B.”
Based on the responses from these users, a determination is made whether to reject or not reject the null hypothesis. Rejection of the null hypothesis indicates that a difference has been observed between the options, i.e., the null hypothesis that both options are equal is wrong. This rejection takes into account accuracy guarantees that Type I and/or Type II errors are minimized within a defined level of confidence, e.g., to ninety-five percent confidence that these errors do not occur. A Type I error “α” is the probability of rejecting the null hypothesis when it is in fact correct, i.e., a “false positive.” A Type II error “β” is the probability of not rejecting the null hypothesis when it is in fact incorrect, i.e., a “false negative.” From this, a determination is made as to which of the digital content options is the “winner” based on a statistic, e.g., a conversion rate.
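As an illustration of the reject/do-not-reject decision described above, the following is a minimal sketch only. It assumes a two-sided, pooled two-proportion z-test, which is one common choice and is not specified by the description above. The function name `two_proportion_z_test` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided test of H0: A = B against H1: A != B on conversion rates.

    Assumption: a pooled two-proportion z-test; other statistics are possible.
    Returns (z, p_value, reject_h0).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the estimate of the common conversion rate if H0 is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the statistic is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value: chance of a |z| at least this large when H0 holds.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Rejecting only when p_value < alpha bounds the Type I error at alpha.
    return z, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts: 200 of 2000 convert on "A", 260 of 2000 on "B".
    z, p_value, reject = two_proportion_z_test(200, 2000, 260, 2000)
    print(f"z={z:.3f} p={p_value:.4f} reject H0: {reject}")
```

In this sketch the Type I error is controlled by the choice of `alpha`, whereas the Type II error depends on the sample size and on the true difference between the options.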
A common form of A/B testing is referred to as fixed-horizon hypothesis testing. In fixed-horizon hypothesis testing, inputs are provided manually by a user and the test is then “run” over a defined number of samples (i.e., the “horizon”) until the test is completed. These inputs include a confidence level that refers to a percentage of all possible samples that can be expected to include the true population parameter, e.g., “1−Type I error” which is equal to “1−α”. The inputs also include a power (i.e., statistical power) that defines a sensitivity of a hypothesis test, i.e., a probability that the test correctly rejects the null hypothesis and thus avoids a false negative, which may be defined as “1−Type II error” which is equal to “1−β”. The inputs further include a baseline conversion rate (e.g., “μA”) which is the statistic being tested in this example. A minimum detectable effect (MDE) is also entered as an input that defines a “lift” that can be detected with the specified power and defines a desirable degree of insensitivity as part of calculation of the confidence level. Lift is formally defined based on the baseline conversion rate as “|μB−μA|/μA.”
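The relationship between these inputs may be sketched as follows. This is an illustrative sketch only; the class name `FixedHorizonInputs`, its field names, and the helper `lift` are hypothetical and are chosen solely to mirror the quantities named above.

```python
from dataclasses import dataclass


def lift(mu_a, mu_b):
    """Lift relative to the baseline conversion rate: |mu_B - mu_A| / mu_A."""
    return abs(mu_b - mu_a) / mu_a


@dataclass(frozen=True)
class FixedHorizonInputs:
    """Hypothetical container for the manually provided test inputs."""
    confidence_level: float  # 1 - alpha
    power: float             # 1 - beta
    baseline_rate: float     # mu_A
    mde: float               # minimum detectable effect, expressed as a lift

    @property
    def alpha(self):
        # Type I error implied by the confidence level.
        return 1 - self.confidence_level

    @property
    def beta(self):
        # Type II error implied by the power.
        return 1 - self.power

    def smallest_detectable_rate(self):
        # Alternative rate mu_B whose lift over mu_A equals the MDE,
        # assuming the alternative is an improvement over the baseline.
        return self.baseline_rate * (1 + self.mde)


if __name__ == "__main__":
    inputs = FixedHorizonInputs(0.95, 0.80, 0.10, 0.20)
    print(inputs.alpha, inputs.beta, inputs.smallest_detectable_rate())
```

For instance, with a baseline conversion rate of ten percent, an MDE of twenty percent corresponds in this sketch to an alternative conversion rate of about twelve percent.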
From these inputs, a horizon “N” is calculated that specifies a sample size per option (e.g., a number of visitors per digital content option “A” or “B”) required to detect the specified lift of the MDE with the specified power. Based on this horizon “N”, “N” samples are collected per option (e.g., visitors per offer) and the null hypothesis H0 is rejected if “ΛN≥γ,” where “ΛN” is the statistic being tested at time “N” and “γ” is a decision boundary that is used to define the “winner” subject to the confidence level.
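The calculation of the horizon “N” is not spelled out above. As a hedged illustration, the following sketch uses a common textbook normal approximation for a two-sided comparison of two proportions, which is an assumption and not necessarily the calculation contemplated here. The function name `horizon_per_option` and its parameters are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def horizon_per_option(baseline_rate, mde, confidence_level=0.95, power=0.80):
    """Approximate samples per option needed to detect a lift equal to the MDE.

    Assumption: two-sided test on two proportions, normal approximation,
    equal sample sizes for options "A" and "B".
    """
    alpha = 1 - confidence_level
    mu_a = baseline_rate
    mu_b = baseline_rate * (1 + mde)  # rate implied by the MDE
    if not (0 < mu_a < 1 and 0 < mu_b < 1 and mde > 0):
        raise ValueError("rates must lie in (0, 1) and the MDE must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # quantile for the Type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for the Type II error
    mean_rate = (mu_a + mu_b) / 2
    numerator = (
        z_alpha * sqrt(2 * mean_rate * (1 - mean_rate))
        + z_beta * sqrt(mu_a * (1 - mu_a) + mu_b * (1 - mu_b))
    ) ** 2
    return ceil(numerator / (mu_b - mu_a) ** 2)


if __name__ == "__main__":
    print(horizon_per_option(0.10, 0.20))  # MDE of twenty percent
    print(horizon_per_option(0.10, 0.10))  # a smaller MDE needs a longer horizon
```

Under this approximation the horizon grows roughly with the inverse square of the absolute difference between the conversion rates, so that a smaller MDE or a higher power results in a larger “N.”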
Fixed-horizon hypothesis testing has a number of drawbacks. In a first example drawback, a user that configures the test is forced to commit to a set value of the minimum detectable effect before the test is run. Further, this commitment may not be changed as the test is run. However, if the minimum detectable effect is overestimated, this test procedure is inaccurate in the sense that it possesses a significant risk of missing smaller improvements. If underestimated, this procedure is data-inefficient because a greater amount of time may be consumed to process additional samples in order to determine significance of the results.
In a second example drawback, fixed-horizon hypothesis testing is required to run until the horizon “N” is met, e.g., a set number of samples is collected and tested. To do otherwise introduces errors, such as violation of a guarantee against Type I errors. For example, as the test is run, the results may fluctuate above and below a decision boundary that is used to reject a null hypothesis. Accordingly, a user that stops the test in response to these fluctuations before reaching the horizon “N” may violate a Type I error guarantee, e.g., a guarantee that at least a set amount of the calculated statistics do not include false positives.
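The effect of stopping early may be illustrated by simulation. The following is a minimal sketch under stated assumptions: both options share the same true conversion rate (so that every rejection is a false positive), the statistic is a pooled two-proportion z-statistic, and the user checks the results at regular intervals. The function name `simulate_false_positive_rates` and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_false_positive_rates(rate=0.10, horizon=1000, look_every=50,
                                  alpha=0.05, trials=1000, seed=0):
    """Compare Type I error when testing once at the horizon vs. at every look.

    Both options convert at the same true `rate`, so H0 is true throughout.
    Returns (fixed_horizon_rate, stop_at_first_crossing_rate).
    """
    if horizon % look_every != 0:
        raise ValueError("horizon must be a multiple of look_every")
    rng = random.Random(seed)
    gamma = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided decision boundary
    fixed_rejections = 0
    peeking_rejections = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        crossed_at_any_look = False
        z = 0.0
        for n in range(1, horizon + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % look_every == 0:
                pooled = (conv_a + conv_b) / (2 * n)
                std_err = sqrt(pooled * (1 - pooled) * 2 / n)
                z = (conv_b - conv_a) / n / std_err if std_err > 0 else 0.0
                if abs(z) >= gamma:
                    crossed_at_any_look = True
        # After the loop, z holds the statistic computed at the horizon.
        fixed_rejections += abs(z) >= gamma
        # A user who stops at the first crossing rejects if any look crossed.
        peeking_rejections += crossed_at_any_look
    return fixed_rejections / trials, peeking_rejections / trials


if __name__ == "__main__":
    fixed, peeking = simulate_false_positive_rates()
    print(f"fixed horizon: {fixed:.3f}, stop at first crossing: {peeking:.3f}")
```

In such a simulation, the false positive rate observed when testing only at the horizon is expected to remain near the chosen “α,” whereas the rate observed when stopping at the first crossing of the decision boundary is expected to be noticeably higher.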