Businesses increasingly use testing to learn whether an initiative is effective. A business tries an idea in a subset of its operations and then compares the change in that test subset's performance over time with the change in the non-test subsets over the same period, in order to measure the impact of the test and inform “go forward” decisions.
In one conventional example, the business is a retailer, and the subset of the business in which the test occurs is a set of test stores, which are then compared to control stores. The difference between the change in performance in the test stores after the tested program is implemented and the change in performance in the control stores over the same time is then attributed to the tested program as its impact.
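The test-versus-control comparison described above can be illustrated with a minimal sketch. It assumes that "change in performance" means the relative change in average sales between a pre-period and a post-period; the text does not fix that definition, and the function names and sales figures are hypothetical.

```python
from statistics import mean

def relative_change(pre, post):
    """Relative change in average performance from the pre-period to the post-period."""
    return mean(post) / mean(pre) - 1.0

def measured_impact(test_pre, test_post, control_pre, control_post):
    """Change in the test stores minus the change in the control stores."""
    return relative_change(test_pre, test_post) - relative_change(control_pre, control_post)

# Hypothetical weekly sales (illustrative numbers only).
test_pre, test_post = [100.0, 100.0], [110.0, 110.0]        # test stores grew 10%
control_pre, control_post = [200.0, 200.0], [210.0, 210.0]  # control stores grew 5%

impact = measured_impact(test_pre, test_post, control_pre, control_post)
print(f"impact attributed to the program: {impact:.2%}")
```

Under these assumptions, the five points of growth that the control stores also experienced are treated as the baseline, and only the remaining difference is attributed to the program.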
In order to have an accurate measurement of test effectiveness, the measurement approach needs to set an effective baseline for how the test stores would have performed had they not received the test initiative. In other words, it is desirable to have a measurement approach such that if a program were to have no effect, the approach would measure no difference in the change in performance between the test and control stores. Therefore, it is desirable to pick the time periods, methods for handling seasonality, methods for handling “outliers” (unusual performance points), and strategy for selecting control stores that set the most accurate baseline for how the test stores would have performed had they not implemented the initiative being measured.
As an example, one aspect of the measurement approach is selecting the control stores, which did not receive the program, whose change in performance will be compared against the change in performance in the test stores. How control stores are selected can be an important, but difficult, question. Control stores could be paired to test stores based on their similarity to the test store on certain matching characteristics. A variety of characteristics could be considered most relevant in selecting the control stores that are most similar to each test store. For example, control stores could be matched to test stores based on similar financial patterns (e.g., sales), such as seasonality (variation due to time of year), slope (general linear trend during a period of time), and volume. Store attributes such as demographics and physical location are time-invariant characteristics that can also be used for store matching. From a candidate control pool, control stores are found for each test store by calculating a distance measure composed of financial patterns and/or store attributes and selecting the stores with the minimum distance to that test store. The set of features to match on, called a control strategy, can include any number of features, weighted in any arbitrary manner.
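The matching step can be sketched as follows. This is only an illustration: the text does not specify the distance measure, so a weighted Euclidean distance is assumed here, the features are assumed to be already normalized to comparable scales, and all names and values are hypothetical.

```python
import math

def distance(store_a, store_b, weights):
    """Weighted Euclidean distance over the features named in `weights` (assumed measure)."""
    return math.sqrt(sum(w * (store_a[f] - store_b[f]) ** 2 for f, w in weights.items()))

def match_controls(test_store, candidates, weights, n_controls=2):
    """Return the ids of the n_controls candidates closest to the test store."""
    ranked = sorted(candidates, key=lambda sid: distance(test_store, candidates[sid], weights))
    return ranked[:n_controls]

# A control strategy: which features to match on and how heavily to weight each.
strategy = {"seasonality": 1.0, "slope": 1.0, "volume": 1.0}

# Hypothetical, pre-normalized feature values.
test_store = {"seasonality": 0.20, "slope": 0.01, "volume": 1.0}
candidates = {
    "A": {"seasonality": 0.20, "slope": 0.01, "volume": 1.1},
    "B": {"seasonality": 0.20, "slope": 0.01, "volume": 3.0},
    "C": {"seasonality": 0.25, "slope": 0.01, "volume": 1.0},
}

matched = match_controls(test_store, candidates, strategy)
print(matched)  # ['C', 'A']
```

Changing the keys or weights in `strategy` yields a different control strategy without changing the matching code.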
The test for similarity is related to the strength of the null hypothesis. The null hypothesis asserts that, given that no action is being performed on a set of stores, there is no difference in performance between the test stores and their respective baseline control stores. In practice, the aggregate performance of the set of control stores matched to a test store should be as close as possible to the performance of that test store.
The current method of validating the “goodness” of the measurement strategy is to find the variance of test versus control performance across a clean period in which no action has occurred in either the test stores or the control stores. In effect, a test is simulated in which nothing has actually occurred, in order to measure adherence to the null hypothesis. This involves randomly designating test stores from a clean group to set up tests that occur at random times. Nothing actually happened in these stores; they are arbitrarily and randomly designated as a test group. The measurement approach is then used to measure the change in performance in the designated test group relative to control over the designated time periods. For example, when assessing K different strategies for picking a control group, a simulation program can apply the first strategy for selecting a control group, build the control group, and measure the change in performance in the test group relative to the control group. This is repeated for each control-group strategy being assessed, and the results are recorded. Once that pass is complete, the program starts again, randomly picking a new set of designated test stores and a designated test date for each test store. Again, each control strategy is applied, the change in performance of the test group is measured against the control group, and the results are recorded. This process can repeat several hundred times. The program then aggregates the results for each strategy across the several hundred simulations to assess which approach worked best.
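The null simulation loop described above might look like the following sketch. It makes several assumptions that are not in the text: performance is a per-period sales series per store, change is measured as relative change over fixed pre- and post-windows, and the data are synthetic. The two example strategies and all names are hypothetical.

```python
import random
from statistics import mean

def relative_change(pre, post):
    return mean(post) / mean(pre) - 1.0

def run_null_simulations(sales, strategies, n_runs=200, n_test=5, window=4, seed=0):
    """sales: store id -> per-period sales from a clean period.
    strategies: name -> function(test_id, candidates, sales, start) returning control ids.
    Returns name -> list of measured 'impacts', one per simulation run."""
    rng = random.Random(seed)
    store_ids = sorted(sales)
    n_periods = len(sales[store_ids[0]])
    results = {name: [] for name in strategies}
    for _ in range(n_runs):
        # Arbitrarily designate test stores and a test date for each; nothing really happened.
        test_ids = rng.sample(store_ids, n_test)
        candidates = [s for s in store_ids if s not in test_ids]
        starts = {t: rng.randrange(window, n_periods - window + 1) for t in test_ids}
        # Apply every control strategy to the same designated test group and dates.
        for name, pick_controls in strategies.items():
            diffs = []
            for t in test_ids:
                start = starts[t]
                pre, post = slice(start - window, start), slice(start, start + window)
                controls = pick_controls(t, candidates, sales, start)
                test_chg = relative_change(sales[t][pre], sales[t][post])
                ctrl_chg = mean(relative_change(sales[c][pre], sales[c][post]) for c in controls)
                diffs.append(test_chg - ctrl_chg)
            results[name].append(mean(diffs))
    return results

def all_candidates(test_id, candidates, sales, start):
    """Baseline strategy: use every candidate store as a control."""
    return candidates

def closest_volume(test_id, candidates, sales, start, k=3, window=4):
    """Pick the k candidates whose pre-period average sales are nearest the test store's."""
    pre = slice(start - window, start)
    target = mean(sales[test_id][pre])
    return sorted(candidates, key=lambda c: abs(mean(sales[c][pre]) - target))[:k]

# Synthetic clean-period data: 20 stores, 24 periods, different volumes plus noise.
data_rng = random.Random(1)
sales = {f"store{i}": [100.0 + 10 * i + data_rng.gauss(0, 5) for _ in range(24)]
         for i in range(20)}

results = run_null_simulations(
    sales, {"all_candidates": all_candidates, "closest_volume": closest_volume}, n_runs=50)
for name, impacts in results.items():
    print(name, f"mean measured impact = {mean(impacts):+.4f}")
```

Because no program was applied, any nonzero measured impact reflects error in the measurement approach rather than a real effect.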
The simulation program can evaluate the accuracy of each possible measurement approach along several dimensions. The first is how close to 0 the measured impact was across the simulations. Because this is a null test, the approach should have measured no change in the designated test group relative to the designated control group. Therefore, the closer a measurement approach's result is to 0, the more accurate the measurement strategy. Another metric is average pre-period noise. Pre-period noise for a given simulation run is a measure of how closely the model predicts the variation of the test stores in the pre-period. Strong control strategies can accurately predict each point of the test stores in the pre-period. Although some features are biased to perform better on pre-period noise, and relying on this metric can lead to over-fitting, it can provide a general indicator of how well the control strategy is performing.
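The two metrics can be sketched as follows. The text does not give formulas, so both definitions are assumptions: closeness to 0 is taken as the absolute value of the average measured impact across runs, and pre-period noise is taken as the root-mean-square relative error when the control series, rescaled to the test store's level, is used to predict each pre-period point. All names and numbers are hypothetical.

```python
import math
from statistics import mean

def closeness_to_zero(measured_impacts):
    """Absolute value of the average measured impact across null simulations."""
    return abs(mean(measured_impacts))

def pre_period_noise(test_pre, control_pre):
    """RMS relative error of the scaled control series as a predictor of the test series.
    Assumes every test-store value is nonzero."""
    scale = mean(test_pre) / mean(control_pre)
    errors = [(t - scale * c) / t for t, c in zip(test_pre, control_pre)]
    return math.sqrt(mean(e * e for e in errors))

# Hypothetical per-run impacts recorded for two control strategies.
recorded = {
    "volume_only": [0.03, 0.01, 0.02],
    "volume_and_slope": [0.01, -0.01, 0.0],
}
best = min(recorded, key=lambda name: closeness_to_zero(recorded[name]))
print(best)  # volume_and_slope

# A control series that tracks the test store point by point has zero noise;
# a flat control series does not.
tracking = pre_period_noise([100.0, 110.0, 90.0], [200.0, 220.0, 180.0])
flat = pre_period_noise([100.0, 110.0, 90.0], [200.0, 200.0, 200.0])
print(tracking, flat)
```

An average near 0 can hide large errors of opposite sign, so in practice the spread of the per-run results would likely be examined as well.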
The two main challenges of using simulations to search for and validate control strategies are that doing so involves large volumes of data and that it requires a large number of computations, since the space of control strategies is extremely large. Financial data needs to be extracted over long timeframes, multiple times, for many stores. Processing the data to determine test versus control and pre-period versus post-period performance is computationally intensive. Simulations to evaluate a single control strategy can take from several hours to days to complete. A control strategy can be any combination of matching criteria that is consistent with business intuition, and each matching criterion in a control strategy can be weighted differently, which creates even more control strategies.
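To give a sense of how quickly the strategy space grows, the following sketch enumerates every non-empty combination of matching criteria under every assignment of a small set of weights. The five criteria and three weight levels are hypothetical choices made only for illustration.

```python
from itertools import combinations, product

def enumerate_strategies(criteria, weight_levels):
    """Yield each non-empty subset of criteria with each assignment of weight levels."""
    for size in range(1, len(criteria) + 1):
        for subset in combinations(criteria, size):
            for weights in product(weight_levels, repeat=size):
                yield dict(zip(subset, weights))

criteria = ["seasonality", "slope", "volume", "demographics", "location"]
weight_levels = [1, 2, 3]

n_strategies = sum(1 for _ in enumerate_strategies(criteria, weight_levels))
print(n_strategies)  # 1023, i.e. (1 + 3)**5 - 1
```

Some of these weightings are equivalent up to rescaling, so the count is an upper bound for this grid, but even five criteria and three weight levels yield on the order of a thousand strategies, each of which would need its own set of simulations.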