Monday, November 9, 2015

A/B Testing: An Introduction

Overview:
A/B testing, at a high level, compares two versions of a web page. The change under test could be minor, such as a different button color, or fairly significant, such as allowing free access to the resources of a content-based online company. In an A/B test there are always two variants: one without the change and one with it. These two variants of the web page are called, in A/B testing terms, A and B. The basic idea behind A/B testing is to show both variants to similar (but not the same) audiences and see which version of the page gives better results.

Figure 1 [1]
The expected results can vary depending on the business objective behind the A/B test. For example, the objective could simply be to improve the click-through probability for a certain button on the web page. Alternatively, it could be to increase revenue by getting more people to buy a given service or product.
In an A/B test, the traffic is divided into two parts: a control group and an experiment group. The control group sees the page without the change, and the experiment group sees the page with the change. How much traffic to assign to the control vs. the experiment group depends on the business objectives and other considerations.
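As a side note, group assignment is usually done deterministically by hashing the unit of diversion (for example, a cookie ID), so that a returning visitor always sees the same variant. Here is a minimal sketch in Python; the experiment name and the 50/50 split are hypothetical:

```python
import hashlib

def assign_group(cookie_id: str, experiment_name: str,
                 control_fraction: float = 0.5) -> str:
    """Deterministically assign a visitor to 'control' or 'experiment'.

    Hashing the cookie id together with the experiment name means a
    returning visitor always lands in the same group for this test.
    """
    digest = hashlib.md5(f"{experiment_name}:{cookie_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # uniform value in [0, 1)
    return "control" if bucket < control_fraction else "experiment"

print(assign_group("cookie-12345", "start-now-button-color"))
```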
Though it looks straightforward, it requires a good understanding of the mathematical and statistical concepts behind conducting an A/B test and analyzing its results in order to make the correct recommendation.


Example:


Figure 2 [1]




Let's unpack this a little more to see the main design considerations behind conducting an A/B test.

How to Design an A/B Test?
It is very important to design an A/B test correctly in order to avoid confusion or incorrect interpretation at later stages. At a high level, the following are the main design components of an A/B test:
  • Generate a Hypothesis
  • Metric Choice
  • Sizing
  • Duration
  • Sanity Checks
  • Result Analysis
  • Making a decision or recommendation.

Let’s try to understand each of these design phases.

Generate a Hypothesis

The first step in conducting an A/B test is to understand what you want to test and to formulate a hypothesis. This hypothesis is then used to decide whether the test gave the expected results.
For example, an online education company tested a change where, if a student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that these courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.
The hypothesis for this test is that the change might set clearer expectations for students upfront, reducing the number of frustrated students who leave the free trial because they don't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, the company could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
Metric Choice:
The most important part of any A/B test is to identify the correct evaluation and invariant metrics.
An evaluation metric captures the parameters that are expected to change between the control and the experiment group. Invariant metrics, used for sanity checking (explained in a later section), capture the parameters that are not expected to change between the two groups. For example, consider a change to the color of a "Start Now" button, where the business objective is to make more users click it. In this case, the evaluation metric can be the click-through probability, i.e., total clicks on "Start Now" divided by the total number of page views. Since, per the hypothesis, changing the color of the button should have a significant impact on users clicking it, this metric is a good way to measure the change.
On the other hand, an invariant metric in this experiment could be total page views. Since total page views are not affected by the color of the "Start Now" button, this metric should not differ between the control and experiment groups. In other words, how many users arrive on the page (where the "Start Now" button is located) is not affected by the choice of color; clicking on the button, however, might be, as mentioned earlier.
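To make the evaluation metric concrete, here is a minimal sketch that computes click-through probability for each group; the counts are made up for illustration:

```python
# Hypothetical counts; in practice these come from your logging pipeline.
control    = {"pageviews": 10_072, "clicks": 974}
experiment = {"pageviews": 9_886,  "clicks": 1_105}

def click_through_probability(group: dict) -> float:
    """Total clicks on 'Start Now' divided by total page views."""
    return group["clicks"] / group["pageviews"]

print(f"control CTP:    {click_through_probability(control):.4f}")
print(f"experiment CTP: {click_through_probability(experiment):.4f}")
```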
Choosing the right evaluation and invariant metrics is the most crucial part of A/B testing, more crucial than the actual change itself. Depending on the business objective, there are several choices for these metrics. Let's consider an example and see which choices are available.
For the free-trial example from the previous section, we can choose the following metrics:
  • Invariant Metrics: Number of Cookies, Number of Clicks
  • Evaluation Metrics: Gross Conversion, Retention, Net Conversion

Let's see why we chose these metrics out of the following candidates:

Number of Cookies:
This is the number of unique cookies that visit the course overview page.
Since the unit of diversion is the cookie, and the number of cookies is not affected by a change that appears at enrollment time, the number of cookies is well suited as an invariant metric.

Number of Clicks:
This is the number of unique cookies that click the "Start Free Trial" button.
Since the page asking how many hours the student can devote to the course appears only after clicking the "Start Free Trial" button, the course overview page remains the same for both the control and experiment groups, so the number of clicks also works as an invariant metric.

Number of user-ids:

As per the experiment, the new pop-up message is likely to affect the total number of user-ids who enroll in the program. For this reason, this metric cannot be used as an invariant metric, because it is likely to differ between the control and experiment groups.

Click-through Probability:
That is, the number of unique cookies that click the "Start Free Trial" button divided by the number of unique cookies that view the course overview page.
Since the page asking for the student's available hours appears only after clicking the "Start Free Trial" button, the click-through probability should remain the same for both the control and experiment groups. Therefore, it can be chosen as an invariant metric.
Gross Conversion:
Number of users who enrolled in the free trial / Number of users who clicked the "Start Free Trial" button
After clicking the "Start Free Trial" button, a pop-up appears for experiment-group users asking how much time the student can devote to the course. Based on the answer, it suggests either enrolling in the course or continuing with the free course material. In the experiment group, a user can make a decision based on the pop-up message and choose to keep exploring the material for free; in the control group, no pop-up appears, so the user enrolls in the course regardless. Hence, gross conversion can differ between the control and experiment groups and can be used as an evaluation metric.
Retention:
Number of user-ids that remain enrolled past the 14-day trial period and make their first payment / Number of users who enrolled in the free trial
Retention can also be a good evaluation metric: if the experiment's assumption holds, retention in the experiment group is expected to be higher because of lower enrollment. After seeing the message, fewer users should enroll, and retention should rise because the message filters out users who would otherwise leave the course frustrated. Therefore this ratio can be used as an evaluation metric, since it should differ between the control and experiment groups, and for the same reason it cannot be chosen as an invariant metric.

Net Conversion:
Number of user-ids that remain enrolled past the 14-day trial period and make their first payment / Number of users who clicked the "Start Free Trial" button

As per the intention of the experiment, experiment-group users are made aware, via the pop-up message at enrollment time, that the course requires a minimum number of hours each week. This message should filter out users who cannot devote the required hours and are prone to frustration later on. If the experiment's assumption holds true, this ratio should differ between the control and experiment groups. Hence it can be used as an evaluation metric, and for the same reason it cannot be chosen as an invariant metric.

Gross conversion will show us whether we lower our costs by introducing the new pop-up, and net conversion will show how the change affects our revenue. After the experiment, we expect gross conversion to show a practically significant decrease, and net conversion to show no statistically significant decrease.
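To keep the definitions straight, here is a minimal sketch of the three evaluation metrics as functions. The counts at the bottom are made-up numbers, chosen so the ratios match the baseline rates used in the sizing section below:

```python
def gross_conversion(enrollments: int, clicks: int) -> float:
    """Enrollments in the free trial / clicks on 'Start Free Trial'."""
    return enrollments / clicks

def retention(payments: int, enrollments: int) -> float:
    """User-ids who stay past the 14-day trial and pay / enrollments."""
    return payments / enrollments

def net_conversion(payments: int, clicks: int) -> float:
    """User-ids who stay past the 14-day trial and pay / clicks."""
    return payments / clicks

# Hypothetical totals for one group:
clicks, enrollments, payments = 3200, 660, 350
print(gross_conversion(enrollments, clicks))  # 0.20625
print(retention(payments, enrollments))       # ~0.5303
print(net_conversion(payments, clicks))       # ~0.1094
```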

Number of Samples vs. Power:

The next step is to size the experiment: given a set of baseline probabilities (historical values), work out how many samples, and therefore how many days, are required to run it. Though there are numerous ways to calculate this, depending on the business objectives, I have shown one example for the scenario above using an online calculator.
Using the online calculator, we calculated the number of samples required as follows:
  • Gross conversion (probability of enrolling, given click): 20.625% base conversion rate, 1% minimum detectable effect → 25,835 samples needed
  • Retention (probability of payment, given enroll): 53% base conversion rate, 1% minimum detectable effect → 39,115 samples needed
  • Net conversion (probability of payment, given click): 10.93125% base conversion rate, 0.75% minimum detectable effect → 27,413 samples needed
However, the calculated number of samples will differ if your click-through probability is lower, so these numbers need to be adjusted based on the observed probabilities.
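For reference, here is a rough sketch of the underlying calculation using the standard normal-approximation formula for a two-proportion test. Online calculators use slightly different variants (and different default power), so the results will not match the numbers above exactly:

```python
from scipy.stats import norm

def samples_per_group(p_baseline: float, d_min: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per group for a two-proportion z-test."""
    p_alt = p_baseline + d_min
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_alt * (1 - p_alt)
    return int((z_alpha + z_beta) ** 2 * variance / d_min ** 2) + 1

print(samples_per_group(0.20625, 0.01))      # gross conversion
print(samples_per_group(0.53, 0.01))         # retention
print(samples_per_group(0.1093125, 0.0075))  # net conversion
```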

Duration vs. Exposure

Once you know how many samples are needed, you can estimate how many days the experiment will take to finish. Next, you need to decide what percentage of traffic to put in the control vs. the experiment group. This could vary from a 50/50 split to a 20/80 split, based on the risks and other considerations.
In the above example, the number of days needed to perform the experiment is:
685,325 page views / 20,000 page views per day ≈ 35 days. (These numbers come from the data collected for the experiment and are shown here for demonstration only.)
Thirty-five days is a long period of time, so we may have to rethink our decision. This decision is very subjective and can depend on various factors; in any given situation it is not in the hands of the experiment designer alone. Other groups should be involved, and the decision should be made by mutual consent, in alignment with the business objectives. For now, we can assume that we can run this experiment for 35 days.
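The arithmetic itself is simple enough to script; this sketch just restates the division above (the traffic numbers are the illustrative ones from this section):

```python
import math

pageviews_needed = 685_325  # total page views required by the sizing step
daily_pageviews = 20_000    # traffic available to the test each day
traffic_fraction = 1.0      # share of that traffic diverted to the test

days = math.ceil(pageviews_needed / (daily_pageviews * traffic_fraction))
print(days)  # 35
```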

Experiment Analysis

Sanity Checks
Once you have the results, you need to do a couple of sanity checks. What are sanity checks? Hmm, let's see. Remember the invariant metrics we discussed: sanity checks ensure that nothing went wrong in your experiment; they are a kind of safety valve in A/B testing. For example, in a given A/B test the click counts could be wrong because of a JavaScript bug, and if that sort of problem goes undetected, the results are of no use.
It is therefore necessary to perform the sanity checks by analyzing the invariant metrics before moving on to analyze the evaluation metrics. Invariant metrics, by definition, should not differ between the control and experiment groups. I will not get into the mathematics of calculating confidence intervals for the invariant metrics and checking whether the observed values fall within range.
If you want to see the details, you can refer to the detailed report on my GitHub profile.
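For illustration, here is a minimal sketch of one such sanity check: under a 50/50 diversion, the number of cookies in the control group should be a binomial draw with p = 0.5, so we can build a confidence interval around 0.5 and check whether the observed split falls inside it. The cookie counts below are hypothetical:

```python
from scipy.stats import norm

def split_is_sane(count_control: int, count_experiment: int,
                  expected_fraction: float = 0.5,
                  alpha: float = 0.05) -> bool:
    """Check whether an invariant count splits between groups as expected.

    Under the binomial model, the observed control fraction should fall
    inside a confidence interval around the expected fraction.
    """
    total = count_control + count_experiment
    se = (expected_fraction * (1 - expected_fraction) / total) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    lower = expected_fraction - z * se
    upper = expected_fraction + z * se
    return lower <= count_control / total <= upper

# Hypothetical cookie counts from a 50/50 diversion:
print(split_is_sane(345_543, 344_660))  # True -> the split looks healthy
```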

Result Analysis
Last but not least is the result analysis step. In this step we analyze the evaluation metrics to see whether we got the expected results. The observed results have to be both statistically and practically significant for us to be confident that the experiment had the expected outcome.
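As a sketch of what this looks like in practice, the snippet below builds a confidence interval for the difference in a proportion metric (such as gross conversion) between the two groups, using a pooled standard error. The counts are hypothetical; see the full report for the real data:

```python
from scipy.stats import norm

def difference_ci(x_cont: int, n_cont: int, x_exp: int, n_exp: int,
                  alpha: float = 0.05) -> tuple:
    """Confidence interval for the difference in proportions
    (experiment minus control), using a pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = (p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp)) ** 0.5
    d_hat = x_exp / n_exp - x_cont / n_cont
    z = norm.ppf(1 - alpha / 2)
    return d_hat - z * se, d_hat + z * se

# Hypothetical gross-conversion counts (enrollments, clicks per group):
lower, upper = difference_ci(3785, 17_293, 3423, 17_260)
d_min = 0.01  # practical significance boundary for gross conversion
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
# The change is statistically significant if the CI excludes 0, and
# practically significant if the CI also lies beyond the d_min boundary.
```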


Recommendation
Based on the results, it is time to recommend whether or not to launch the change. If we are considering multiple evaluation metrics, it is very important to make sure that all of them are both statistically and practically significant. In our example:
The gross conversion metric is both statistically and practically significant, which means the change (the pop-up) had the intended impact on the experiment group, i.e., it reduced the number of users enrolling after viewing the message. As expected, it should reduce the number of frustrated users who leave the course midway, and it should allow the coaches to spend more time helping the students who are really likely to complete the course. However, the net conversion results were neither statistically nor practically significant. This means there is a risk that introducing the trial screener may lead to a decrease in revenue, and the company should consider testing other designs before deciding whether to release the feature or abandon the idea entirely.
In this case, the online education company should not launch this change, even though it could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course. This decision is in line with the hypothesis: the change might set clearer expectations for students upfront, but it can impact revenue (the net conversion metric). From the experiment results, we can see that the second part of the hypothesis, "without significantly reducing the number of students who continue past the free trial and eventually complete the course," which pertains to net conversion, does not hold true. Hence, it is not recommended to take the risk and launch the change.




Note: If you want to know more about the calculations and the data used in this article, you can click here to view the complete report for the case study.




[1] https://s3.amazonaws.com/snaprojects/blog/recsysab/1_Diane_RecSys+AB+Workshop+Oct+2014+--+Shared+with+External+Organizers.pdf
