A/B Testing
Overview:
A/B testing, at a high level, compares two versions of a web page. The change under test could be minor, such as a different button color, or more significant, such as allowing free access to the resources of a content-based online company. An A/B test always has two variants: one without the change and one with it. These two variants of the web page are what the terms A and B refer to. The basic idea behind A/B testing is to show each variant to a similar (but not the same) audience and see which version of the page gives better results.
Figure 1 [1]
The expected results vary with the business objective for conducting the A/B test. For example, the objective could simply be to improve the click-through probability of a certain button on the page. In another case, the objective could be to increase revenue by getting more people to buy a service or product.
In an A/B test, the traffic is divided into two parts: a control group and an experiment group. The control group sees the page without the change, and the experiment group sees the page with the change. How much traffic to assign to the control versus the experiment group depends on business objectives and other considerations.
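As a rough illustration of how this diversion can work in practice, traffic is often split deterministically by hashing the visitor's cookie, so the same visitor always sees the same variant. The Python sketch below is a minimal example under that assumption; the function name and the 50/50 split are illustrative, not part of any particular case study.

```python
import hashlib

def assign_group(cookie_id: str, experiment_fraction: float = 0.5) -> str:
    """Deterministically assign a visitor (keyed by cookie) to a group.

    Hashing the cookie id keeps the assignment stable across page loads,
    so a returning visitor always sees the same variant.
    """
    digest = hashlib.md5(cookie_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "experiment" if bucket < experiment_fraction else "control"

print(assign_group("cookie-abc-123"))  # same output every time for this id
```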
Though it looks straightforward, it requires a good understanding of the mathematical and statistical concepts behind conducting an A/B test and analyzing the results in order to make the correct recommendation.
Figure 2 [1]
How to Design an A/B Test?
It is very important to design the A/B test correctly in order to avoid confusion or incorrect interpretation at later stages. At a high level, the main design components of an A/B test are:
- Generate a Hypothesis
- Metric Choice
- Sizing
- Duration
- Sanity Checks
- Result Analysis
- Making a decision or recommendation.
Let’s try to understand each of these design phases.
Generate a Hypothesis
The first step in conducting an A/B test is to understand what you want to test and formulate a hypothesis. This hypothesis is then used to decide whether the test gave the expected results.
For example: an online education company tested a change in which, if a student clicked "Start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that these courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.
The hypothesis for this test can be that the change might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, the company could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
Metric Choice:
The most important part of any A/B test is to identify the correct evaluation and invariant metrics.
Evaluation metrics are the parameters that are expected to change between the control and the experiment group. Invariant metrics, used for sanity checking (explained in a later section), are the parameters that are not expected to change between the control group and the experiment group.
For example, consider a change in the color of a "Start Now" button, where the business objective is to make more users click it. Here, the evaluation metric can be the click-through probability, i.e., total clicks on "Start Now" divided by the total number of page views. Since, per the hypothesis, changing the color of the button should have a significant impact on users clicking it, this metric is a good way to measure the change.
On the other hand, an invariant metric in this experiment could be total page views. Since total page views are not affected by the color of the "Start Now" button, this metric should not differ between the control and experiment groups. In other words, how many users arrive on the page (where the "Start Now" button is located) is not affected by the choice of color; whether they click the button, however, might be.
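To make the metric concrete, here is a minimal Python sketch computing the click-through probability for each group; the counts are made-up numbers for illustration only.

```python
# Click-through probability = unique clicks / unique page views, per group.
# These counts are hypothetical, purely to show the calculation.
control = {"pageviews": 10_000, "clicks": 820}
experiment = {"pageviews": 10_000, "clicks": 910}

ctp_control = control["clicks"] / control["pageviews"]
ctp_experiment = experiment["clicks"] / experiment["pageviews"]

print(f"CTP control:    {ctp_control:.4f}")     # 0.0820
print(f"CTP experiment: {ctp_experiment:.4f}")  # 0.0910
print(f"Observed difference: {ctp_experiment - ctp_control:.4f}")
```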
Choosing the right evaluation and invariant metrics is the most crucial part of A/B testing, more crucial than the actual change itself. Depending on the business objective, there are several choices for these metrics. In the free-trial example from the previous section, we can choose the following metrics:
- Invariant Metrics: Number of Cookies, Number of Clicks
- Evaluation Metrics: Gross Conversion, Retention, Net Conversion
Let's see why we chose these metrics from the following choices:
Number of Cookies:
The number of unique cookies to visit the course overview page.
Since the unit of diversion is a cookie, and the number of cookies is not going to be affected by a change that appears at the time of enrollment, the number of cookies is well suited as an invariant metric.
Number of Clicks:
The number of unique cookies to click the "Start free trial" button.
Since the page asking for the number of hours the student can devote to the course appears only after clicking the "Start free trial" button, the course overview page remains the same for both the control and experiment groups, and this count should not change either.
Number of user-ids:
As per the experiment, the new pop-up message is likely to affect the total number of user-ids who enroll in the program. For this reason, this metric cannot be used as an invariant metric, because it is likely to differ between the control and experiment groups.
Click-through probability:
That is, the number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page.
Since the page asking for the number of hours the student can devote to the course appears only after clicking the "Start free trial" button, the click-through probability should remain the same for both the control and experiment groups. Therefore, it can be chosen as an invariant metric.
Gross Conversion:
The number of users who enrolled in the free trial divided by the number of users who clicked the "Start free trial" button.
After clicking the "Start free trial" button, a pop-up appears for experiment-group users asking for the amount of time the student can devote to the course. Based on the user's choice, it then suggests whether the student should enroll in the course or continue with the free course material. In the experiment group, the user can act on the pop-up message and choose to continue exploring the material only, whereas for the control group no pop-up appears and the user enrolls regardless. Hence, gross conversion can differ between the control and experiment groups and can be used as an evaluation metric.
Retention:
The number of user-ids that remain enrolled past the 14-day trial period and make their first payment, divided by the number of users who enrolled in the free trial.
Retention can also be a good evaluation metric: if the experiment's assumption holds, the pop-up message should lead to fewer enrollments, filtering out those users likely to leave the course frustrated, so retention in the experiment group is expected to be higher. Therefore this ratio can be used as an evaluation metric, because it should differ between the control and experiment groups, and for the same reason it cannot be chosen as an invariant metric.
Net Conversion:
The number of user-ids that remain enrolled past the 14-day trial period and make their first payment, divided by the number of users who clicked the "Start free trial" button.
Per the intention and assumption of the experiment, experiment-group users are made aware at the time of enrollment, via the pop-up message, that the course requires a minimum number of hours each week. This message should filter out those users who cannot devote the required hours and are prone to frustration later on. If the experiment's assumptions hold true, this ratio should differ between control-group and experiment-group users. Hence it can be used as an evaluation metric, and for the same reason it cannot be chosen as an invariant metric.
Gross conversion will show whether we lower our costs by introducing the new pop-up, while net conversion will show how the change affects our revenue. After the experiment, we expect gross conversion to show a practically significant decrease, and net conversion to show no statistically significant decrease.
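All three evaluation metrics are simple ratios over the enrollment funnel. The sketch below computes them in Python from hypothetical counts; the numbers are placeholders, not the study's data.

```python
# Funnel counts (hypothetical, for illustration only).
clicks = 5_000       # unique cookies that clicked "Start free trial"
enrollments = 1_030  # user-ids that enrolled in the free trial
payments = 550       # user-ids still enrolled after 14 days who paid

gross_conversion = enrollments / clicks  # enrolled / clicked
retention = payments / enrollments       # paid / enrolled
net_conversion = payments / clicks       # paid / clicked

print(f"Gross conversion: {gross_conversion:.4f}")
print(f"Retention:        {retention:.4f}")
print(f"Net conversion:   {net_conversion:.4f}")
```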
Number of Samples vs. Power:
The next step is to size the experiment; that is, given a set of baseline probabilities (historical values), determine how many samples, and therefore how many days, would be required to conduct it. Though there are numerous ways to calculate the required duration depending on business objectives, I have shown one example for the scenario above using an online calculator.
Using the online calculator, we calculated the number of samples required as follows:
- Probability of enrolling, given click: 20.625% base conversion rate, 1% minimum detectable effect. Samples needed: 25,835
- Probability of payment, given enroll: 53% base conversion rate, 1% minimum detectable effect. Samples needed: 39,115
- Probability of payment, given click: 10.93125% base conversion rate, 0.75% minimum detectable effect. Samples needed: 27,413
Note that the number of samples calculated will differ if your click-through probability differs, so these numbers need to be adjusted according to the observed probabilities.
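If you prefer code to an online calculator, the sketch below implements the standard two-proportion sample-size formula; it assumes a 5% significance level and 80% power, which is what such calculators typically default to, and reproduces the figures above for the first and third metrics.

```python
from scipy.stats import norm

def sample_size_per_group(p: float, d_min: float,
                          alpha: float = 0.05, beta: float = 0.2) -> int:
    """Samples per group to detect a change of d_min from baseline rate p,
    at significance level alpha with power 1 - beta."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    p2 = p + d_min
    n = ((z_alpha * (2 * p * (1 - p)) ** 0.5
          + z_beta * (p * (1 - p) + p2 * (1 - p2)) ** 0.5) ** 2) / d_min ** 2
    return int(round(n))

# Probability of enrolling, given click: 20.625% baseline, 1% minimum effect.
print(sample_size_per_group(0.20625, 0.01))      # ~25,835
# Probability of payment, given click: 10.93125% baseline, 0.75% effect.
print(sample_size_per_group(0.1093125, 0.0075))  # ~27,413
```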
Duration vs. Exposure
Once you know how many samples are needed for the experiment, and therefore how many days it will take to finish, you next need to decide what percentage of traffic to put in the control versus the experiment group. This could vary from 50-50% to 20-80%, based on the risks and other considerations.
In the above example, the number of days needed to perform the experiment is 685,325 page views / 20,000 page views per day = 35 days, approximately. (These numbers come from the data collected for the experiment and are shown here for demonstration only.)
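The same arithmetic in code, with the traffic split made explicit; the daily page-view figure is hypothetical, and the fraction of traffic diverted to the experiment directly stretches or shrinks the duration.

```python
import math

pageviews_needed = 685_325  # driven by the largest sample-size requirement
daily_pageviews = 40_000    # hypothetical total site traffic per day
fraction_diverted = 0.5     # share of daily traffic sent to the experiment

days = math.ceil(pageviews_needed / (daily_pageviews * fraction_diverted))
print(f"Estimated duration: {days} days")  # 35 days, as above
```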
Thirty-five days is a long period of time, so we may have to rethink our decision; this is very subjective and can depend on various factors. In any given situation, this decision does not rest with the experiment designer alone: other groups should be involved, and the decision should be taken by mutual consent and should align with the business objectives. For now, we can assume that we can run this experiment for 35 days.
Experiment Analysis
Once you have the results, you need to do a couple of sanity checks. What are sanity checks? Remember the invariant metrics we discussed. Sanity checks ensure that nothing went wrong in your experiment; they are a kind of safety valve in A/B testing. For example, in a given A/B test the click counts could be incorrect due to a JavaScript bug, and if this sort of problem is not identified, the results are of no use.
It is therefore necessary to do the sanity checks by analyzing the invariant metrics before moving on to analyze the evaluation metrics. Invariant metrics, by definition, should not differ between the control and experiment groups. I will not get into the mathematics of calculating the confidence intervals of the invariant metrics and checking whether the observed values are within range; if you want the details, you can refer to the full report on my GitHub profile.
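For completeness, here is a minimal sketch of one such sanity check: for a count metric that should split 50/50 between the groups (such as the number of cookies), we check whether the observed control fraction falls inside a binomial confidence interval around 0.5. The counts below are example values, not the study's data.

```python
from scipy.stats import norm

def sanity_check(count_control: int, count_experiment: int,
                 expected: float = 0.5, alpha: float = 0.05):
    """Pass if the observed control fraction of an invariant count metric
    lies inside the (1 - alpha) confidence interval around `expected`."""
    total = count_control + count_experiment
    se = (expected * (1 - expected) / total) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    lower, upper = expected - z * se, expected + z * se
    observed = count_control / total
    return observed, (lower, upper), lower <= observed <= upper

obs, (lo, hi), ok = sanity_check(345_543, 344_660)  # example cookie counts
print(f"observed {obs:.4f}, CI ({lo:.4f}, {hi:.4f}), pass: {ok}")
```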
Last but not least is the result analysis step. In this step we analyze the evaluation metrics to see whether we got the expected results. The observed results have to be both statistically and practically significant in order for us to be confident that the experiment had the expected outcomes.
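As a sketch of what this analysis looks like in code, the snippet below builds a confidence interval for the difference in proportions using the pooled standard error, then checks it against zero (statistical significance) and against a practical-significance threshold d_min; the counts are example values in the spirit of the case study, not the actual results.

```python
from scipy.stats import norm

def diff_ci(x_cont: int, n_cont: int, x_exp: int, n_exp: int, alpha=0.05):
    """Difference in proportions (experiment - control) with a
    (1 - alpha) confidence interval based on the pooled standard error."""
    p_cont, p_exp = x_cont / n_cont, x_exp / n_exp
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = (p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp)) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    d = p_exp - p_cont
    return d, (d - z * se, d + z * se)

# Example gross-conversion counts: enrollments over clicks, per group.
d, (lo, hi) = diff_ci(x_cont=3_785, n_cont=17_293, x_exp=3_423, n_exp=17_260)
d_min = 0.01  # practical-significance threshold chosen up front

statistically_significant = not (lo <= 0 <= hi)      # CI excludes zero
practically_significant = hi < -d_min or lo > d_min  # CI clears +/- d_min
print(f"d = {d:.4f}, CI = ({lo:.4f}, {hi:.4f})")
print(f"statistical: {statistically_significant}, practical: {practically_significant}")
```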
Based on the results, it is time to recommend whether or not to launch the change. If we are considering multiple evaluation metrics, it is very important to make sure that all of them are both statistically and practically significant. In our example:
The gross conversion evaluation metric is both statistically and practically significant, which means the change (the pop-up) had the intended impact on the experiment group, i.e., it reduced the number of users enrolling after viewing the message. As expected, it should reduce the number of frustrated users who leave the course in the middle, and should also allow the coaches to spend more time helping those students who are really likely to complete the course. However, the net conversion results were neither statistically nor practically significant. This means there is a risk that the introduction of the trial screener may lead to a decrease in revenue, and the company should consider testing other designs before deciding whether to release the feature or abandon the idea entirely.
In this case, the online education company should not launch this change, even though it could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course. This decision is in line with the hypothesis: the change might set clearer expectations for students upfront, but it can impact revenue (the net conversion metric). From the experiment results, we can see that the second part of the hypothesis, "without significantly reducing the number of students who continue past the free trial and eventually complete the course", which pertains to the net conversion metric, does not hold true. Hence, it is not recommended to take the risk and launch the change.
Note: if you want to know more about the calculations and the data used in this article, you can view the complete report for the case study.
[1] https://s3.amazonaws.com/snaprojects/blog/recsysab/1_Diane_RecSys+AB+Workshop+Oct+2014+--+Shared+with+External+Organizers.pdf