
Emily Glassberg Sands

Head of Data Science @Coursera, Harvard Econ PhD

Mar 25, 2016

5 Tricks When AB Testing Is Off The Table


An applied introduction to causal inference in tech


This article was co-authored with Duncan Gilchrist. Sample code, along
with basic simulation results, is available on GitHub.

At least a dozen times a day we ask: Does X drive Y? Y is generally
some KPI our companies care about. X is some product, feature, or
initiative.

The first and most intuitive place we look is the raw correlation. Are
users engaging with X more likely to have outcome Y? Unfortunately,
raw correlations alone are rarely actionable. The complicating factor
here is a set of other features that might affect both X and Y. Economists
call these confounding variables.

In ed-tech, for example, we want to spend our energy building products
and features that help users complete courses. Does our mobile app
meet that criterion? Certainly the raw correlation is there: users who
engage more with the app are on average more likely to complete. But
there are also important confounders at play. For one, users with
stronger innate interest in the product are more likely to be multi-
device users; they are also more likely to make the investments
required to complete the course. So how can we estimate the causal
effect of the app itself on completion?

The knee-jerk reaction in tech is to get at causal relationships by
running randomized experiments, commonly referred to as AB tests.
AB testing is powerful stuff: by randomly assigning some users and not
others an experience, we can precisely estimate the causal impact of
the experience on the outcome (or set of outcomes) we care about. It's
no wonder that for many experiences (different copy or color, a new
email campaign, an adjustment to the onboarding flow) AB testing is
the gold standard.

For some key experiences, though, AB tests can be costly to implement.
Consider rolling back your mobile app's full functionality from a
random subset of users. It would be confusing for your users, and a hit
to your business. On sensitive product dimensions (pricing, for
example), AB testing can also hurt user trust. And if tests are perceived
as unethical, you might be risking a full-on PR disaster.

Here's the good news: just because we can't always AB test a major
experience doesn't mean we have to fly blind when it matters most. A
range of econometric methods can illuminate the causal relationships
at play, providing actionable insights for the path forward.

Econometric Methods

First, a quick recap of the challenge: we want to know the effect of X on
Y, but there exists some set of confounding factors, C, that affects both
the input of interest, X, and the outcome of interest, Y. In Stats 101, you
might have called this omitted variable bias.

The solution is a toolkit of five econometric methods we can apply to
get around the confounding factors and credibly estimate the causal
relationship:

1. Controlled Regression

2. Regression Discontinuity Design

3. Difference-in-Differences

4. Fixed-Effects Regression


5. Instrumental Variables
(with or without a Randomized Encouragement Trial)

This post will be applied and succinct. For each method, we'll open
with a high-level overview, run through one or two applications in tech,
and highlight major underlying assumptions or common pitfalls.

Some of these tools will work better in certain situations than others.
Our goal is to get you the baseline knowledge you need to identify
which method to use for the questions that matter to you, and to
implement effectively.

Let's get started.

Method 1: Controlled Regression

The idea behind controlled regression is that we might control directly
for the confounding variables in a regression of Y on X. The statistical
requirement for this to work is that the distribution of potential
outcomes, Y, should be conditionally independent of the treatment, X,
given the confounders, C.

Let's say we want to know the impact of some existing product feature
(e.g., live chat support) on an outcome, product sales. The why-we-
care is hopefully self-evident: if the impact of live chat is large enough
to cover the costs, we want to expand live chat support to boost profits;
if it's small, we're unlikely to expand the feature, and may even
deprecate it altogether to save costs.

We can easily see a positive correlation between use of chat support
and user-level sales. We also probably have some intuition around
likely confounders. For example, younger users are more likely to use
chat because they are more comfortable with the chat technology;
they also buy more because they have more slush money.

Since youth is positively correlated with chat usage and sales, the raw
correlation between chat usage and sales would overstate the causal
relationship. But we can make progress by estimating a regression of
sales on chat usage controlling for age. In R:

fit <- lm(Y ~ X + C, data = ...)
summary(fit)

The primary (and admittedly frequent) pitfall in controlled
regression is that we often do not have the full set of confounders we'd
want to control for. This is especially true when confounders are
unobservable, either because the features are measurable in theory
but are not available to the econometrician (e.g., household income),
or because the features themselves are hard to quantify (e.g., inherent
interest in the product).

In this case, often the best we can do within the controlled regression
context is to proxy for unobservables with whatever data we do have. If
we see that adding the proxy to the regression meaningfully impacts
the coefficient on the primary regressor of interest, X, there's a good
chance regression won't suffice. We need another tool.
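
In code, that check is just a comparison of two fits. A minimal sketch, with hypothetical names: df is the data frame, C the observed confounders, and C_proxy whatever stand-in we have for the unobservables:

# Compare the coefficient on X before and after adding the proxy.
fit_no_proxy   <- lm(Y ~ X + C, data = df)
fit_with_proxy <- lm(Y ~ X + C + C_proxy, data = df)

coef(fit_no_proxy)["X"]
coef(fit_with_proxy)["X"]  # a big shift suggests meaningful residual confounding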

But before moving on, let's briefly cover the concept of bad controls, or
why-we-shouldn't-just-throw-in-the-kitchen-sink-and-see-what-sticks
(statistically). Suppose we were concerned about general interest in the
product as a confounder: the more interested the user is, the more she
engages with our features, including live chat, and also the more she
buys from us. We might think that controlling for attributes like the
proportion of emails from us she opens could be used as a proxy for
interest. But insofar as the treatment (engaging in live chat) could in
itself impact this feature (e.g., because she wants to see follow-up
responses from the agent), we would actually be inducing included
variable bias. The take-away? Be wary of controlling for variables that
are themselves not fixed at the time the treatment was determined.
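
A quick simulation makes the danger concrete. The data below are entirely made up: C is the unobserved interest, X the treatment, M a post-treatment variable like email opens, and the true effect of X on Y is 1.

# Simulated illustration of included variable bias from a bad control.
set.seed(42)
n <- 1e5
C <- rnorm(n)                       # unobserved interest in the product
X <- 0.5 * C + rnorm(n)             # treatment (e.g., live chat usage), driven by interest
M <- 2.0 * X + 0.5 * C + rnorm(n)   # post-treatment proxy (e.g., email opens)
Y <- 1.0 * X + 1.0 * C + rnorm(n)   # outcome; true effect of X is 1

coef(lm(Y ~ X))["X"]      # well above 1: omits the confounder
coef(lm(Y ~ X + C))["X"]  # close to 1: the (unobservable in practice) right control
coef(lm(Y ~ X + M))["X"]  # well below 1: conditioning on a post-treatment variable distorts the estimate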

Method 2: Regression Discontinuity Design

Regression discontinuity design, or RDD, is a statistical approach to
causal inference that takes advantage of randomness in the world. In
RDD, we focus in on a cut-off point that, within some narrow range,
can be thought of as a local randomized experiment.

Suppose we want to estimate the effect of passing an ed-tech course on
earned income. Randomly assigning some people to the passing
treatment and failing others would be fundamentally unethical, so AB
testing is out. Since several hard-to-measure things are correlated with
both passing a course and income (innate ability and intrinsic
motivation, to name just two), we also know controlled regression
won't suffice.

Luckily, we have a passing cutoff that creates a natural experiment:
users earning a grade of 70 or above pass, while those with a grade just
below do not. A student who earns a 69 and thus does not pass is
plausibly very similar to a student earning a 70 who does pass.
Provided we have enough users in some narrow band around the
cutoff, we can use this discontinuity to estimate the causal effect of
passing on income. In R:

library(rdd)
RDestimate(Y ~ D, data = ..., subset = ..., cutpoint = ...)

Here's what an RDD might look like in graphical form:
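
To build the picture yourself, here is a minimal simulated sketch (made-up grade and income data, with a true jump of 5000 at the cutoff of 70); plotting the fitted object shows the tell-tale discontinuity:

# Simulated sketch: income jumps at the passing cutoff of 70.
set.seed(1)
n      <- 5000
grade  <- runif(n, 40, 100)
passed <- as.numeric(grade >= 70)
income <- 30000 + 100 * grade + 5000 * passed + rnorm(n, sd = 3000)
sim    <- data.frame(grade, income)

library(rdd)
rd_fit <- RDestimate(income ~ grade, data = sim, cutpoint = 70)
plot(rd_fit)      # binned means with a visible jump at 70
summary(rd_fit)   # LATE estimate should land near the true 5000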


Let's introduce the concepts of validity:

Results are internally valid if they are unbiased for the
subpopulation studied.

Results are externally valid if they are unbiased for the full
population.

In randomized experiments, the assumptions underlying internal and
external validity are rather straightforward. Results are internally valid
provided the randomization was executed correctly and the treatment
and control samples are balanced. And they are externally valid so long
as the impact on the experimental group was representative of the
impact on the overall population.

In non-experimental settings, the underlying assumptions depend on
the method used. Internal validity of RDD inferences rests on two
important assumptions:

Assumption 1: Imprecise control of assignment around the cutoff.
Individuals cannot control whether they are just above (versus just
below) the cutoff.

Assumption 2: No confounding discontinuities. Being just above (versus
just below) the cutoff should not influence other features.

Let's see how well these assumptions hold in our example.

Assumption 1: Users cannot control their grade around the cutoff. If
users could, for example, write in to complain to us for a re-grade or
grade boost, this assumption would be violated. Not sure either way?
Here are a couple of ways to validate:

1. Check whether the mass just below the cutoff is similar to the mass
just above the cutoff; if there is more mass on one side than the
other, individuals may be exerting agency over assignment.

2. Check whether the composition of users in the two buckets looks
otherwise similar along key observable dimensions. Do exogenous
observables predict the bucket a user ends up in better than
randomly? If so, RDD may be invalid. (A rough sketch of both
checks follows below.)
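
Both checks are quick to run. A minimal sketch, assuming a data frame df with the running variable grade and a predetermined covariate like age (hypothetical column names):

library(rdd)

# 1. McCrary density test: is there bunching just above (or below) the cutoff?
DCdensity(df$grade, cutpoint = 70)   # a small p-value suggests manipulation

# 2. Covariate balance: do predetermined observables jump at the cutoff?
summary(RDestimate(age ~ grade, data = df, cutpoint = 70))  # want ~0 and insignificant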

Assumption 2: Passing is the only differentiator between a 69 and a 70.
If, for example, users who get a 70 or above also get reimbursed for
their tuition by their employer, this would generate an income shock
for those users (and not for users with lower scores) and violate the no
confounding discontinuities assumption. The effect we estimate would
then be the combined causal effect of passing the class and getting
tuition reimbursed, and not simply the causal effect of passing the class
alone.

What about external validity? In RDD, we're estimating what is called a
local average treatment effect (LATE). The effect pertains to users in
some narrow, or local, range around the cutoff. If there are different
treatment effects for different types of users (i.e., heterogeneous
treatment effects), then the estimates may not be broadly applicable to
the full group. The good news is the interventions we'd consider would
often occur on the margin (passing marginally more or marginally
fewer learners), so a LATE is often exactly what we want to measure.

Method 3: Difference-in-Differences

The simplest version of difference-in-differences (or DD) is a comparison
of pre and post outcomes between treatment and control groups.

DD is similar in spirit to RDD in that both identify off of existing
variation. But unlike RDD, DD relies on the existence of two groups:
one that is served the treatment (after some cutoff) and one that never
is. Because DD includes a control group in the identification strategy, it
is generally more robust to confounders.

Let's consider a pricing example. We all want to know whether we
should raise or lower price to increase revenue. If price elasticity (in
absolute value) is greater than 1, lowering price will increase purchases
by enough to increase revenue; if it's less than 1, raising prices will
increase revenue. How can we learn where we are on the revenue
curve?
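
To see why the threshold is 1, here is a toy calculation under a hypothetical 10% price cut and two assumed elasticities:

# Toy revenue arithmetic: revenue = price * quantity.
price_change    <- -0.10                           # cut price by 10%
elasticity      <- c(elastic = -1.5, inelastic = -0.5)
quantity_change <- elasticity * price_change       # +15% vs. +5% more purchases
revenue_change  <- (1 + price_change) * (1 + quantity_change) - 1
round(revenue_change, 3)   # positive when |elasticity| > 1, negative when below 1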

The most straightforward method would be a full randomization
through AB testing price. Whether we're comfortable running a pricing
AB test probably depends on the nature of our platform, our stage of
development, and the sensitivity of our users. If variation in price
would be salient to our users, for example because they communicate
with each other on the site or in real life, then price testing is potentially
risky. Serving different users different prices can be perceived as unfair,
even when random. Variation in pricing can also diminish user trust,
create user confusion, and in some cases even result in a negative PR
storm.

A nice alternative to AB testing is to combine a quasi-experimental
design with causal inference methods.

We might consider just changing price and implementing RDD around
the date of the price change. This could allow us to estimate the effect
of the price on the revenue metric of interest, but has some risks. Recall
that RDD assumes there's nothing outside of the price change that
would also change purchasing behavior over the same window. If
there's an internal shock (a new marketing campaign, a new feature
launch) or, worse, an external shock (e.g., a slowing economy), we'll
be left trying to disentangle the relative effects. RDD results risk being
inconclusive at best and misleading at worst.

A safer bet would be to set ourselves up with a DD design. For example,
we might change price in some geos (e.g., states or countries), but not
others. This gives a natural control in the geos where price did not
change. To account for any other site or external changes that might
have co-occurred, we can use the control markets where price did not
change to calculate the counterfactual we would have expected in
our treatment markets absent the price change. By varying at the geo
(versus user) level, we also reduce the salience of the price variation
relative to AB testing.

Below is a graphical representation of DD. You can see treatment and
control geos in the periods pre and post the change. The delta between
the actual revenue (here, per session cookie) in the treatment group
and the counterfactual revenue in that group provides an estimate of
the treatment effect:
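
The arithmetic behind the picture is simple. A toy version with made-up revenue-per-cookie numbers:

# Toy double difference (made-up numbers).
treat_pre    <- 1.00;  treat_post   <- 1.08
control_pre  <- 0.90;  control_post <- 1.02

counterfactual_treat_post <- treat_pre + (control_post - control_pre)    # 1.12
dd_estimate <- (treat_post - treat_pre) - (control_post - control_pre)   # -0.04

The counterfactual applies the control group's change to the treatment group's starting point; the DD estimate is the gap between the actual and counterfactual outcomes in the treatment group.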
While the visualization is helpful (and hopefully intuitive), we can also
implement a regression version of DD. At its essence, the DD regression
has an indicator for treatment status (here, being a treatment market),
an indicator for post-change timing, and the interaction of those two
indicators. The coefficient on the interaction term is the
estimated effect of the price change on revenue. If there are general
time trends, we can control for those in the regression, too. In R:

fit <- lm(Y ~ post + treat + I(post * treat), data = ...)
summary(fit)

The key assumption required for internal validity of the DD estimate is
parallel trends: absent the treatment itself, the treatment and control
markets would have followed the same trends. That is, any omitted
variables affect treatment and control in the same way.

How can we validate the parallel trends assumption? There are a few
ways to make progress, both before and after rolling out the test.

Before rolling out the test, we can do two things:

1. Make the treatment and control groups as similar as possible. In the
experimental set-up, consider implementing stratified
randomization. Although generally unnecessary when samples are
large (e.g., in user-level randomization), stratified randomization
can be valuable when the number of units (here, geos) is relatively
small. Where feasible, we might even generate matched pairs,
in this case markets that historically have followed similar trends
and/or that we intuitively expect to respond similarly to any
internal product changes and to external shocks. In their 1994
paper estimating the effect of minimum wage increases on
employment, David Card and Alan Krueger matched restaurants in
New Jersey with comparable restaurants in Pennsylvania just
across the border; the Pennsylvania restaurants provided a
baseline for determining what would have happened in New
Jersey if the minimum wage had remained constant.

2. After the stratified randomization (or matched pairing), check
graphically and statistically that the trends are approximately
parallel between the two groups pre-treatment. If they aren't, we
should redefine the treatment and control groups; if they are, we
should be good to go. (A rough sketch of this pre-trend check
follows below.)
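
A sketch of that pre-period check, assuming a data frame df with hypothetical columns week, revenue, and a treat indicator, plus a change_week marking the rollout:

# Differential pre-trend test: the week:treat interaction should be ~0.
pre <- subset(df, week < change_week)
summary(lm(revenue ~ week * treat, data = pre))

# And eyeball it: average revenue by week for each group.
pre_means <- aggregate(revenue ~ week + treat, data = pre, FUN = mean)
interaction.plot(pre_means$week, pre_means$treat, pre_means$revenue)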

Ok, so we've designed and rolled out a good experiment, but with
everyone moving fast, stuff inevitably happens. Common problems
with DD often come in two forms:

Problem 1: Confounders pop up in particular treatment or control
markets. Maybe mid-experiment our BD team launches a new
partnership in some market. And then a different product team rolls out
a localized payment processor in some other market. We expect both of
these to affect our revenue metric of interest.

Solution: Assuming we have a bunch of treatment and control markets,
we can simply exclude those markets (and their matches, if it's a
matched design) from the analysis.

Problem 2: Confounders pop up across some subset of treatment and
control markets. Here, there's some change, internal or external,
that we're worried might impact a bunch of our markets, including
some treatment and some control markets. For example, the Euro is
taking a plunge and we think the fluctuating exchange rate in those
markets might bias our results.

Solution: We can add additional differencing by that confounder as a
robustness check, in what's called a difference-in-difference-in-
differences estimation (DDD). DDD will generally be less precise than
DD (i.e., the point estimates will have larger standard errors), but if the
two point estimates themselves are similar, we can be relatively
confident that the confounder is not meaningfully biasing our
estimated effect.
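
A sketch of what that could look like, under the assumption that euro is a hypothetical 0/1 flag for markets exposed to the exchange-rate shock (post, treat, and Y as in the DD regression above):

# Fully interacted triple-difference regression.
fit_ddd <- lm(Y ~ post * treat * euro, data = df)
summary(fit_ddd)
# post:treat       -> the DD estimate among unexposed markets
# post:treat:euro  -> how much that estimate shifts in exposed markets;
#                     small and insignificant is reassuring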

Pricing is an important and complicated beast, probably worthy of
additional discussion. For example, the estimate above may not be the
general equilibrium effect we should expect: in the short run, users
may be responding to the change in price, not just to the new price
itself; but in the long run, users likely respond only to the price itself
(unless prices continue to change). There are several ways to make
progress here. For example, we can estimate the effect only on new
users who had not previously been served a price, and so for whom the
change would not be salient. But we'll leave a more extended
discussion of pricing to a subsequent post.

Method 4: Fixed Effects Regression

Fixed effects is a particular type of controlled regression, and is perhaps
best illustrated by example.

A large body of academic research studies how individual investors
respond (irrationally) to market fluctuations. One metric a fin-tech firm
might care about is the extent to which it is able to convince users to
(rationally) stay the course and not panic during market
downturns.

Understanding what helps users stay the course is challenging. It
requires separating what we cannot control (general market
fluctuations and learning about those from friends or news sources)
from what we can control (the way market movements are
communicated in a user's investment returns). To disentangle, we once
again go hunting for a source of randomness that affects the input we
control, but not the confounding external factors.

Our approach is to run a fixed effects regression of percent portfolio sold
on portfolio return, controlling (with fixed effects) for the week of
account opening. Since the fixed effects capture the week a user opened
their account, the coefficient on portfolio return is the effect of having a
higher return relative to the average return of other users funding
accounts in that same week. Assuming users who opened accounts the
same week acquire similar tidbits from the news and friends, this
allows us to isolate the way we display movements in the user's actual
portfolio from general market trends and coverage.

Sound familiar? Fixed effects regression is similar to RDD in that both
take advantage of the fact that users are distributed quasi-randomly
around some point. In RDD, there is a single point; with fixed effects
regression, there are multiple points (in this case, one for each week
of account opening).

In R:

fit <- lm(Y ~ X + factor(F), data = ...)
summary(fit)

The two assumptions required for internal validity in RDD apply here as
well. First, after conditioning on the fixed effects, users are as good as
randomly assigned to their X values, in this case their portfolio
returns. Second, there can be no confounding discontinuities, i.e.,
conditional on the fixed effects, users cannot otherwise be treated
differently based on their X.

For the fixed effects method to be informative, we of course also need
variation in the X of interest after controlling for the fixed effects. Here,
we're ok: users who opened accounts the same week do not necessarily
have the same portfolio return; markets can rise or fall 1% or more in a
single day. More generally, if there's not adequate variation in X after
controlling for fixed effects, we'll know because the standard errors of
the estimated coefficient on X will be untenably large.
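
One quick diagnostic is to ask how much variation in X survives the fixed effects. A sketch, using the same X and F as in the regression above (df hypothetical):

# Share of the variance in X left over after absorbing the fixed effects.
within_fit <- lm(X ~ factor(F), data = df)
1 - summary(within_fit)$r.squared   # close to 0 means little identifying variation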

Method 5: Instrumental Variables

Instrumental variable (IV) methods are perhaps our favorite tool for
causal inference. Recall our earlier notation: we are trying to estimate
the causal effect of variable X on outcome Y, but cannot take the raw
correlation as causal because there exists some omitted variable(s), C.
An instrumental variable, or instrument for short, is a feature or set of
features, Z, such that both of the following are true:

Strong first stage: Z meaningfully affects X.

Exclusion restriction: Z affects Y only through its effect on X.

Who doesn't love a good picture? The instrument Z moves X, X moves Y,
and the confounders C affect X and Y but not Z.


If these conditions are satisfied, we can proceed in two steps:

First stage: Instrument for X with Z.

Second stage: Estimate the effect of the (instrumented) X on Y.

In R:

library(AER)
fit <- ivreg(Y ~ X | Z, data = df)
summary(fit, vcov = sandwich, df = Inf, diagnostics = TRUE)

Ok, so where do we find these magical instruments?

Economists often find instruments in policies. Josh Angrist and Alan
Krueger instrument for years of schooling with the Vietnam draft
lottery; Steve Levitt instruments for prison populations with prison
overcrowding litigation. Although good instruments in the real world
can generate incredible insights, they are notoriously hard to come by.

The good news is that instruments are everywhere in tech. As long as
your company has an active AB testing culture, you almost certainly
have a plethora of instruments at your fingertips. In fact, any AB test
that drives a specific behavior is a contender for instrumenting ex post
for the effect of that behavior on an outcome you care about.

Suppose we are interested in learning the causal effect of referring a
friend on churn. We see that users who refer friends are less likely to
churn, and hypothesize that getting users to refer more friends will
increase their likelihood of sticking around. (One reason we might
think this is true is what psychologists call the Ikea Effect: users care
more about products that they have invested time contributing to.)

Looking at the correlation of churn with referrals will of course not give
us the causal effect. Users who refer their friends are de facto more
committed to our product.

But if our company has a strong referral program, it's likely been
running lots of AB tests pushing users to refer more: email tests, onsite
banner ad tests, incentives tests, you name it. The IV strategy is to focus
on a successful AB test (one that increased referrals) and use that
experiment's bucketing as an instrument for referring. (If IV sounds a
little like RDD, that's because it is! In fact, IV is sometimes referred to as
Fuzzy RDD.)

IV results are internally valid provided the strong first stage and
exclusion restriction assumptions (above) are satisfied:

We'll likely have a strong first stage as long as the experiment we
chose was successful at driving referrals. (This is important
because if Z is not a strong predictor of X, the resulting second
stage estimate will be biased.) The R code above reports the F-
statistic, so we can check the strength of our first stage directly; a
rough sketch of computing it by hand follows below. A good rule of
thumb is that the F-statistic from the first stage should be at least 11.

What about the exclusion restriction? It's important that the
instrument, Z, affect the outcome, Y, only through its effect on the
endogenous regressor, X. Suppose we are instrumenting with an
AB email test pushing referrals. If the control group received no
email, this assumption isn't valid: the act of getting an email could
in and of itself drive retention. But if the control group received an
otherwise-similar email, just without any mention of the referral
program, then the exclusion restriction likely holds.
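
To inspect the first stage directly, beyond the diagnostics printed by the ivreg summary above, a minimal sketch using the same df, X, and Z:

# First stage: how strongly does the experiment's bucketing (Z) move referrals (X)?
first_stage <- lm(X ~ Z, data = df)
summary(first_stage)$fstatistic   # compare against the rule-of-thumb threshold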

A quick note on statistical significance in IV. You may have noticed that
the R code to print the IV results isn't just a simple call to summary(fit).
That's because we have to be careful about how we compute standard
errors in IV models. In particular, standard errors have to be corrected
to account for the two-stage design, which generally makes them
slightly larger.

Want to estimate the causal effects of X on Y but don't have any good
historical AB tests on hand to leverage? AB tests can also be
implemented specifically to facilitate IV estimation! These AB tests even
come with a sexy name: Randomized Encouragement Trials.

. . .

It's no surprise that we invest in building predictive models to
understand who will do what when. It's always fun to predict the
future.

But it's even more fun to improve that future. Where, when, how, and
with whom can we intervene for better outcomes? By shedding light on
the mechanisms driving the outcomes we care about, causal inference
gives us the insights to focus our efforts on investments that better
serve our users and our business.

Today, we briefly covered a range of methods for causal inference when
AB testing is off the table. We hope these methods will help you
uncover some of the actionable insights that can move your company's
mission forward.

Comments? Suggestions? Reach out at emily@coursera.org and
duncan@wealthfront.com. Just want to do fun stuff with us? We'd love
to hear from you! Plus, we're always hiring.
