Two-Stage Least Squares For A/B Tests

At Twitch, we run a lot of experiments — it’s the best way to confidently establish causation between product features and user behavior. Sometimes, though, we want to understand the causal relationship between two variables where an experiment that randomly assigns the hypothesized independent variable to a subset of our users isn’t feasible. For example, we might want to answer the question:

Does having more friends on Twitch make a user more likely to return to the site?

Seems like a simple question, but the ideal experiment in this case would be randomly assigning some users to have more friends than others — obviously there’s no great experimental design that can accomplish this.

The next best thing is to find a treatment that we can apply that will create some amount of random variation in the independent variable and use that to understand its effect on the dependent variable. Sticking with the friends example, we could try an A/B test where the treatment group gets a prompt to find and add some friends when they land on the homepage. Almost certainly the treatment group in this test will have a higher number of friends compared to the control, and this difference will be uncorrelated with anything else since it is caused by a randomly-assigned treatment. We can then use this random variation to understand how having more friends on Twitch affects retention without the problem of confounding factors.

This is the approach I’ll be walking through in this post. The benefit of this methodology is that, as with any experiment, we are assured of an unbiased causal estimate, but unlike a typical experimental design, we’ll be able to talk directly about the effect of having more friends.

First Stage Regression

If we ran the above experiment, we might fit the following regression; the output values discussed below are hypothetical:

model <- lm(num_friends ~ is_treatment, data=exp_a)
summary(model)

This will look familiar if you’ve ever done any regression modeling in R. If not, here’s the interpretation:

  • The Intercept term is the mean of the outcome variable (in this case, number of friends) among users in the control group. The average user in the control group has 27.4 friends.
  • The coefficient for is_treatment, a binary variable specifying which group the user was assigned to, is the average difference in the outcome variable between treatment and control. The average treatment-group user has 2.1 more friends than the average control-group user.
  • The standard error, t-value, and p-value help us evaluate how likely it would be to observe estimates of the intercept and coefficients this large under the null hypothesis, i.e. when their true value is equal to 0. The very small p-value on our is_treatment coefficient (<.001) means that a difference this large would be very unlikely to arise by chance if the treatment had no real effect on number of friends.
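To make the first two bullets concrete, here is a minimal sketch on simulated data. The data frame sim, the sample size, and the numbers used to generate it are invented for illustration and are not Twitch data; is_treatment is also assumed to be coded 0/1 here rather than as a label.

```r
# Simulated stand-in for the experiment data (all values are invented).
set.seed(1)
n <- 2000
sim <- data.frame(is_treatment = rbinom(n, 1, 0.5))  # assumed 0/1 coding
# Friend counts: Poisson with mean 27 in control and 29 in treatment.
sim$num_friends <- rpois(n, lambda = 27 + 2 * sim$is_treatment)

first_stage <- lm(num_friends ~ is_treatment, data = sim)

# Group means computed directly, for comparison with the coefficients.
control_mean   <- mean(sim$num_friends[sim$is_treatment == 0])
treatment_mean <- mean(sim$num_friends[sim$is_treatment == 1])

coef(first_stage)[["(Intercept)"]]   # matches control_mean
coef(first_stage)[["is_treatment"]]  # matches treatment_mean - control_mean
```

With a single binary predictor, the regression is just a convenient way of computing the two group means and their difference, along with a standard error for that difference.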

This is all great, but what we really want is to understand whether having more friends causes a user to return, in other words, the causal relationship expressed as:

returns_next_day ~ num_friends

Remember though, in this scenario only a small piece of the variance in the number of friends a user has is randomly assigned (that +2.1 difference from the experimental treatment); the rest of the variation is self-selected, and so specifying the model in the form above will not allow us to give a causal interpretation to that relationship. In an OLS model built from data without an experiment, the independent variable is very likely to suffer from omitted variable bias: unobserved factors that influence both the number of friends and the likelihood of returning end up in the model’s error term u, so the independent variable is correlated with u and we cannot get an unbiased estimate of its effect.
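A small simulation can make this bias concrete. The sketch below assumes a hypothetical unobserved variable, engagement, that drives both friend count and return probability; every number in it, including the "true" effect of .002 per friend, is invented for illustration.

```r
# Simulated illustration of omitted variable bias (all values are invented).
set.seed(2)
n <- 20000
true_effect <- 0.002                     # assumed causal effect of one friend
engagement  <- rnorm(n)                  # hypothetical unobserved confounder

# More engaged users self-select into having more friends...
num_friends <- 27 + 5 * engagement + rnorm(n, sd = 3)

# ...and are also more likely to return, independently of their friends.
p_return <- 0.20 + true_effect * num_friends + 0.05 * engagement
p_return <- pmin(pmax(p_return, 0), 1)   # keep probabilities in [0, 1]
returns_next_day <- rbinom(n, 1, p_return)

# Naive model: engagement is omitted, so it ends up in the error term u.
naive <- lm(returns_next_day ~ num_friends)
naive_estimate <- coef(naive)[["num_friends"]]

c(true_effect = true_effect, naive_estimate = naive_estimate)
```

Under these assumptions the naive coefficient comes out several times larger than the effect that was built into the simulation, because num_friends is also acting as a proxy for engagement.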

Introducing two-stage least-squares

One solution to this issue consists of using two-stage least squares in an instrumental variable design. If in our model we replace the actual observed number of friends a user has with a predicted number of friends generated only from a variable that is not correlated with u, we’ll be able to build a much better model:

returns_next_day ~ predicted_num_friends

So why is this model any better? In this equation, the predicted number of friends is independent of any unobserved variables, allowing us to make a causal conclusion about the relationship between number of friends and likelihood of returning. Another way to think about this is that the predicted number of friends captures only the variance in number of friends that was randomly generated through the experimental treatment.

In this type of analysis, we call is_treatment our instrumental variable, hence the name of the technique. The general requirements of an instrumental variable are that it is correlated with the independent variable of interest (relevance), that it is uncorrelated with any unobserved variables that affect the dependent variable (exogeneity), and that it affects our dependent variable ONLY through that one single independent variable (the exclusion restriction). In most instrumental variable designs, exogeneity is unverifiable and must be based on some strong assumptions. With an experiment, random assignment guarantees that the instrument is uncorrelated with unobserved user characteristics. Randomization does not guarantee the exclusion restriction, though: we still have to assume that the friend prompt changes retention only by changing the number of friends.

To generate our predicted number of friends for each user, we first regress number of friends on the binary variable indicating whether or not the user was in the treatment group. We can then predict the number of friends for each user from that model’s coefficient and intercept.

model <- lm(num_friends ~ is_treatment, data=exp_a)
exp_a$predicted_num_friends <- model$fitted.values

In fact, this is the first model we looked at above, and since there are only two values of is_treatment (‘Yes’ or ‘No’), we will only have two predicted values for number of friends: control = 27.42 (the intercept term) and treatment = 29.53 (the intercept term + the treatment effect). Once we have these values, we can perform the second stage regression and look at the coefficient to understand how a user’s number of friends affects the likelihood of returning to Twitch.

second_stage_model <- lm(
  returns_next_day ~ predicted_num_friends,
  data = exp_a
)
summary(second_stage_model)

The intercept tells us that the average user with no friends has a probability of returning of .23, and more importantly the coefficient for predicted number of friends gives us an answer to our original question: each additional friend increases that probability by .00075, or 0.075 percentage points.
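Because the instrument is binary, the second-stage coefficient has a simple reading: it is the treatment’s effect on the return rate divided by the treatment’s effect on number of friends. The sketch below checks this on simulated data; the data frame exp_sim, the hypothetical confounder engagement, and all of the numbers are assumptions made for illustration.

```r
# Simulated experiment (all values are invented for illustration).
set.seed(3)
n <- 50000
exp_sim <- data.frame(is_treatment = rbinom(n, 1, 0.5))  # assumed 0/1 coding
engagement <- rnorm(n)  # hypothetical unobserved confounder
exp_sim$num_friends <- 27 + 2 * exp_sim$is_treatment +
  5 * engagement + rnorm(n, sd = 3)
p_return <- 0.20 + 0.002 * exp_sim$num_friends + 0.05 * engagement
exp_sim$returns_next_day <- rbinom(n, 1, pmin(pmax(p_return, 0), 1))

# Stage 1: predict number of friends from treatment assignment only.
first_stage <- lm(num_friends ~ is_treatment, data = exp_sim)
exp_sim$predicted_num_friends <- fitted(first_stage)

# Stage 2: regress the outcome on the predicted number of friends.
second_stage <- lm(returns_next_day ~ predicted_num_friends, data = exp_sim)
two_stage_estimate <- coef(second_stage)[["predicted_num_friends"]]

# Ratio of the two treatment effects.
effect_on_returns <- coef(
  lm(returns_next_day ~ is_treatment, data = exp_sim)
)[["is_treatment"]]
effect_on_friends <- coef(first_stage)[["is_treatment"]]
ratio_estimate <- effect_on_returns / effect_on_friends

# The two numbers agree up to floating-point error.
c(two_stage = two_stage_estimate, ratio = ratio_estimate)
```

One consequence of this ratio form is that a small first-stage effect makes the estimate noisy, since any sampling error in the effect on returns is divided by a small number.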

The sem Package

While you can do this analysis in two steps in the way that I’ve shown above, there’s also a nice R package called sem which implements a function for two-stage least squares, called tsls(), in a single step.

library(sem)
model <- tsls(
  returns_next_day ~ num_friends,  # final model we are interested in
  ~ is_treatment,                  # instrument, as a one-sided formula
  data = exp_a
)

The first argument specifies the final model we’re interested in, and the second argument specifies the variable from which we’d like to predict the independent variable in the first argument. In this case, since we’re using is_treatment to predict number of friends, we supply the one-sided model formula ~ is_treatment.
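One detail worth knowing when comparing the two approaches: in standard 2SLS theory, the standard errors printed by a manually run second stage are based on residuals from the predicted number of friends, whereas the 2SLS standard errors use residuals from the observed number of friends. The sketch below computes both in base R on simulated data and, only if sem happens to be installed, prints the tsls() summary for comparison. The data frame sim, the confounder engagement, and all numbers are invented for illustration.

```r
# Simulated experiment (all values are invented for illustration).
set.seed(4)
n <- 50000
sim <- data.frame(is_treatment = rbinom(n, 1, 0.5))  # assumed 0/1 coding
engagement <- rnorm(n)  # hypothetical unobserved confounder
sim$num_friends <- 27 + 2 * sim$is_treatment + 5 * engagement + rnorm(n, sd = 3)
p_return <- 0.20 + 0.002 * sim$num_friends + 0.05 * engagement
sim$returns_next_day <- rbinom(n, 1, pmin(pmax(p_return, 0), 1))

# Manual two-step estimate, as in the walkthrough above.
first_stage <- lm(num_friends ~ is_treatment, data = sim)
sim$predicted_num_friends <- fitted(first_stage)
second_stage <- lm(returns_next_day ~ predicted_num_friends, data = sim)
beta <- coef(second_stage)
manual_se <- summary(second_stage)$coefficients[
  "predicted_num_friends", "Std. Error"
]

# 2SLS residuals: second-stage coefficients applied to OBSERVED num_friends.
resid_2sls <- sim$returns_next_day -
  (beta[["(Intercept)"]] + beta[["predicted_num_friends"]] * sim$num_friends)
sigma_2sls   <- sqrt(sum(resid_2sls^2) / (n - 2))
sigma_manual <- summary(second_stage)$sigma

# Rescale the manual standard error by the ratio of residual scales.
corrected_se <- manual_se * sigma_2sls / sigma_manual

c(estimate = beta[["predicted_num_friends"]],
  manual_se = manual_se,
  corrected_se = corrected_se)

# Optional cross-check, run only if the sem package is available.
if (requireNamespace("sem", quietly = TRUE)) {
  fit <- sem::tsls(returns_next_day ~ num_friends, ~ is_treatment, data = sim)
  print(summary(fit))
}
```

The coefficient is the same either way; how far the two standard errors differ depends on the data, so it is safest to report the ones from a dedicated 2SLS routine.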

Closing remarks

Usually with an instrumental variable design, the goal is causal inference from non-experimental data, but here we have a true, randomized experiment treatment to use as our instrumental variable. This means that we don’t have to worry about whether the instrument is correlated with unobserved user characteristics, since random assignment guarantees that it is not. We do still need to assume that the treatment affects retention only through the number of friends.

It’s true that we could model the outcome variable as a function of treatment directly, instead of using this two-stage methodology. The advantage, however, of using a two-stage approach like this is that we get to talk about the causal effect of the variable we actually care about, instead of a treatment that is one step removed. This means we can generalize and hypothesize about how having more friends might affect user retention, even if the increased friend count was being driven by a completely different mechanism. There’s no guarantee of this generalization of course, but, even so, being able to talk about fundamental concepts like the value of having more friends on Twitch is crucial to building knowledge about our product and our users.


Two-Stage Least Squares For A/B Tests was originally published in Twitch Blog on Medium.