What we’ll dive into today

Everybody loves measurement, but the reality is it's not always possible to run A/B tests.

For example, what if you have a new launch that the company wants to send out and blast to everybody? If you proposed having 50% as a holdout, you'd get run out of the room.

Or what if you're a marketplace and you want to run a new pricing algorithm change but a classic A/B test would break the laws of independence?

Say hello to propensity matching and switchbacks 👋

They allow teams to still measure, even if the results aren't technically causal, with more certainty and sophistication than a simple pre-post comparison (or any other less rigorous methodology).

Below, I'll share:

  1. What these methodologies are

  2. When to use them

  3. How to set them up

Switchbacks

At the simplest level, switchbacks look like an A/B test, but the randomization happens at the time level instead of the user level. The chart above does a great job visualizing it.

The philosophical idea is still the same though. You have some data where the "treatment” is on and then some data where the “treatment” is off allowing you to compare.

However, to understand why teams use switchbacks, we need to understand why they wouldn't A/B test.

When to use them

When you A/B test, there's an assumption of independence: one user's experience doesn't impact another user's. Unfortunately, marketplaces are a classic example where, because of the interaction between supply and demand, independence can quickly break down.

Let's say we want to test the impact of a new pricing algorithm on riders in a ride hailing marketplace. In a standard A/B test, 50% of riders would have the new pricing algorithm and 50% wouldn’t.

Sounds fine… but… here’s the problem:

  1. New pricing algorithm lowers fares for 50% of riders

  2. Riders with lower fares request more

  3. Drivers accept more of these trips because there’s more demand

  4. Fewer drivers are available for riders who don’t have the new pricing algorithm

  5. Prices go up for riders without the new pricing algorithm (aka dynamic pricing)

  6. Riders with higher prices request fewer trips

So riders with lower fares not only ride more because of their lower fares, they also ride more relative to the other 50%, who now face higher prices than normal. Independence is not maintained. The results are contaminated.

WHAT DO WE DO?!

With a switchback EVERYONE is either in treatment or control.

Crisis averted.

How to set it up

There are two parts to a switchback: the setup and the analysis.

The setup requires:

  1. Choosing a randomization time interval

  2. Choosing a randomization methodology

Choosing a randomization time interval

The time interval needs to be long enough to see what we want to measure but short enough to get as many samples in the time frame as possible. We also want an even sampling of test and control across days of week and times of day over the course of the experiment. To capture these business cycles, switchbacks are often a minimum of 2 weeks.

Revisiting the Uber example, marketplaces change every minute, every hour, and every day. Sunday doesn't look like Monday, Monday is different than Friday, etc.

Randomizing every minute wouldn't make sense because it takes a few minutes for a trip to be requested and completed. Randomizing every day doesn't make sense either because there's too much variability between days. So a standard practice at Uber was to randomize in 90-minute intervals.
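To put numbers on that tradeoff, here's a quick back-of-the-envelope sketch in Python (the two-week window and the interval lengths are just the examples from above):

```python
# Back-of-the-envelope: how many randomization units does a 2-week
# switchback give you at different interval lengths?
experiment_minutes = 14 * 24 * 60  # two weeks = 20,160 minutes

for label, interval_minutes in [("1 minute", 1), ("90 minutes", 90), ("1 day", 24 * 60)]:
    units = experiment_minutes // interval_minutes
    print(f"{label:>10}: {units} intervals")

# 1 minute:   20160 intervals, but each is too short for a trip to play out
# 90 minutes:   224 intervals, long enough for a trip and still plenty of samples
# 1 day:         14 intervals, far too few, and days vary too much from each other
```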

Choosing a randomization methodology

The simplest methodology is to alternate treatment vs control from one time interval to the next.

There’s nothing inherently wrong with that, but depending on your setup, the same times of day and days of the week might end up in treatment (or control) every time. That would bias the result, so be careful here.

Another methodology is to randomize the randomization itself: every time period becomes a coin flip that determines whether it's treatment or control. Sometimes you'll get two treatments in a row, sometimes two controls in a row, and so on.
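Here's a minimal sketch of what that coin-flip schedule could look like in Python, using the 90-minute intervals and two-week window from above (the column names and seed are illustrative, not from any particular team's tooling):

```python
import numpy as np
import pandas as pd

# Two weeks of 90-minute intervals = 14 days x 16 intervals/day = 224 units
rng = np.random.default_rng(seed=42)
intervals = pd.date_range("2024-11-01 00:00", periods=14 * 16, freq="90min")

schedule = pd.DataFrame({
    "interval_start": intervals,
    "treatment": rng.integers(0, 2, size=len(intervals)),  # 1 = change on, 0 = off
})

# Sanity check: treatment share should be roughly 50% at every hour of the day
print(schedule.groupby(schedule["interval_start"].dt.hour)["treatment"].mean())
```

A quick check like the last line helps catch the bias mentioned above, where the same times of day keep landing in the same arm.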

Analyzing switchbacks

To analyze switchbacks, you can’t use a standard t-test or z-test like you commonly would with an A/B test. Instead, the most well-established methodology is to use a regression. The idea is that you control for all the other factors, including time of day, day of week, and geo (if that’s part of your test), and then estimate the impact from the binary treatment-vs-control variable.
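As a rough sketch of what that regression could look like in Python (using statsmodels, with made-up column names for the outcome, treatment flag, and controls):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per interval (or interval x geo) with the metric you care about,
# a 0/1 treatment flag, and the controls. Column names are illustrative.
intervals = pd.read_csv("switchback_intervals.csv")

model = smf.ols(
    "outcome ~ treatment + C(hour) + C(day_of_week) + C(geo)",
    data=intervals,
).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

# The coefficient on `treatment` is the estimated lift from the change
print(model.params["treatment"], model.pvalues["treatment"])
```

In practice you'd likely also want standard errors that account for correlation between nearby intervals (e.g., clustering by day); the robust errors above are just a placeholder.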

Example from DoorDash

DoorDash has a great blog for more detail.

Challenges

The biggest challenge with switchbacks is the analysis. It’s a relatively easy test to set up (maybe even simpler than an A/B test), but analyzing it can be tricky, especially in more complex situations. That’s why most switchbacks are run by more mature marketplace teams, while earlier-stage marketplace teams just resort to A/B testing. However, if you feel comfortable running the data science behind it, then do it! It’s much, much better 🙂.

Propensity matching

If switchbacks are another form of A/B testing, then one way to think about propensity matching is that it’s the reverse of an A/B test. No, not a B/A test. Don’t be silly.

In a standard A/B test, you pick who is in treatment and who is in control before the experiment starts. In propensity matching, you start with the audience you want to analyze (the treatment), then look backwards at prior data to find users who are similar to that group; they become your control.

It might sound like cheating or fudging but it’s really not. It’s actually quite a clever way to assess impact when you can’t A/B Test.

Just like with switchbacks, let’s look at an example of when you simply can’t A/B test.

When to use them

Imagine you're a company launching your loyalty program for the first time!

You've done all the research and spent months developing the program, and now you want your entire audience to know about it. But leadership is asking to understand the impact of the loyalty program and whether it's worth investing in. How do you launch a loyalty program where only half of your users get it? It's even harder if you're in a marketplace, because this would again impact marketplace dynamics.

So, you use propensity matching to answer the question of “How did my loyalty program impact users?”

How to set it up

Let’s continue with our loyalty program example.

Let’s say the loyalty program launched on November 1st. By November 14th, some portion of your users will have joined the loyalty program. But a large portion of your users won’t have joined it.

Now it wouldn’t make sense to compare those that entered the loyalty program to those that didn’t because you’re not comparing apples to apples.

You’d be comparing users that self-selected against those that didn’t.

Instead, what if we compared users that “look” the same, where the only difference is whether they joined the loyalty program or not?

Let’s say that 70% of people that entered your loyalty program were male while your normal audience is about 55% male. Well, that seems biased already.

But then we use propensity matching and find all of the males and females who behaved like the people who entered the loyalty program, except they decided not to enter it. Let’s count all of those users as control. Now both treatment and control are about 70% male! Yay.

Here’s a good chart below that will highlight what I mean:

Using propensity matching, you normalize all the variables that might be the reason a person did or didn’t make a decision, so the only difference left between the two groups of users is the decision itself. Hey! You’ve isolated the decision and thus the effect of that decision!

Now that’s just one variable. You can extend this type of analysis across many variables to normalize further, removing as many biases as possible.
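Here's a minimal sketch of one common way to do this in Python: estimate each user's propensity to join with a logistic regression, then match each member to the non-member with the closest score. All column names, the feature list, and the nearest-neighbor choice are illustrative assumptions, not a prescribed recipe:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# `users` has pre-launch features, a 0/1 `joined_loyalty` flag, and a
# post-launch `outcome` (e.g. spend in the two weeks after launch).
users = pd.read_csv("users.csv")
features = ["is_male", "orders_last_90d", "spend_last_90d", "tenure_days"]

# 1. Propensity to join, estimated only from pre-launch behavior
prop_model = LogisticRegression(max_iter=1000).fit(users[features], users["joined_loyalty"])
users["propensity"] = prop_model.predict_proba(users[features])[:, 1]

treated = users[users["joined_loyalty"] == 1]
candidates = users[users["joined_loyalty"] == 0]

# 2. Match each member to the non-member with the closest propensity score
nn = NearestNeighbors(n_neighbors=1).fit(candidates[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched_control = candidates.iloc[idx.ravel()]

# 3. Compare outcomes between the matched groups to estimate the program's impact
print("Estimated lift:", treated["outcome"].mean() - matched_control["outcome"].mean())
```

Before trusting that lift, it's worth checking that the matched control group actually mirrors the treatment group on the features you normalized (like the 70% male example above).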

Here’s a good article for a better understanding.

Challenges

The challenge with this methodology is finding suitable “control” users, as it requires that you’ve accounted for many biases. There’s no great way around this other than having a deep understanding of your users, ensuring that you’re running the right code, and translating the output correctly. I’ve seen many an analysis get derailed because the interpretation of the results was incorrect.

A second challenge with this methodology is explaining it to outside parties. “Causal” and “inference” are already complicated terms on their own, and combining them into “causal inference” makes it even harder to explain.

The best way I’ve seen it communicated is this:

“We couldn’t A/B test it because we’d have to hold out 50%. Instead, we used propensity matching, where we’re able to compare users that [insert effect to measure] to other users who didn’t [insert effect to measure] but behaved very similarly before. This gives us the closest we can get to apples-to-apples comparisons.”

That’s it for this week.

Go forth and measure!
