What we’ll dive into today

Everybody loves measurement, but the reality is it's not always possible to run A/B tests.

For example, what if you have a new launch that the company wants to send out and blast to everybody? If you proposed having 50% as a holdout, you'd get run out of the room.

Or what if you're a marketplace and you want to run a new pricing algorithm change but a classic A/B test would break the laws of independence?

Say hello to propensity matching and switchbacks 👋

They allow teams to still measure, even if the results aren't technically causal, with more certainty and sophistication than a simple pre-post comparison (or any other less rigorous methodology).

Below, I'll share:

  1. What these methodologies are

  2. When to use them

  3. How to set them up

Switchbacks

At the simplest level, switchbacks look like an A/B test, but the randomization happens at the time level instead of the user level. The chart above does a great job visualizing it.

The philosophical idea is still the same though. You have some data where the "treatment” is on and then some data where the “treatment” is off allowing you to compare.

However, to understand why teams use switchbacks, we need to understand why they wouldn't A/B test.

When to use them

When you A/B test, there's an assumption of independence: one user's experience doesn't impact another user's. Unfortunately, marketplaces are a classic example where, because of the interaction between supply and demand, independence can quickly break down.

Let's say we want to test the impact of a new pricing algorithm on riders in a ride hailing marketplace. In a standard A/B test, 50% of riders would have the new pricing algorithm and 50% wouldn’t.

Sounds fine… but… here’s the problem:

  1. New pricing algorithm lowers fares for 50% of riders

  2. Riders with lower fares request more

  3. Drivers accept more of these trips because there’s more demand

  4. Fewer drivers are available for riders who don’t have the new pricing algorithm

  5. Prices go up for riders without the new pricing algorithm (aka dynamic pricing)

  6. Riders with higher prices request fewer trips

So riders with lower fares not only ride more because of their lower fares, they also ride more relative to the other 50%, who now face higher prices than normal. Independence is not maintained. The results are contaminated.

WHAT DO WE DO?!

With a switchback EVERYONE is either in treatment or control.

Crisis averted.

How to set it up

There are two parts to a switchback: the setup and the analysis.

The setup requires:

  1. Choosing a randomization time interval

  2. Choosing a randomization methodology

Choosing a randomization time interval

The time interval needs to be long enough to see what we want to measure but short enough to get as many samples in the time frame as possible. We also want an even sampling of test and control across days of week and times of day over the course of the experiment. To capture these business cycles, switchbacks are often a minimum of 2 weeks.

Revisiting the Uber example, marketplaces change every minute, every hour, and every day. Sunday doesn't look like Monday, Monday is different than Friday, etc.

Randomizing every minute wouldn't make sense because it takes a few minutes for a trip to be requested and completed. Randomizing every day doesn't make sense either because there's too much variability between days. So a standard practice at Uber was to randomize in 90-minute intervals.
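To put numbers on that tradeoff, here's a quick back-of-the-envelope sketch in Python (the two-week window and the interval lengths are just the examples from above):

```python
# Back-of-the-envelope: how many randomization units does a 2-week
# switchback give you at different interval lengths?
experiment_minutes = 14 * 24 * 60  # two weeks = 20,160 minutes

for label, interval_minutes in [("1 minute", 1), ("90 minutes", 90), ("1 day", 24 * 60)]:
    units = experiment_minutes // interval_minutes
    print(f"{label:>10}: {units} intervals")

# 1 minute:   20160 intervals, but each is too short for a trip to play out
# 90 minutes:   224 intervals, long enough for a trip and still plenty of samples
# 1 day:         14 intervals, far too few, and days vary too much from each other
```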

Choosing a randomization methodology

The simplest methodology is to alternate treatment vs control from one time interval to the next.

There’s nothing inherently wrong with that, but depending on your setup, the same times of day and days of the week might end up in treatment (or control) every time. That would bias the result, so be careful here.

Another methodology is to randomize the randomization itself: every time period becomes a coin flip that determines whether it's treatment or control. Sometimes you'll get two treatments in a row, sometimes two controls in a row, and so on.
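Here's a minimal sketch of what that coin-flip schedule could look like in Python, using the 90-minute intervals and two-week window from above (the column names and seed are illustrative, not from any particular team's tooling):

```python
import numpy as np
import pandas as pd

# Two weeks of 90-minute intervals = 14 days x 16 intervals/day = 224 units
rng = np.random.default_rng(seed=42)
intervals = pd.date_range("2024-11-01 00:00", periods=14 * 16, freq="90min")

schedule = pd.DataFrame({
    "interval_start": intervals,
    "treatment": rng.integers(0, 2, size=len(intervals)),  # 1 = change on, 0 = off
})

# Sanity check: treatment share should be roughly 50% at every hour of the day
print(schedule.groupby(schedule["interval_start"].dt.hour)["treatment"].mean())
```

A quick check like the last line helps catch the bias mentioned above, where the same times of day keep landing in the same arm.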

Analyzing switchbacks

To analyze switchbacks, you can’t use a standard t-test or z-test like you commonly would with an A/B test. Instead, the most well-established methodology is to use a regression. The idea is that you control for all the other factors, including time of day, day of week, and geo (if that’s part of your test), and then estimate the impact from the binary treatment-vs-control variable.
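As a rough sketch of what that regression could look like in Python (using statsmodels, with made-up column names for the outcome, treatment flag, and controls):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per interval (or interval x geo) with the metric you care about,
# a 0/1 treatment flag, and the controls. Column names are illustrative.
intervals = pd.read_csv("switchback_intervals.csv")

model = smf.ols(
    "outcome ~ treatment + C(hour) + C(day_of_week) + C(geo)",
    data=intervals,
).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

# The coefficient on `treatment` is the estimated lift from the change
print(model.params["treatment"], model.pvalues["treatment"])
```

In practice you'd likely also want standard errors that account for correlation between nearby intervals (e.g., clustering by day); the robust errors above are just a placeholder.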

Example from DoorDash

DoorDash has a great blog for more detail.

Challenges

The biggest challenge with switchbacks is the analysis. It’s a relatively easy test to set up (maybe even simpler than an A/B test), but analyzing it can be tricky, especially in more complex situations. That’s why most switchbacks are run by more mature marketplace teams, while earlier-stage marketplace teams just resort to A/B testing. However, if you feel comfortable running the data science behind it, then do it! It’s much, much better 🙂.

Propensity matching

If switchbacks are another form of A/B testing, then one way to think about propensity matching is that it’s the reverse of an A/B test. No, not a B/A test. Don’t be silly.

In a standard A/B test, you pick who is in treatment and who is in control before the experiment starts. In propensity matching, you start with the audience you want to analyze (the treatment), then look backwards at prior data to find users who are similar to that group; they become your control.

It might sound like cheating or fudging but it’s really not. It’s actually quite a clever way to assess impact when you can’t A/B Test.

Just like with switchbacks, let’s look at an example of when you simply can’t A/B test.

When to use them

Imagine you're a company launching your loyalty program for the first time!

You've done all the research and spent months developing the program, and now you want your entire audience to know about it. But leadership is asking to understand the impact of the loyalty program and whether it's worth investing in. How do you launch a loyalty program where only half of your users get it? It's even harder if you're in a marketplace, because this would again impact marketplace dynamics.

So, you use propensity matching to answer the question of “How did my loyalty program impact users?”

How to set it up

Let’s continue with our loyalty program example.

Let’s say the loyalty program launched on November 1st. By November 14th, some portion of your users will have joined the loyalty program. But a large portion of your users won’t have joined it.

Now it wouldn’t make sense to compare those that entered the loyalty program to those that didn’t because you’re not comparing apples to apples.

You’d be comparing users that self-selected against those that didn’t.

Instead, what if we compared users that “look” the same, where the only difference is whether they joined the loyalty program or not?

Let’s say that 70% of people that entered your loyalty program were male while your normal audience is about 55% male. Well, that seems biased already.

But then we use propensity matching and find all of the males and females who behaved like the people who entered the loyalty program, except they decided not to enter it. Let’s count all of those users as control. Now both treatment and control are about 70% male! Yay.

Here’s a good chart below that will highlight what I mean:

Using propensity matching, you normalize all the variables that might be the reason a person did or didn’t make a decision, so the only difference left between the two groups of users is the decision itself. Hey! You’ve isolated the decision and thus the effect of that decision!

Now that’s just one variable. You can extend this type of analysis across many variables to normalize further, removing as many biases as possible.
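Here's a minimal sketch of one common way to do this in Python: estimate each user's propensity to join with a logistic regression, then match each member to the non-member with the closest score. All column names, the feature list, and the nearest-neighbor choice are illustrative assumptions, not a prescribed recipe:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# `users` has pre-launch features, a 0/1 `joined_loyalty` flag, and a
# post-launch `outcome` (e.g. spend in the two weeks after launch).
users = pd.read_csv("users.csv")
features = ["is_male", "orders_last_90d", "spend_last_90d", "tenure_days"]

# 1. Propensity to join, estimated only from pre-launch behavior
prop_model = LogisticRegression(max_iter=1000).fit(users[features], users["joined_loyalty"])
users["propensity"] = prop_model.predict_proba(users[features])[:, 1]

treated = users[users["joined_loyalty"] == 1]
candidates = users[users["joined_loyalty"] == 0]

# 2. Match each member to the non-member with the closest propensity score
nn = NearestNeighbors(n_neighbors=1).fit(candidates[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched_control = candidates.iloc[idx.ravel()]

# 3. Compare outcomes between the matched groups to estimate the program's impact
print("Estimated lift:", treated["outcome"].mean() - matched_control["outcome"].mean())
```

Before trusting that lift, it's worth checking that the matched control group actually mirrors the treatment group on the features you normalized (like the 70% male example above).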

Here’s a good article for a better understanding.

Challenges

The challenge with this methodology is finding suitable “control” users, as it requires that you’ve accounted for many biases. There’s no great way around this other than having a deep understanding of your users, ensuring that you’re running the right code, and translating the output correctly. I’ve seen many an analysis get derailed because the interpretation of the results was incorrect.

A second challenge with this methodology is explaining it to outside parties. “Causal” and “inference” are already complicated terms on their own, and combining them into “causal inference” makes it even harder to explain.

The best way I’ve seen it communicated is this:

“We couldn’t A/B test it because we’d have to hold out 50%. Instead, we used propensity matching, where we’re able to compare users that [insert effect to measure] to other users who didn’t [insert effect to measure] but behaved very similarly before. This gives us the closest we can get to apples-to-apples comparisons.”

That’s it for this week.

Go forth and measure!
