Breakdown of 10 common Analytics concepts
Everyone is being pushed to be more data-driven. Not everyone has the know-how. I'm breaking down 10 of the most common analytics concepts so you can feel empowered.
👋 Hey, it’s Sundar! Thanks for reading experiMENTAL: my newsletter helping founders and marketers navigate the CRAZY world of consumer tech with secrets from 10+ years in Marketing at Uber & others.
Every growth leader I’ve worked with in 10+ years in Marketing Data Science for consumer tech companies like Uber has wanted to feel more empowered to prove the ROI of their work.
To do that, you need to speak the right language so you can have deeper conversations, ask better questions, and make better decisions.
Below I’ll break down 10 of the most common analytical concepts to help you do exactly that.
For each concept I’ll share:
A brief explanation
When to use it
An example
I’ve also ordered them from least to most advanced.
Concepts
Distribution
The concept of distribution is extremely important.
It’s the foundation for statistics.
A distribution shows how frequently different values occur in a data set.
Let’s look at the histogram below (which is a classic way to plot distribution).
The X axis is the values of the data in our data set.
The Y axis is how frequently those values occur.
The way the chart looks is known as the shape of the distribution.
The shape provides insights about the nature of the data and is a starting point for most analyses.
When to use it:
All the time.
It’s the best way to understand the pattern of our data.
It helps you identify what the data looks like while spotting any outliers.
Example
On the X axis we plot “Response time (hrs)”.
On the Y axis we plot frequency. For each response time, this shows how many times we responded that quickly.
175 times we responded in 2 hours.
70 times we responded in 13 hours.
In isolation, these pieces of information are useful.
But, when you zoom out, you see some very interesting patterns.
We’re mostly responding in < 5 hours 👍️
A good chunk of responses are between 10 - 20 hours 👎️
The next step would be to dig in and see why responses are taking 10-20 hours.
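If you want to build a distribution like this yourself, here is a minimal Python sketch. The response times are made up for illustration (they are not the data behind the chart), and the text histogram is just a rough stand-in for a real plot.

```python
from collections import Counter

# Hypothetical response times in hours (made up for illustration).
response_times_hrs = [1, 2, 2, 3, 2, 4, 13, 2, 1, 15, 3, 12, 2, 4, 18]

# Count how many times each value occurs: this is the distribution.
frequency = Counter(response_times_hrs)

# Rough text histogram: one row per response time, one "#" per occurrence.
for hours in sorted(frequency):
    print(f"{hours:>2} hrs | {'#' * frequency[hours]}")

# Zoom out: what share of responses took less than 5 hours?
fast_share = sum(1 for t in response_times_hrs if t < 5) / len(response_times_hrs)
print(f"Under 5 hours: {fast_share:.0%}")  # Under 5 hours: 73%
```

Even with this tiny sample you can see the same kind of pattern: a big cluster of fast responses and a handful of slow ones worth digging into.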
Common Distributions
When you start plotting distributions, you’ll start to see patterns.
These patterns already have names:
Normal Distribution: Looks symmetrical with most values clustered around a single point.
Many natural phenomena follow this distribution (like human height).
Of all of the distributions, the normal is the one you’ll see the most and it’s so common it’s called normal.
Uniform Distribution: All values show up with similar frequencies.
A good example is birthdays by month.
Birthdays are roughly evenly (or uniformly 😉, hence the name) distributed across months.
Skewed: Most values pile up at one end and the rest trail off in a long tail, so it looks like a staircase.
A good example is income where there’s a “tail” representing many outliers. Thanks millionaires and billionaires. Could have just been one of us.
Bimodal: Two peaks.
This suggests two distinct groups within the data which is a great cue that you might have to better segment your data.
Median vs. Mean
First, Mean is just the fancy way of saying average. I’m just going to use average from now on.
Median and Average are 2 different ways to describe what the “middle” of a set of data looks like.
It’s a way to summarize the general ballpark of what you think the data looks like.
The difference might seem straightforward, but it has very important consequences.
In almost all cases, I recommend calculating Median and Average to get a better understanding of the data.
| | Median | Average |
|---|---|---|
| Calculation | The middle value when data is ordered from lowest to highest. | The sum of all values divided by the number of values. |
Let’s look at a random data set:
Values | 1, 2, 3, 5, 8, 13, 21 |
---|---|
Median | 5 |
Average | about 7.6 (53 ÷ 7) |
Because the Median is based only on the order of the data, it cares less about “outliers”.
I could change the 21 to 21,000 and the Median would not change.
The Average on the other hand is affected by every piece of data.
It’s impacted by extreme outliers (low and high)
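Here's a quick way to check this yourself. This is a minimal sketch using Python's built-in `statistics` module on the data set above, then swapping the 21 for 21,000 to see what moves.

```python
from statistics import mean, median

values = [1, 2, 3, 5, 8, 13, 21]
print(median(values))          # 5
print(round(mean(values), 2))  # 7.57

# Swap the largest value for an extreme outlier.
with_outlier = [1, 2, 3, 5, 8, 13, 21_000]
print(median(with_outlier))          # 5 (unchanged)
print(round(mean(with_outlier), 2))  # 3004.57
```

The Median stays at 5 while the Average jumps from under 8 to over 3,000, all because of one value.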
When to use it
Use the Median when you have skewed data or outliers that might make the Average look better or worse than it is.
Use the Average when you want to take into account all values equally.
Pro tip: Use them both to get a better sense of the data
Example
You’re making a decision on budgets based on CAC:LTV ratios and you want to know what the LTV of your customers is.
Let’s chart the distribution of our customers’ LTVs.
(Quick aside: the Mode is just the value that occurs the most.)
Most customers have a decent LTV but a few have really high LTVs.
You calculate the Median and Average LTV.
The Average (or Mean) will be higher than what a typical customer is worth because it’s being “inflated” by the few very large LTVs.
Instead, use the Median.
You’ll have a more conservative number and a better representation of the normal customer.
If you used the Average, you could end up spending too much on CAC.
Percentiles
We start by plotting a distribution of your data.
The “Nth percentile” is the value where N% of the data is less than that value.
10th percentile means 10% of the data is less than that value.
Let’s look at some data as an example:
Random Values | Sorted Data |
---|---|
7 | 0 |
8 | 1 |
10 | 3 |
0 | 4 |
6 | 6 |
15 | 7 |
1 | 8 |
4 | 10 |
3 | 11 |
11 | 15 |
There are 10 data points that have been sorted.
We want to find the 40th percentile.
That’s the value where 40% of the data is less than it.
With 10 data points, 4 of them (0, 1, 3, and 4) are less than 6. That’s 40% of the data.
This means 6 is the 40th percentile.
Random Values | Sorted Data | Percentile |
---|---|---|
7 | 0 | |
8 | 1 | 10% |
10 | 3 | 20% |
0 | 4 | 30% |
6 | 6 | 40% |
15 | 7 | 50% |
1 | 8 | 60% |
4 | 10 | 70% |
3 | 11 | 80% |
11 | 15 | 90% |
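The table above can be reproduced with a few lines of Python. This is a minimal sketch that follows the definition used in this post (the share of data strictly less than a value). The function names `percent_below` and `nth_percentile` are my own, and libraries such as NumPy use interpolation by default, so they can return slightly different numbers on small data sets.

```python
def percent_below(data, value):
    """Percentage of data points that are strictly less than `value`."""
    return 100 * sum(1 for x in data if x < value) / len(data)


def nth_percentile(data, n):
    """Smallest value in the data with at least n% of the data below it."""
    for value in sorted(data):
        if percent_below(data, value) >= n:
            return value
    return max(data)  # No value qualifies (only possible for very high n).


data = [7, 8, 10, 0, 6, 15, 1, 4, 3, 11]

print(percent_below(data, 6))    # 40.0
print(nth_percentile(data, 40))  # 6
print(nth_percentile(data, 90))  # 15
```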
When to use it
When you want to understand the distribution of your data, identify outliers, or compare individual data points to the rest of the dataset.
Example
At Uber, we would use percentiles to measure user experience.
We would look at ETAs for drivers to pick up riders.
Our average pickup ETA was 3 minutes which sounds great.
But, our 90th percentile was 12 minutes.
This means 10% of the time customers had to wait > 12 minutes to have the driver pick them up.
That’s a really poor UX and it’s something we’d work hard to improve.
Variance and Standard Deviation
We’ve learned about Median & Mean (Average) as a way to understand the general ballpark of what’s happening in a distribution.
Variance gives us another piece of information.
Variance quantifies how spread out a set of data is.
Effectively, is the data similar to each other or quite different?
Let’s unpack that with an example.
Data Set 1 | Data Set 2 |
---|---|
1 | -5 |
1 | -3 |
1 | -1 |
1 | 1 |
1 | 3 |
1 | 5 |
1 | 7 |
If we were to calculate the average for both sets, the average would be 1.
The Variance of the two sets, however, is extremely different.
The Variance for Data Set 1 is 0.
The Variance for Data Set 2 is about 18.7.
Don’t worry about what the numbers mean yet, but clearly Data Set 2 has more Variance.
This makes sense. It’s obvious that Data Set 2 is more “spread out”.
The numbers are all over the place.
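If you're curious where those numbers come from, here is a minimal Python sketch. It assumes the figure for Data Set 2 is the sample Variance (dividing by n - 1), which is what `statistics.variance` computes. Dividing by n instead (the population Variance, `statistics.pvariance`) gives 16.

```python
from statistics import mean, pvariance, variance

data_set_1 = [1, 1, 1, 1, 1, 1, 1]
data_set_2 = [-5, -3, -1, 1, 3, 5, 7]

# Both sets have the same average.
print(mean(data_set_1), mean(data_set_2))  # 1 1

# By hand: average the squared distances from the mean (dividing by n - 1).
avg = mean(data_set_2)
squared_distances = [(x - avg) ** 2 for x in data_set_2]
by_hand = sum(squared_distances) / (len(data_set_2) - 1)
print(round(by_hand, 2))  # 18.67

# Same thing with the standard library.
print(variance(data_set_1))            # 0
print(round(variance(data_set_2), 2))  # 18.67 (sample Variance)
print(pvariance(data_set_2))           # 16 (population Variance)
```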
When to use it
Like many of the concepts here, the answer is always. Specifically, calculate Variance when you’re first exploring data or want to better understand it.
It helps you understand how spread out data is and if there are many outliers.
Example
Say you’re looking at daily website traffic. You need to understand how consistent your visitor numbers are and identify unusually high or low traffic days.
Calculate the average and Variance over a prior time period to learn what normal day-to-day variation looks like. A day that sits much further from the average than that normal spread is likely an unusually high or low traffic day.
Bonus: Standard Deviation
Standard Deviation is simply the square root of the Variance. It’s a term you’ll hear more often.
For both Variance and Standard Deviation the higher the number the more spread out the data is and you’ll have to be careful of outliers.
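To tie this back to the website traffic example, here is a minimal Python sketch. The visitor counts are made up, the `is_unusual` helper is hypothetical, and the "more than 2 Standard Deviations from the average" cutoff is a common rule of thumb I'm assuming here, not a hard rule.

```python
from statistics import mean, stdev

# Hypothetical daily visitors from a prior period (made up for illustration).
baseline = [980, 1020, 1005, 990, 1010, 1000, 995]

avg = mean(baseline)      # 1000
spread = stdev(baseline)  # sample Standard Deviation, about 13.2


def is_unusual(visitors, threshold=2.0):
    """True if a day is more than `threshold` Standard Deviations from the average."""
    return abs(visitors - avg) > threshold * spread


# Hypothetical new days to check against the baseline.
new_days = {"Mon": 1008, "Tue": 1400, "Wed": 610}
unusual = [day for day, visitors in new_days.items() if is_unusual(visitors)]
print(unusual)  # ['Tue', 'Wed']
```

A tighter or looser threshold changes how many days get flagged, so pick one that matches how noisy your traffic normally is.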