This weekend, I stumbled across a post that really hit close to home. The post was written by Brad Feld, a VC whose blog I’ve subscribed to since college. Brad normally writes about startups, startup-communities, and marathons in which he participates…none of which I’m particularly interested in anymore. But in this particular post, Revisiting Median vs. Average With Accelerator Data, Brad makes some good points that relate to my work as a BI professional.
In the post, Brad argues that way too many people confuse/misuse the calculations “Average”, “Mean”, and “Median”…and that in certain cases “Median” is the better choice for a metric. After a brief reflection, I couldn’t recall a single time where I’d added a median-based metric to a report, dashboard, or scorecard. I’ve used Average-based measures plenty of times. Were there times when I used average but I should have used median?
In the rest of this post we’ll cover these three calculations (Average, Mean, Median) and I’ll take you through a hypothetical scenario where using a median-based metric is more appropriate than an average-based metric.
Average, Mean, Median…Potatoe, Potato, Potootooooooo?
Before we dig in, let’s knock out the vocabulary:
Average = sum of the parts divided by the number of parts
Mean = sum of the parts divided by the number of parts
Median = the value that falls directly between the top and bottom 50% of items in the set.
And, yes, Average and Mean have the same definition above…which is fine for most people and most scenarios. However, for the
uber-nerds stats-pros, only Arithmetic-Mean is equivalent to Average … as opposed to the long list of other Means.
Scenario: Defining Targets for Performance Metrics
Now that we’ve got the vocab out of the way, let’s roll through a scenario that will help illustrate where the median-based calculation is more appropriate.
In this scenario, we’ll pretend to be the sales director for a large retail company charged with setting the targets for the various sales performance metrics in the upcoming year… one of the metrics is sales amount per customer order.
Fun Fact: in Good to Great, a book covering the characteristics and actions taken by companies that managed to transcend “good” performance to reach and sustain “great” performance, Walgreens used a similar metric (profit per customer visit) to skyrocket it from a nobody into an enduring success.
One option is to simply set the target for this metric based the actual value for the prior year. Assuming the average sales transaction amount for the prior year is ~$100, then our target for next year will be to raise that value by 10% to $110. This is a decent strategy assuming that a) there’s some logic behind the 10% increase, and b) the sales transaction amounts roughly follow a normal distribution – such as the one below:
But what happens if our set of sales transactions contain a few outliers…or in this case, 1 really really LARGE outlier? Let’s say there was a single transaction for $10,000,000. Here’s what our updated histogram looks like:
Note 1: the screenshot above shows a frequency of 50 for the $10,000,000 outlier transaction. I only did that to make the bar long enough so that I could actually mouse over to get the details. In the rest of the calculations including this outlier, there’s only a single transaction for $10,000,000. Furthermore, it appears Excel 2013 is auto-breaking the x-axis … 10,000,000 is a far cry from closest x-axis label of $105. Unfortunately, after spending an embarrassing amount of time, I gave up trying to “unbreak” the x-axis. If you know how to do this please shoot me a comment
Due to the sheer magnitude of this single transaction, our average sales transaction amount for the prior year increases dramatically from $100 to ~$260. Setting the target at a 10% increase doesn’t seem like such a good idea anymore, does it? In fact, without another outlier transaction or two, that goal is wayyyy past “stretch-territory” and almost completely unattainable.
Dealing with Outliers
There are many potential causes of outliers, and such a discussion is outside the scope of this post. Instead we’re going to simply acknowledge that they exist and focus on dealing with them.
One option for dealing with outliers is to discard them from the set prior to calculating the average. Another option is to just use Median instead of Average.
From the scenario above, in the first case where the transactions followed a normal distribution and there weren’t any outliers, the Average and the Median were both ~$100. However, in the second part where we introduced the outlier, the Average was $260 but the Median was still ~$100!
Here’s why: When Bill Gates Walks into a Bar
- For a skewed data distribution, median is a better measure of center [...].
- For any reasonably symmetric distribution with no outliers, mean is a better measure of center [...].
So here’s the point…if you are going to define metrics or targets based on central-tendency, you need to profile the data and choose the calculation that best fits the circumstances.
Business Intelligence is about measurements. And while it’s easy to just sit back and churn out reports and dashboards to spec (I know because I’ve been guilty of this behavior), as Business Intelligence professionals it is our duty, our obligation, to step up and question the metrics, initiate open discussions with our clients/customers, and ensure they are measuring accurately and productively!