Using median rather than mean

Analysts in consumer finance are used to dealing with large sets of incoming customer data, and are often asked to provide summary statistics to show whether application quality is changing over time: Are credit scores going down? What’s the average monthly income? What’s the average age of the customers applying this quarter, compared to last?

Early on in the process, the data is likely to be messy, unsanitised, and potentially chock-full of errors. (Once the data has been processed to the point of considering credit-worthiness, then it’s all clean and perfect, yeah..?) Therefore, you have to be careful when you report on this data, as it’s easy to get nonsense figures that will send your marketing and risk people off in the wrong direction.

Two really common errors seen in numerical application data:

1. Putting gross yearly salary into the net monthly income field. It’s not always easy to spot which one a figure ought to be: if it says ‘£50000’, then it’s very likely to be an annual figure; but what about ‘£7000’? If monthly, that’s a really good wage; but I can assure you, people earning that much are still applying for credit.

2. Dates of birth: if unknown, it’s common to see ‘1st January 1900’ and similar. So when you convert it to age, the customer is over 100.

Also, if you happen to get credit scores with your data, you have to watch out for magic numbers like 9999 – which to Callcredit means “no [credit] score could be generated”, not that the customer has the credit-worthiness of Bill Gates or George Soros.
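As a rough illustration of catching these values before they reach a report, here is a minimal Python sketch. The field names, the `suspect_fields` helper and the £10,000 income cut-off are all made up for the example; only the 9999 score and the 1st January 1900 date come from the cases above. It flags values for review rather than deleting them, since a figure like £7000 could be genuine.

```python
from datetime import date

# Illustrative values only: adjust to whatever your own data provider uses.
NO_SCORE_SENTINEL = 9999                 # "no score could be generated"
PLACEHOLDER_DOB = date(1900, 1, 1)       # typical "unknown" date of birth
MAX_PLAUSIBLE_MONTHLY_INCOME = 10_000    # assumed cut-off, not a standard


def suspect_fields(application):
    """Return the names of fields that look like placeholders or mis-keyed values.

    `application` is a dict with hypothetical keys 'credit_score',
    'date_of_birth' and 'net_monthly_income'; missing keys are ignored.
    """
    suspects = []
    if application.get("credit_score") == NO_SCORE_SENTINEL:
        suspects.append("credit_score")
    if application.get("date_of_birth") == PLACEHOLDER_DOB:
        suspects.append("date_of_birth")
    income = application.get("net_monthly_income")
    if income is not None and income > MAX_PLAUSIBLE_MONTHLY_INCOME:
        suspects.append("net_monthly_income")
    return suspects


example = {
    "credit_score": 9999,
    "date_of_birth": date(1900, 1, 1),
    "net_monthly_income": 50_000,
}
flagged = suspect_fields(example)  # all three fields are flagged here
```

A flagged field can then be left out of that particular summary, or passed to someone who can check it, depending on how much you trust the rest of the record.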

Hence, it’s fairly obvious that if you’re including these figures in mean averages, you’re going to give a misleading impression, and people can infer the wrong thing. For example, say you’ve 99 applications with an average monthly income of £2000, but there’s also an incorrect application with a figure of £50,000. If you report the mean, you’ll get an answer of £2480, instead of the correct £2010 (assuming that £50k salary translates to ~£3k take-home per month). However, if you report the median, you’ll get an answer of around £2000 (assuming the other incomes are spread fairly evenly around their average), whether the incorrect data is in there or not.
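The arithmetic in that example can be checked with a few lines of Python, using only the standard library. To keep it simple, the sketch assumes all 99 good applications report exactly £2000; real data would be spread around that figure, and the variable names are just for illustration.

```python
from statistics import mean, median

# Simplifying assumption: every one of the 99 good applications is exactly £2000.
good_applications = [2000] * 99

with_error = good_applications + [50_000]   # annual salary keyed in as monthly income
corrected = good_applications + [3_000]     # the same applicant's rough monthly take-home

mean_with_error = mean(with_error)
mean_corrected = mean(corrected)
median_with_error = median(with_error)
median_corrected = median(corrected)
```

Here the mean moves from 2010 to 2480 because of a single bad record, while the median is 2000 in both cases.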

In statistical parlance, the median is “a robust measure of central tendency”, whereas the mean is not. The median is barely affected by a few outliers (at either end), because it depends only on the values in the middle of the sorted data.

End note: Credit scores can be (roughly) normally distributed; for a normal distribution, the median and mean (and mode) are the same. But data doesn’t have to be normally distributed: e.g. call-waiting times are often modelled with the exponential distribution, where the median is only about 69% of the mean (ln 2 times the mean).
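To see the gap between median and mean on skewed data, here is a small simulation sketch in Python. The 120-second average wait is an invented figure, and the data is randomly generated rather than taken from any real call centre.

```python
import math
import random
from statistics import mean, median

random.seed(42)  # fixed seed so the sketch is repeatable

MEAN_WAIT_SECONDS = 120  # hypothetical average call-waiting time

# random.expovariate takes the rate (1 / mean) of the exponential distribution.
waits = [random.expovariate(1 / MEAN_WAIT_SECONDS) for _ in range(100_000)]

sample_mean = mean(waits)
sample_median = median(waits)

# For an exponential distribution, median = mean * ln(2).
theoretical_median = MEAN_WAIT_SECONDS * math.log(2)
```

The sample median should land close to `theoretical_median` (roughly 83 seconds), well below the sample mean of about 120 seconds, even though there are no errors in the data at all.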

