Posts Tagged scorecards
As I think I’ve mentioned before, one of the ‘go-to’ stats in my scorecard-building toolkit is the p-value that results from performing Fisher’s Exact Test on contingency tables. It’s straightforward to generate (in most cases), and directly interpretable: it’s just a sum of the probabilities of ‘extreme’ tables. When I started building credit risk scorecards, and using the Information Value (IV) statistic, I had to satisfy myself that there was a sensible relationship between the two values. Now, my combinatoric skills are far too lacking to attempt a rigorous mathematical analysis, so naturally I turn to R and the far easier task of simulating lots of data!
I generated 10,000 2-by-2 tables at random, with cell counts between 5 and 100. Here’s a plot of the (base e) log of the resulting exact p-value, against the log of the IV:
(I’ve taken logs as the relationship is clearer.) As you can see, I’ve drawn in some lines for the typical levels of p-value that people care about (5%, 1% and 0.1%), and the same for the IV (0.02, 0.1, 0.3 and 0.5). In the main, it looks like you’d expect, no glaring outliers.
For fun, I’ll look at those that fall into the area (p_exact > 0.05) and (0.3 < IV < 0.5):
|p = 0.0751, IV = 0.332||
|p = 0.0613, IV = 0.321|
In both cases, the exact p-value says there’s not much evidence that the row/column categories are related to each other — yet the IV tells us there’s “strong evidence”! Of course, the answer is that there’s no one single measure of independence that covers all situations; see, for instance, the famous Anscombe’s Quartet for a visual representation.
Practically, for the situations in which I’m using these measures, it doesn’t matter: if I have at least one indication of significance, I may as well add another candidate variable to the logistic regression that’ll form the basis of my scorecard. If the model selection process doesn’t end up using it, that’s fine.
Anyway, I end with a minor mystery. In my previous post, I came up with an upper bound for the IV, which means I can scale my IV to be between zero and one. I presumed that this new scaled version would be more correlated with the exact p-value; after all, how can a relationship with an IV of 0.25, but an upper bound of 5, be less significant than one with an IV of 0.375, but an upper bound of 15 (say)? Proportionally, the former is twice as strong as the latter, no?
What I found was that the scaled version was consistently less correlated! Why would this be? Surely, the scaling is providing more information? I have some suspicions, but nothing concrete at present — hopefully, I can clear this up in a future post.
When building a credit risk scorecard, it’s standard practice to take a continuous variable and discretise (or ‘bin’) it into a small number of bands*. A common approach is to:
- Partition the variable into 10-20 subsets of equal size — This is called ‘fine classing’
- Use bad rates** to combine similar adjacent subsets, to produce a variable with fewer levels, while not overly reducing its significance — This is called ‘coarse classing’
There’s a neater, simpler way to work out a good set of bands for our continuous variables, using decision trees.
* A ‘proper’ statistician would never do this, but this is just what we do when we build credit risk scorecards. Please don’t blame me! 🙂
** Or odds ratio, or chi-squared, or whichever statistic makes most sense to you. Personally, I use Fisher’s exact p-value.
For the sake of example, let’s say we have a dataset composed of a binary outcome, Bad (e.g. ‘Went s payments down within t months: yes/no’), and a single explanatory variable: Age.
We’ll generate a dummy dataset:
# Let's add an air of sophistication by generating our variable # from a truncated normal distribution: library(truncnorm); Age <- rtruncnorm(1000,a=18,b=65,mean=35,sd=10); # 1000 samples summary(Age); # Verify that 18 <= Age <= 65 ; run hist(Age) to check it looks 'normal' # Dummy up a relationship between Age and the outcome: z <- (-0.1 * Age) + 1.5; prob <- 1/(1+exp(-z)); # inverse logit function Bad <- rbinom(1000, 1, prob); table(Bad); Bad 0 1 857 143
For this dummy dataset, our bad rate is 14.3%.
# Check the Age coefficient is ~ -0.1, and the intercept is ~ 1.5 # (they probably won't be very close) glm(Bad ~ Age, family=binomial); (Intercept) Age 1.10617 -0.08565 # As you can see, not very close. # (What happens with 10,000 samples, instead of 1,000?)
rpart is an R function (and library) for creating decision / classification trees; see, for example, r-bloggers. Let’s try running rpart on our data:
library(rpart); # Load in the already-installed rpart library library(rpart.plot); # For fancy-looking decision trees rp <- rpart(Bad ~ Age); # Similar formula-based syntax to glm() fancyRpartPlot(rp);
At the bottom of the diagram, you can see 4 leaf nodes, hence we have 4 age bands — which seems ok. However, look at the bottom-right node: it’s only got 7 cases in it! When building scorecards, we don’t want bands with so few cases in, they won’t be stable over time.
Fortunately, we can specify the minimum size a node can be:
rp <- rpart(Bad ~ Age, control=rpart.control(minbucket=50)); fancyRpartPlot(rp)
Only 3 bands this time, but the minimum band has 137 cases — much better. Let’s see what the thresholds are:
rp$splits[,"index"]; Age Age 35.29226 25.94350 # Create the banded variable: AgeBand <- cut(Age, breaks=c(-Inf, 25.94350, 35.29226, Inf), right=FALSE);
If we crosstab AgeBand and Bad, we get:
0 1 Total % [-Inf,25.9) 94 43 137 31.387 [25.9,35.3) 290 66 356 18.539 [35.3, Inf) 473 34 507 6.706 Total 857 143 1000 14.300
(I used table to do the crosstab, then used addmargins and cbind to add the margins and percentages.)
Clearly, as age increases, the bad rate decreases significantly. The IV (Information Value) for AgeBand is 0.4954, so it’s a variable that would be a definite candidate for inclusion in our final scorecard.
Although this is an easy method of working out the bands, I’d still recommend the traditional method alongside, as (a) you’ve got more control over the combining of ‘fine’ classes, and over the relative percentages of bads in each band, and (b) it’s useful to have more than one discretised version of a variable available to the model building process — especially if your scorecard is based upon a logistic regression, and not a set of ‘weights of evidence’. The regression takes correlation into account, and hence the ‘best’ bands for a particular variable can be different once other, more information-rich variables have been added to the model.
When I’m not writing SQL, it’s a good bet that I’m building a scorecard – in consumer finance, they’re ubiquitous: customer accept/reject, fraud detection, marketing, collections, etc.
In ideal circumstances, the code that performs the scoring of the data is completely within my control (usually written in T-SQL). However, I’ve recently built a scorecard that needed to be implemented in a third-party system – and this particular system just spits out a score, it can’t currently tell you the contribution of each variable. But why might this be a problem? I mean, I specified the scorecard in the first place, and I can get at the raw data, so…?
As is typical, the issue arises when something goes wrong: data is run through the scoring system, but the score isn’t as expected. How can we help ourselves with debugging?
Here’s an example scorecard, which for the purposes of demonstration, I’ve written in SQL:
SELECT Score = 497 -- Customer age in years + CASE WHEN Age < 20 THEN -10 WHEN Age < 40 THEN -2 WHEN Age < 60 THEN 2 WHEN Age < 80 THEN 10 END -- BLR = Balance-to-Limit Ratio, as % + CASE WHEN BLR < 50.0 THEN 30 WHEN BLR < 90.0 THEN 10 WHEN BLR < 100.0 THEN 2 WHEN BLR >= 100.0 THEN -45 END -- (A real scorecard would have more variables.) FROM dbo.Data
Given just these two variables, you can see that there are two ways in which the overall score could be 509: either the customer is aged between 40 and 59, with a balance-limit ratio (BLR) between 50.0% and 89.9%; or they are aged between 60 and 79, with a BLR between 90.0% and 99.9%. But which is it? As it stands, we can’t know from the score alone.
However, if we make some minor changes, it becomes possible. Take a look at the following tweaks to the CASE statements:
CASE WHEN Age < 20 THEN -10.0001 WHEN Age < 40 THEN -2.0002 WHEN Age < 60 THEN 2.0004 WHEN Age < 80 THEN 10.0008 END CASE WHEN BLR < 50.0 THEN 30.0016 WHEN BLR < 90.0 THEN 10.0032 WHEN BLR < 100.0 THEN 2.0064 WHEN BLR >= 100.0 THEN -45.0128 END
To each possible score, I’ve added a small decimal that is a power of 2, divided by some large enough power of 10. Now we can work out from the overall score alone, exactly how it was built up.
A customer aged between 40 and 59, with a balance-limit ratio (BLR) between 50.0% and 89.9%, now has a score of 509.0036; whereas a customer aged between 60 and 79, with a BLR between 90.0% and 99.9% now has a score of 509.0072. And there is only one way in each case that the score could have been achieved.
Fairly obviously, you should round the score before making the Pass/Refer/Fail decision. But if it’s not possible, those extra digits after the decimal point are only going to have a tiny effect on the predictive power of the scorecard.
(Of course, the above only works if your scoring system can cope with decimal scores. If it can’t, you’re out of luck.)
One final thing: I’ve recently seen some scorecard errors that were due the scorecard not specifying scores for all possible values of a variable. The commonly accepted convention is that if a value isn’t present (usually blank or NULL), then it contributes zero – but if you aren’t aware of all the possible values, or the code is wrong, then you’ll have problems. If it’s under my control, I make sure all possible values are covered (even if they’d normally score zero), and add a catch-all for each variable, e.g.:
CASE WHEN Age >= 18 AND Age < 20 THEN -10.0001 WHEN Age >= 20 AND Age < 40 THEN -2.0002 WHEN Age >= 40 AND Age < 50 THEN 0.0004 WHEN Age >= 50 AND Age < 60 THEN 2.0008 WHEN Age >= 60 AND Age < 80 THEN 10.0016 ELSE -99999.99 END
, and then Decline or Refer all scores less than zero.
Do you have any scorecard tricks that have made your life easier? Please tell us about them in the comments below.