Posts Tagged binning

Binning with decision trees

When building a credit risk scorecard, it’s standard practice to take a continuous variable and discretise (or ‘bin’) it into a small number of bands*. A common approach is to:

  1. Partition the variable into 10-20 subsets of equal size — This is called ‘fine classing’
  2. Use bad rates** to combine similar adjacent subsets, to produce a variable with fewer levels, while not overly reducing its significance — This is called ‘coarse classing’

There’s a neater, simpler way to work out a good set of bands for our continuous variables, using decision trees.

* A ‘proper’ statistician would never do this, but this is just what we do when we build credit risk scorecards. Please don’t blame me! 🙂
** Or odds ratio, or chi-squared, or whichever statistic makes most sense to you. Personally, I use Fisher’s exact p-value.

For the sake of example, let’s say we have a dataset composed of a binary outcome, Bad (e.g. ‘Went s payments down within t months: yes/no’), and a single explanatory variable: Age.

We’ll generate a dummy dataset:

# Let's add an air of sophistication by generating our variable
# from a truncated normal distribution:
Age <- rtruncnorm(1000,a=18,b=65,mean=35,sd=10); # 1000 samples

summary(Age); # Verify that 18 <= Age <= 65 ; run hist(Age) to check it looks 'normal'

# Dummy up a relationship between Age and the outcome:
z <- (-0.1 * Age) + 1.5;
prob <- 1/(1+exp(-z)); # inverse logit function
Bad <- rbinom(1000, 1, prob);
  0   1 
857 143

For this dummy dataset, our bad rate is 14.3%.

# Check the Age coefficient is ~ -0.1, and the intercept is ~ 1.5
# (they probably won't be very close)
glm(Bad ~ Age, family=binomial);
(Intercept)          Age  
    1.10617     -0.08565 

# As you can see, not very close.
# (What happens with 10,000 samples, instead of 1,000?)

rpart is an R function (and library) for creating decision / classification trees; see, for example, r-bloggers. Let’s try running rpart on our data:

library(rpart); # Load in the already-installed rpart library
library(rpart.plot); # For fancy-looking decision trees
rp <- rpart(Bad ~ Age); # Similar formula-based syntax to glm()

At the bottom of the diagram, you can see 4 leaf nodes, hence we have 4 age bands — which seems ok. However, look at the bottom-right node: it’s only got 7 cases in it! When building scorecards, we don’t want bands with so few cases in, they won’t be stable over time.

Fortunately, we can specify the minimum size a node can be:

rp <- rpart(Bad ~ Age, control=rpart.control(minbucket=50));

Only 3 bands this time, but the minimum band has 137 cases — much better. Let’s see what the thresholds are:

     Age      Age 
35.29226 25.94350

# Create the banded variable:
AgeBand <- cut(Age, breaks=c(-Inf, 25.94350, 35.29226, Inf), right=FALSE);

If we crosstab AgeBand and Bad, we get:

              0   1 Total      %
[-Inf,25.9)  94  43   137 31.387
[25.9,35.3) 290  66   356 18.539
[35.3, Inf) 473  34   507  6.706
Total       857 143  1000 14.300

(I used table to do the crosstab, then used addmargins and cbind to add the margins and percentages.)

Clearly, as age increases, the bad rate decreases significantly. The IV (Information Value) for AgeBand is 0.4954, so it’s a variable that would be a definite candidate for inclusion in our final scorecard.

Although this is an easy method of working out the bands, I’d still recommend the traditional method alongside, as (a) you’ve got more control over the combining of ‘fine’ classes, and over the relative percentages of bads in each band, and (b) it’s useful to have more than one discretised version of a variable available to the model building process — especially if your scorecard is based upon a logistic regression, and not a set of ‘weights of evidence’. The regression takes correlation into account, and hence the ‘best’ bands for a particular variable can be different once other, more information-rich variables have been added to the model.


, , , , ,

Leave a comment