
### Binning with decision trees

Posted by sqlpete in scorecards, stats on January 23, 2017

When building a credit risk scorecard, it’s standard practice to take a continuous variable and discretise (or ‘bin’) it into a small number of bands^{*}. A common approach is to:

- Partition the variable into 10-20 subsets of equal size. *This is called ‘fine classing’.*
- Use bad rates^{**} to combine similar adjacent subsets, producing a variable with fewer levels while not overly reducing its significance. *This is called ‘coarse classing’.*
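As a rough sketch of the ‘fine classing’ step in base R: the data here is a made-up stand-in (uniform ages and an assumed seed, not the dataset generated later in this post), and the names `age`, `bad` and `fine_summary` are purely illustrative.

```r
# Hypothetical stand-in data: ages 18-65, with the bad rate falling as age rises
set.seed(1)  # assumed seed, only so the run is repeatable
age <- runif(1000, min = 18, max = 65)
bad <- rbinom(1000, 1, 1 / (1 + exp(-(1.5 - 0.1 * age))))

# Fine classing: cut at the deciles so each class holds roughly 10% of cases
breaks <- quantile(age, probs = seq(0, 1, by = 0.1))
fine <- cut(age, breaks = breaks, include.lowest = TRUE)

# Size and bad rate of each fine class
fine_summary <- data.frame(
  n        = as.vector(table(fine)),
  bad_rate = as.vector(tapply(bad, fine, mean)),
  row.names = levels(fine)
)
print(fine_summary)
```

Coarse classing would then merge adjacent rows of `fine_summary` whose bad rates are similar.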

There’s a neater, simpler way to work out a good set of bands for our continuous variables, using **decision trees**.

^{*} A ‘proper’ statistician would never do this, but this is just what we do when we build credit risk scorecards. Please don’t blame me! 🙂

^{**} Or odds ratio, or chi-squared, or whichever statistic makes most sense to you. Personally, I use Fisher’s exact p-value.
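For illustration, here is one way a Fisher’s exact test on two adjacent bands might look. The counts are borrowed from the two oldest bands in the crosstab later in this post; the 0.05 cut-off and the name `merge_bands` are assumptions for the sketch, not a rule from the post.

```r
# Rows: two adjacent bands; columns: goods (Bad = 0) and bads (Bad = 1)
adjacent <- matrix(c(290, 66,
                     473, 34),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Band = c("[25.9,35.3)", "[35.3, Inf)"),
                                   Bad  = c("0", "1")))

ft <- fisher.test(adjacent)
ft$p.value  # a small p-value suggests the bad rates differ, so keep the bands apart

# Assumed decision rule: only merge when the difference is not significant
merge_bands <- ft$p.value > 0.05
merge_bands
```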

For the sake of example, let’s say we have a dataset composed of a binary outcome, **Bad** (e.g. ‘Went *s* payments down within *t* months: yes/no’), and a single explanatory variable: **Age**.

We’ll generate a dummy dataset:

```
# Let's add an air of sophistication by generating our variable
# from a truncated normal distribution:
library(truncnorm);
Age <- rtruncnorm(1000,a=18,b=65,mean=35,sd=10); # 1000 samples
summary(Age); # Verify that 18 <= Age <= 65 ; run hist(Age) to check it looks 'normal'
# Dummy up a relationship between Age and the outcome:
z <- (-0.1 * Age) + 1.5;
prob <- 1/(1+exp(-z)); # inverse logit function
Bad <- rbinom(1000, 1, prob);
table(Bad);
# Bad
#   0   1
# 857 143
```

For this dummy dataset, our bad rate is **14.3%**.

```
# Check the Age coefficient is ~ -0.1, and the intercept is ~ 1.5
# (they probably won't be very close)
glm(Bad ~ Age, family=binomial);
# (Intercept)         Age
#     1.10617    -0.08565
# As you can see, not very close.
# (What happens with 10,000 samples, instead of 1,000?)
```
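To explore the question in that last comment, here is a sketch that refits the model on 10,000 samples. To keep it free of extra packages it swaps `rtruncnorm` for a base-R stand-in (draw normals, keep those inside the range); the seed is an assumption, so exact figures will differ from run to run without it.

```r
set.seed(2017)  # assumed seed, only so the run is repeatable
n <- 10000

# Base-R stand-in for rtruncnorm(): draw extra normals, keep those in [18, 65]
draws <- rnorm(2 * n, mean = 35, sd = 10)
Age <- draws[draws >= 18 & draws <= 65][1:n]

# Same dummy relationship as before
z <- (-0.1 * Age) + 1.5
Bad <- rbinom(n, 1, 1 / (1 + exp(-z)))

fit <- glm(Bad ~ Age, family = binomial)
coef(fit)  # with ten times the data, expect estimates nearer 1.5 and -0.1
```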

**rpart** is an R function (and library) for creating decision / classification trees; see, for example, r-bloggers. Let’s try running `rpart` on our data:

```
library(rpart); # Load in the already-installed rpart library
library(rpart.plot); # For fancy-looking decision trees
library(rattle); # Provides fancyRpartPlot(), which builds on rpart.plot
rp <- rpart(Bad ~ Age); # Similar formula-based syntax to glm()
fancyRpartPlot(rp);
```

At the bottom of the diagram, you can see **4** leaf nodes, hence we have 4 age bands, which seems ok. However, look at the bottom-right node: it’s only got 7 cases in it! When building scorecards, we don’t want bands with so few cases in them, as they won’t be stable over time.

Fortunately, we can specify the minimum size a node can be:

```
rp <- rpart(Bad ~ Age, control=rpart.control(minbucket=50));
fancyRpartPlot(rp)
```
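If you’d rather check the leaf sizes without reading them off the plot, something like the following should work. It is a sketch on stand-in data (uniform ages and an assumed seed), so the counts will not match the figures in this post.

```r
library(rpart)

set.seed(1)  # assumed seed
Age <- runif(1000, 18, 65)  # stand-in for the truncated normal used above
Bad <- rbinom(1000, 1, 1 / (1 + exp(-(1.5 - 0.1 * Age))))

rp <- rpart(Bad ~ Age, control = rpart.control(minbucket = 50))

# rp$where holds, for each case, the row of rp$frame for the leaf it lands in,
# so tabulating it gives the number of cases per leaf
leaf_sizes <- table(rp$where)
leaf_sizes
min(leaf_sizes)  # should never fall below minbucket
```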

Only **3** bands this time, but the minimum band has **137** cases — much better. Let’s see what the thresholds are:

```
rp$splits[,"index"];
#      Age      Age
# 35.29226 25.94350
# Create the banded variable:
AgeBand <- cut(Age, breaks=c(-Inf, 25.94350, 35.29226, Inf), right=FALSE);
```
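Rather than typing the thresholds in by hand, the breaks can be pulled straight from the fitted tree. This sketch again uses stand-in data and an assumed seed; it also assumes a single predictor, so that every row of `rp$splits` is a primary split on `Age` (with more predictors, competing and surrogate splits show up in there too).

```r
library(rpart)

set.seed(1)  # assumed seed
Age <- runif(1000, 18, 65)  # stand-in data, not the post's dataset
Bad <- rbinom(1000, 1, 1 / (1 + exp(-(1.5 - 0.1 * Age))))

rp <- rpart(Bad ~ Age, control = rpart.control(minbucket = 50))

# Sorted split points; rp$splits is NULL if the tree made no splits at all
thresholds <- if (is.null(rp$splits)) numeric(0) else sort(unique(rp$splits[, "index"]))

# right = FALSE gives [a, b) bands, matching rpart's "Age < threshold" rule
AgeBand <- cut(Age, breaks = c(-Inf, thresholds, Inf), right = FALSE)
table(AgeBand)
```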

If we crosstab **AgeBand** and **Bad**, we get:

```
               0   1 Total      %
[-Inf,25.9)   94  43   137 31.387
[25.9,35.3)  290  66   356 18.539
[35.3, Inf)  473  34   507  6.706
Total        857 143  1000 14.300
```
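One way a table like this might be assembled is sketched below. Since the original random draw isn’t reproducible here, it starts from the counts shown above rather than from `table(AgeBand, Bad)`; the names `counts`, `tab` and `xt` are illustrative.

```r
# Counts copied from the crosstab above (stand-in for table(AgeBand, Bad))
counts <- as.table(matrix(c(94, 43,
                            290, 66,
                            473, 34),
                          ncol = 2, byrow = TRUE,
                          dimnames = list(AgeBand = c("[-Inf,25.9)", "[25.9,35.3)", "[35.3, Inf)"),
                                          Bad = c("0", "1"))))

tab <- addmargins(counts)                        # adds a "Sum" row and column
pct <- round(100 * tab[, "1"] / tab[, "Sum"], 3) # bad rate per row, as a percentage
xt  <- cbind(tab, "%" = pct)

# Relabel the margins to match the layout above
rownames(xt)[rownames(xt) == "Sum"] <- "Total"
colnames(xt)[colnames(xt) == "Sum"] <- "Total"
xt
```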

(I used `table` to do the crosstab, then used `addmargins` and `cbind` to add the margins and percentages.)

Clearly, as age increases, the bad rate decreases significantly. The IV (**I**nformation **V**alue) for **AgeBand** is **0.4954**, so it’s a variable that would be a definite candidate for inclusion in our final scorecard.
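For anyone wanting to check the IV figure, here is a short sketch using the usual weights-of-evidence definition, IV = Σ (%good − %bad) × ln(%good / %bad), applied to the counts in the crosstab. The variable names are illustrative.

```r
# Goods (Bad = 0) and bads (Bad = 1) per age band, from the crosstab
good <- c(94, 290, 473)
bad  <- c(43, 66, 34)

# Share of all goods / all bads that falls in each band
dist_good <- good / sum(good)
dist_bad  <- bad / sum(bad)

# Weight of evidence per band, then the information value
woe <- log(dist_good / dist_bad)
iv  <- sum((dist_good - dist_bad) * woe)

round(woe, 3)
round(iv, 4)  # should land very close to the 0.4954 quoted above
```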

Although this is an easy method of working out the bands, I’d still recommend the traditional method alongside, as (a) you’ve got more control over the combining of ‘fine’ classes, and over the relative percentages of bads in each band, and (b) it’s useful to have more than one discretised version of a variable available to the model building process — especially if your scorecard is based upon a logistic regression, and not a set of ‘weights of evidence’. The regression takes correlation into account, and hence the ‘best’ bands for a particular variable can be different once other, more information-rich variables have been added to the model.
