# Archive for category stats

### Exact p-value versus Information Value

As I think I’ve mentioned before, one of the ‘go-to’ stats in my scorecard-building toolkit is the p-value that results from performing Fisher’s Exact Test on contingency tables. It’s straightforward to generate (in most cases), and directly interpretable: it’s just a sum of the probabilities of ‘extreme’ tables. When I started building credit risk scorecards, and using the Information Value (IV) statistic, I had to satisfy myself that there was a sensible relationship between the two values. Now, my combinatoric skills are far too lacking to attempt a rigorous mathematical analysis, so naturally I turn to R and the far easier task of simulating lots of data!

I generated 10,000 2-by-2 tables at random, with cell counts between 5 and 100. Here’s a plot of the (base e) log of the resulting exact p-value, against the log of the IV:

(I’ve taken logs as the relationship is clearer.) As you can see, I’ve drawn in some lines for the typical levels of p-value that people care about (5%, 1% and 0.1%), and the same for the IV (0.02, 0.1, 0.3 and 0.5). In the main, it looks like you’d expect, no glaring outliers.
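
A minimal sketch of how a simulation like this could be set up is shown below. The helper names (`iv_2x2`, `simulate_tables`) and the convention that the two columns of each table hold the goods and the bads are assumptions made for illustration; this is not the code behind the plot described here.

```r
# Hypothetical helper: IV of a 2-by-2 table whose columns are goods and bads
iv_2x2 <- function(tab) {
  g <- tab[, 1] / sum(tab[, 1])  # distribution of goods over the two rows
  b <- tab[, 2] / sum(tab[, 2])  # distribution of bads over the two rows
  sum((g - b) * log(g / b))
}

# Hypothetical helper: n random tables with cell counts between 5 and 100
simulate_tables <- function(n) {
  res <- data.frame(p = numeric(n), iv = numeric(n))
  for (i in seq_len(n)) {
    tab <- matrix(sample(5:100, 4, replace = TRUE), nrow = 2)
    res$p[i]  <- fisher.test(tab)$p.value  # two-sided exact p-value
    res$iv[i] <- iv_2x2(tab)
  }
  res
}

set.seed(1)
sim <- simulate_tables(1000)  # the post used 10,000 tables
```

Something like `plot(log(sim$iv), log(sim$p))`, followed by `abline(h = log(c(0.05, 0.01, 0.001)))` and `abline(v = log(c(0.02, 0.1, 0.3, 0.5)))`, would then draw the reference lines mentioned above.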

For fun, I’ll look at those that fall into the area (p_exact > 0.05) and (0.3 < IV < 0.5):

6 29
10 15
p = 0.0751, IV = 0.332

5 84
7 37
p = 0.0613, IV = 0.321

In both cases, the exact p-value says there’s not much evidence that the row/column categories are related to each other — yet the IV tells us there’s “strong evidence”! Of course, the answer is that there’s no one single measure of independence that covers all situations; see, for instance, the famous Anscombe’s Quartet for a visual representation.
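
As a quick check, the two results can be recomputed in R. The sketch reads each set of four counts as a 2-by-2 table filled row by row, and treats the two columns as the goods and the bads; that layout is an assumption, chosen because it reproduces the IVs quoted above.

```r
# The two tables from the text, filled row by row (assumed layout)
tabs <- list(
  first  = matrix(c(6, 29, 10, 15), nrow = 2, byrow = TRUE),
  second = matrix(c(5, 84, 7, 37), nrow = 2, byrow = TRUE)
)

res <- t(sapply(tabs, function(tab) {
  g <- tab[, 1] / sum(tab[, 1])  # share of column 1 falling in each row
  b <- tab[, 2] / sum(tab[, 2])  # share of column 2 falling in each row
  c(p = fisher.test(tab)$p.value, IV = sum((g - b) * log(g / b)))
}))
round(res, 4)
```

Both rows should agree with the p-values and IVs listed above to the precision shown there.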

Practically, for the situations in which I’m using these measures, it doesn’t matter: if I have at least one indication of significance, I may as well add another candidate variable to the logistic regression that’ll form the basis of my scorecard. If the model selection process doesn’t end up using it, that’s fine.

Anyway, I end with a minor mystery. In my previous post, I came up with an upper bound for the IV, which means I can scale my IV to be between zero and one. I presumed that this new scaled version would be more correlated with the exact p-value; after all, how can a relationship with an IV of 0.25, but an upper bound of 5, be less significant than one with an IV of 0.375, but an upper bound of 15 (say)? Proportionally, the former is twice as strong as the latter, no?
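
The arithmetic behind that comparison, as a small sketch using only the figures quoted in the paragraph:

```r
# IVs and their (hypothetical) upper bounds, as quoted in the text
iv    <- c(0.25, 0.375)
bound <- c(5, 15)

scaled <- iv / bound   # 0.05 and 0.025
scaled[1] / scaled[2]  # 2: the first is proportionally twice as strong
```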

What I found was that the scaled version was consistently less correlated! Why would this be? Surely, the scaling is providing more information? I have some suspicions, but nothing concrete at present — hopefully, I can clear this up in a future post.

### Information Value

Despite having worked with it for years, it has always irked me that I don’t know the derivation of the Information Value (IV) statistic.

It’s used liberally throughout credit risk work, but the background to its invention seems somewhat hazy. Clearly it’s related to Shannon Entropy, via the $\sum p \log(p)$ construct. In Naeem Siddiqi’s well-known book Credit Risk Scorecards, he writes “Information Value, […] comes from information theory” and references Kullback’s 1959 book Information Theory and Statistics, which I don’t have. Someone else suggested that it stems from the work of I.J. Good, but I can’t find an explicit definition in any of his papers I’ve managed to look at. (I bought his book Good Thinking, about the foundations of probability and statistical inference, but it’s waaaay too complex for me!)

The Information Value (IV) is defined as:

$\mathrm{IV} = \sum_{i=1}^{k} (g_{i} - b_{i}) \log_e (g_{i} / b_{i})$

where $g_{i}$ is the proportion of all ‘goods’ that fall in category i, $b_{i}$ is the corresponding proportion of all ‘bads’, and k is the number of categories.
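
As a sketch, the formula can be turned into a small R function that starts from per-category counts and converts them to shares of the goods and bads totals, the same way the simulation code further down does. The function name `information_value` and the counts are made up for this example.

```r
# Sketch: IV from per-category counts of goods and bads
# (information_value is a name chosen for this example)
information_value <- function(goods, bads) {
  g <- goods / sum(goods)  # share of all goods in each category
  b <- bads / sum(bads)    # share of all bads in each category
  woe <- log(g / b)        # weight of evidence per category
  sum((g - b) * woe)
}

# Made-up counts for three categories
information_value(goods = c(50, 30, 20), bads = c(20, 30, 50))  # about 0.55
```

A category with no goods or no bads makes the logarithm infinite, so in practice such categories need merging with a neighbour first.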

In his book, Siddiqi gives the following rule of thumb regarding the value of IV:

| IV | Rule of thumb |
|---|---|
| < 0.02 | unpredictive |
| 0.02 to 0.1 | weak |
| 0.1 to 0.3 | medium |
| 0.3 to 0.5 | strong |
| 0.5+ | “should be checked for over-predicting” |

If an independent variable has an IV over 0.5, it may be so closely related to the dependent variable that it is effectively a proxy for it, and you might want to consider leaving it out. (If you build a scorecard that has a bureau score as one of your variables, then you’ll almost certainly see this.)
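
The rule of thumb is easy to apply in bulk with `cut()`. In this sketch the function name `iv_band` and the shortened label for the top band are made up, and treating each band's lower limit as inclusive is an assumption, since the rule itself doesn't say which side the boundaries fall on.

```r
# Sketch: label IVs with the rule-of-thumb bands
# (lower limits are treated as inclusive, which is an assumption)
iv_band <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf),
      labels = c("unpredictive", "weak", "medium", "strong",
                 "check for over-predicting"),
      right = FALSE)
}

as.character(iv_band(c(0.01, 0.05, 0.2, 0.4, 0.9)))
```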

[See these two links for more about Information Value, and an example or two of its use: All about “Information Value” and Information Value (IV) and Weight of Evidence (WOE).]

### Upper Bound

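The lower bound of the IV is fairly obviously zero: each term $(g_{i} - b_{i}) \log_{e}(g_{i} / b_{i})$ is the product of two factors that always share the same sign, so it can never be negative; and if $g_{i} \equiv b_{i}$ for all the categories, then every term is zero times $\log_{e}(1)$, which is also zero, so the sum is zero. But what about the upper bound?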

I’ve put together this small PDF document: Upper bound of the Information Value (IV), in which (I think!) I show that the upper bound is very close to $\log_{e}(N_{G}) + \log_{e}(N_{B})$, where $N_G$ is the total number of goods, and $N_B$ is the total number of bads.

Of course, it’s wise to at least check the result with some code — so in R, let’s create a million tables at random, and look at the actual figures that are produced:

Z <- 1000000; # number of iterations
IV <- rep(0, Z); # array of IVs
lGB <- rep(0, Z); # array of (log(n_g) + log(n_b))

for (i in 1:Z)
{
  k <- sample(2:20, 1); # number of categories
  g <- sample(1:100, k, replace=TRUE); # good
  b <- sample(1:100, k, replace=TRUE); # bad
  ng <- sum(g);
  nb <- sum(b);
  IV[i] <- sum( ((g/ng)-(b/nb)) * log((g/ng)/(b/nb)) );
  lGB[i] <- log(ng) + log(nb);
}
plot(IV, lGB, xlab="IV", ylab="log(N_G)+log(N_B)",
main="IV vs log(N_G)+log(N_B)", pch=19,col="blue",cex=0.5);
abline(a=0,b=1,col="red",lwd=2); # draw the line x=y


As you can see, there are no points below the red ‘x=y’ line; in other words, the IV is always less than $\log_{e}(N_{G}) + \log_{e}(N_{B})$. There are a few points that are close; the closest is:

min(lGB-IV)
[1] 0.2161227


I know that $\log_{e}(N_{G}) + \log_{e}(N_{B})$ is not the best possible upper bound — a closer, but more complex answer is reasonably obvious from the document — but “log(number of goods) plus log(number of bads)” is (a) memorable, and (b) close enough for me!
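
To see how close the IV can get to the bound, here is a sketch with a deliberately extreme two-category table; the counts are made up for illustration.

```r
# Two categories: almost all goods in one, almost all bads in the other
g <- c(99, 1)
b <- c(1, 99)
ng <- sum(g)
nb <- sum(b)

iv    <- sum(((g / ng) - (b / nb)) * log((g / ng) / (b / nb)))
bound <- log(ng) + log(nb)
c(IV = iv, bound = bound, gap = bound - iv)  # IV about 9.01, bound about 9.21
```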

### Binning with decision trees

When building a credit risk scorecard, it’s standard practice to take a continuous variable and discretise (or ‘bin’) it into a small number of bands*. A common approach is to:

1. Partition the variable into 10-20 subsets of equal size — This is called ‘fine classing’
2. Use bad rates** to combine similar adjacent subsets, to produce a variable with fewer levels, while not overly reducing its significance — This is called ‘coarse classing’
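
The fine classing step can be sketched in a few lines of R. The data here is a made-up stand-in (uniform ages and an outcome that gets rarer with age), and the object names are assumptions for illustration only.

```r
set.seed(42)
# Stand-in data (an assumption): ages 18-65, outcome rarer as age increases
age <- runif(1000, min = 18, max = 65)
bad <- rbinom(1000, 1, 1 / (1 + exp(-(1.5 - 0.1 * age))))

# Fine classing: ten bands holding roughly equal numbers of cases
breaks <- quantile(age, probs = seq(0, 1, by = 0.1))
fine   <- cut(age, breaks = breaks, include.lowest = TRUE)

# Bad rate per fine class, used to decide which neighbours to merge
fine_classes <- data.frame(n = as.vector(table(fine)),
                           bad_rate = as.vector(tapply(bad, fine, mean)))
rownames(fine_classes) <- levels(fine)
fine_classes
```

Coarse classing then amounts to merging adjacent rows of `fine_classes` whose bad rates are similar.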

There’s a neater, simpler way to work out a good set of bands for our continuous variables, using decision trees.

* A ‘proper’ statistician would never do this, but this is just what we do when we build credit risk scorecards. Please don’t blame me! 🙂
** Or odds ratio, or chi-squared, or whichever statistic makes most sense to you. Personally, I use Fisher’s exact p-value.

For the sake of example, let’s say we have a dataset composed of a binary outcome, Bad (e.g. ‘Went s payments down within t months: yes/no’), and a single explanatory variable: Age.

We’ll generate a dummy dataset:

# Let's add an air of sophistication by generating our variable
# from a truncated normal distribution:
library(truncnorm);
Age <- rtruncnorm(1000,a=18,b=65,mean=35,sd=10); # 1000 samples

summary(Age); # Verify that 18 <= Age <= 65 ; run hist(Age) to check it looks 'normal'

# Dummy up a relationship between Age and the outcome:
z <- (-0.1 * Age) + 1.5;
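prob <- 1/(1+exp(-z)); # inverse logit function
# The line that draws the outcome is missing here; one Bernoulli draw
# per customer is assumed:
Bad <- rbinom(length(prob), 1, prob);
table(Bad);
0   1
857 143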


For this dummy dataset, our bad rate is 14.3%.

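# Check the Age coefficient is ~ -0.1, and the intercept is ~ 1.5
# (they probably won't be very close)
# The fitting call is missing here; a logistic regression via glm() is assumed:
coef(glm(Bad ~ Age, family=binomial));
(Intercept)          Age
1.10617     -0.08565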

# As you can see, not very close.
# (What happens with 10,000 samples, instead of 1,000?)
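
One way to explore that question is sketched below. To keep it self-contained it uses base R only: `draw_age` is a hypothetical stand-in for `rtruncnorm()` that keeps normal draws inside [18, 65], and `fit_once` is likewise a name made up for this example.

```r
set.seed(123)

# Stand-in for rtruncnorm(): draw normals and keep those inside [18, 65]
draw_age <- function(n) {
  x <- rnorm(3 * n, mean = 35, sd = 10)
  x <- x[x >= 18 & x <= 65]
  x[seq_len(n)]
}

# Simulate n customers and return the fitted logistic regression coefficients
fit_once <- function(n) {
  age <- draw_age(n)
  bad <- rbinom(n, 1, 1 / (1 + exp(-(1.5 - 0.1 * age))))
  coef(glm(bad ~ age, family = binomial))
}

# True values are intercept 1.5 and slope -0.1
rbind(n_1000 = fit_once(1000), n_10000 = fit_once(10000))
```

With ten times the data, the standard errors shrink by a factor of roughly three, so the larger sample should generally land closer to the true values, although any single run can still miss.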


rpart is an R function (and library) for creating decision / classification trees; see, for example, r-bloggers. Let’s try running rpart on our data:

library(rpart); # Load in the already-installed rpart library
library(rpart.plot); # For fancy-looking decision trees
library(rattle); # fancyRpartPlot() comes from rattle, which draws via rpart.plot
rp <- rpart(Bad ~ Age); # Similar formula-based syntax to glm()
fancyRpartPlot(rp);


At the bottom of the diagram, you can see 4 leaf nodes, hence we have 4 age bands, which seems ok. However, look at the bottom-right node: it’s only got 7 cases in it! When building scorecards, we don’t want bands with so few cases in them, as they won’t be stable over time.

Fortunately, we can specify the minimum size a node can be:

rp <- rpart(Bad ~ Age, control=rpart.control(minbucket=50));
fancyRpartPlot(rp)

Only 3 bands this time, but the minimum band has 137 cases — much better. Let’s see what the thresholds are:


rp$splits[,"index"];
Age      Age
35.29226 25.94350

# Create the banded variable:
AgeBand <- cut(Age, breaks=c(-Inf, 25.94350, 35.29226, Inf), right=FALSE);
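
Rather than typing the thresholds in by hand, they can be pulled straight from the fitted tree. This is a sketch under two assumptions: the data is a made-up stand-in for the dummy dataset, and the model has a single predictor, so every row of `rp$splits` is a primary split on Age.

```r
library(rpart)

set.seed(7)
# Stand-in data (an assumption), similar in spirit to the dummy dataset
Age <- runif(1000, min = 18, max = 65)
Bad <- rbinom(1000, 1, 1 / (1 + exp(-(1.5 - 0.1 * Age))))

rp <- rpart(Bad ~ Age, control = rpart.control(minbucket = 50))

# rp$splits is NULL if the tree did not split at all
thresholds <- if (is.null(rp$splits)) numeric(0) else sort(unique(rp$splits[, "index"]))

# right = FALSE matches rpart's "Age < threshold" rule
AgeBand <- cut(Age, breaks = c(-Inf, thresholds, Inf), right = FALSE)
table(AgeBand)
```

With more than one predictor, `rp$splits` can also contain competitor and surrogate splits, so the rows would need filtering before being used as band boundaries.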


If we crosstab AgeBand and Bad, we get:


              0   1 Total      %
[-Inf,25.9)  94  43   137 31.387
[25.9,35.3) 290  66   356 18.539
[35.3, Inf) 473  34   507  6.706
Total       857 143  1000 14.300


(I used table to do the crosstab, then used addmargins and cbind to add the margins and percentages.)

Clearly, as age increases, the bad rate decreases significantly. The IV (Information Value) for AgeBand is 0.4954, so it’s a variable that would be a definite candidate for inclusion in our final scorecard.
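
Both the crosstab and the IV can be rebuilt from the published counts. The sketch below starts from those counts rather than the raw data, and treats Bad = 0 as the goods and Bad = 1 as the bads; the object names are made up for this example.

```r
# Counts taken from the crosstab (rows: age bands; columns: Bad = 0 / 1)
tab <- as.table(matrix(c(94, 290, 473, 43, 66, 34), ncol = 2,
                       dimnames = list(AgeBand = c("[-Inf,25.9)", "[25.9,35.3)", "[35.3, Inf)"),
                                       Bad = c("0", "1"))))

# Margins via addmargins(), bad-rate percentages via cbind()
m <- unclass(addmargins(tab))
crosstab <- cbind(m, "%" = round(100 * m[, "1"] / m[, "Sum"], 3))
crosstab

# IV from the per-band shares of goods (Bad = 0) and bads (Bad = 1)
g <- tab[, "0"] / sum(tab[, "0"])
b <- tab[, "1"] / sum(tab[, "1"])
iv <- sum((g - b) * log(g / b))
round(iv, 4)
```

The percentages and the IV should agree with the figures quoted above; the margins are labelled `Sum` rather than `Total` because that is the default name `addmargins()` uses.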

Although this is an easy method of working out the bands, I’d still recommend the traditional method alongside, as (a) you’ve got more control over the combining of ‘fine’ classes, and over the relative percentages of bads in each band, and (b) it’s useful to have more than one discretised version of a variable available to the model building process — especially if your scorecard is based upon a logistic regression, and not a set of ‘weights of evidence’. The regression takes correlation into account, and hence the ‘best’ bands for a particular variable can be different once other, more information-rich variables have been added to the model.

### Using median rather than mean

Analysts in consumer finance are used to dealing with large sets of incoming customer data, and are often asked to provide summary statistics to show whether application quality is changing over time: Are credit scores going down? What’s the average monthly income? What’s the average age of the customers applying this quarter, compared to last?

Early on in the process, the data is likely to be messy, unsanitised, and potentially chock-full of errors. (Once the data has been processed to the point of considering credit-worthiness, then it’s all clean and perfect, yeah..?) Therefore, you have to be careful when you report on this data, as it’s easy to get nonsense figures that will send your marketing and risk people off in the wrong direction.

Two really common errors seen in numerical application data:

1. Putting gross yearly salary into the net monthly income field, and it’s not always easy to spot which one it ought to be: if it says ‘£50000’, then it’s very likely to be an annual figure; but what about ‘£7000’? If monthly, that’s a really good wage; but I can assure you, people earning that much are still applying for credit.

2. Dates of birth: if unknown, it’s common to see ‘1st January 1900’ and similar. So when you convert it to age, the customer is over 100.

Also, if you happen to get credit scores with your data, you have to watch out for magic numbers like 9999 – which to Callcredit means “no [credit] score could be generated”, not that the customer has the credit-worthiness of Bill Gates or George Soros.
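
One defensive approach is to turn known placeholder values into `NA` before summarising. The sketch below uses a made-up three-row dataset; the column names are hypothetical, and the two placeholder values are the examples mentioned above.

```r
# Made-up application data containing two placeholder values
apps <- data.frame(
  dob   = as.Date(c("1985-06-15", "1900-01-01", "1972-11-30")),
  score = c(612, 9999, 540)
)

# Treat the placeholders as missing rather than as real values
apps$dob[apps$dob == as.Date("1900-01-01")] <- NA
apps$score[apps$score == 9999] <- NA

median(apps$score, na.rm = TRUE)  # 576, from the two genuine scores
```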

Hence, it’s fairly obvious that if you’re including these figures in mean averages, you’re going to be giving a misleading impression, and people can infer the wrong thing. For example, say you’ve 99 applications with an average monthly income of £2000, but there’s also an incorrect application with a figure of £50,000. If you report the mean, you’ll get an answer of £2480, instead of the correct £2010 (assuming that £50k salary translates to ~£3k take-home per month). However, if you report the median, you’ll get an answer of about £2000, whether the incorrect data is in there or not.
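
The figures in that example can be checked directly. The sketch assumes, for simplicity, that all 99 genuine applications show exactly £2000.

```r
# 99 genuine monthly incomes plus one annual salary in the monthly field
income <- c(rep(2000, 99), 50000)
mean(income)    # 2480
median(income)  # 2000

# The same record entered correctly as roughly 3000 per month
fixed <- c(rep(2000, 99), 3000)
mean(fixed)     # 2010
median(fixed)   # 2000
```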

In statistical parlance, the median is “a robust measure of central tendency”, whereas the mean is not. The median isn’t affected by a few outliers (at either end).

End note: Credit scores can be (roughly) normally distributed; for a normal distribution, the median and mean (and mode) are the same. But data doesn’t have to be normally distributed: e.g. call-waiting times follow the exponential distribution, where the median and mean are not the same.
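
A short sketch of that difference, using made-up parameters (a mean waiting time of 5 minutes, and a score distribution with mean 600 and standard deviation 50):

```r
rate <- 1 / 5                    # made-up rate: mean wait of 5 minutes
mean_wait   <- 1 / rate          # 5
median_wait <- qexp(0.5, rate)   # log(2) / rate, about 3.47

# For a normal distribution the median equals the mean
qnorm(0.5, mean = 600, sd = 50)  # 600
```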