# Posts Tagged R

### Brute-force packing

Because I’m a huge music fan, I own quite a few CDs; rather more than I’d readily admit to! A few years ago, I started ripping my collection to FLAC, a lossless audio format, and backing everything up to DVD-R. In order to be efficient, I want to get as many albums as I can on each DVD-R disc, minimising the space wasted — this is known as a Packing Problem.

Now, I’d love to be able to present you with a highly-optimised algorithm that I wrote, but that’s not what I did: I brute-forced it. Processor cycles are very cheap, and if it’s going to be orders of magnitude quicker to iterate a few million times than it will be to research a whole new area of maths, then iterating it’ll be. My original code was a VB app (so I could drag folders onto a GUI), but here’s a similar version of the code in R:

set.seed(1234);
containerSize <- 4500; # roughly DVD size in MB
itemSize <- c(1641,1498,1747,751,1090,164,1602,1020,1126,553); # album sizes in MB
cat(sprintf("No. containers needed (no partitioning): %5.2f\n", sum(itemSize) / containerSize));

Z <- 1000; # Number of iterations

# To keep track of the best partition
best.remainder <- 1.0;
best.partition <- NULL;

for(i in 1:Z) {

working <- sample(itemSize); # randomly re-order our list of sizes
partition <- list();
k <- 1;
# Using the order as per 'working', partition the items
# such that the container size isn't exceeded:
while (length(working) > 0) {
this.partition.indexes <- which( cumsum(working) <= containerSize );
partition[[k]] <- working[this.partition.indexes];
working <- working[-(this.partition.indexes)];
k <- k+1;
}
npm1 <- length(partition) - 1; # Number of partitions minus 1
partition.totals <- unlist(lapply(partition, sum));
remainder <- (sum(rep(containerSize, npm1) - partition.totals[1:npm1]))
/ (npm1 * containerSize);

if (remainder < best.remainder) {
best.remainder <- remainder;
best.partition <- partition;
totals.str <- paste("(", paste(partition.totals, collapse=","), ")", sep="");
partition.str <- paste(unlist(lapply(partition,
function(x) paste("(",paste(x,collapse=","),")",sep=""))),collapse=",")
cat(sprintf("i = %3d, rem. = %5.2f%%; totals = %s; partition = %s\n", i,
remainder * 100.0), totals.str, partition.str));
}

} # end for loop


This code (1000 iterations) runs in the blink of an eye:

i =   1, rem. = 19.00%; totals = (3772,3518,3902);
partition = (1498,164,1090,1020),(1126,751,1641),(1602,553,1747)
i =   2, rem. = 13.56%; totals = (4439,3341,3412);
partition = (1602,1090,1747),(553,1498,1126,164),(1641,1020,751)
i =   4, rem. = 13.18%; totals = (3963,3851,3378);
partition = (1090,1747,1126),(751,1498,1602),(1641,553,164,1020)
i =   6, rem. =  4.78%; totals = (4303,4267,2622);
partition = (1641,1747,164,751),(553,1126,1498,1090),(1020,1602)
i =  13, rem. =  4.04%; totals = (4301,4335,2556);
partition = (1020,1126,553,1602),(1747,1498,1090),(751,1641,164)
i =  23, rem. =  0.26%; totals = (4478,4499,2215);
partition = (1090,1641,1747),(1020,1126,1602,751),(1498,553,164)
i = 524, rem. =  0.02%; totals = (4499,4499,2194);
partition = (1126,1602,751,1020),(1747,1498,1090,164),(553,1641)


The figure rem. is the percentage of space wasted on the discs that could be full — clearly, not all the discs can be 100% full. So in this case, I knew I was going to be burning three DVD-Rs, but there’s only 1 MB of unused space on each of the first two discs; for the third, I can either find some other files to backup, or keep those two albums to burn later — which is what I usually do; saving even more space, by repeatedly putting off burning the least-full disc.

### Information Value

Despite having worked with it for years, it has always irked me that I don’t know the derivation of the Information Value (IV) statistic.

It’s used liberally throughout credit risk work, but the background to its invention seems somewhat hazy. Clearly it’s related to Shannon Entropy, via the $\sum p \log(p)$ construct. In Naeem Siddiqi’s well-known book Credit Risk Scorecards, he writes “Information Value, […] comes from information theory” and references Kulback’s 1959 book Information Theory and Statistics, which I don’t have. Someone else suggested that it stems from the work of I.J. Good, but I can’t find an explicit definition in any of his papers I’ve managed to look at. (I bought his book Good Thinking, about the foundations of probability and statistical interference, but it’s waaaay too complex for me!)

The Information Value (IV) is defined as:

$\mathrm{IV} = \sum_{i=1}^{k} (g_{i} - b_{i}) \log_e (g_{i} / b_{i})$

, where $g_{i}$ is the number of ‘goods’ in category i, and $b_{i}$ is the number of ‘bads’.

In his book, Siddiqi gives the following rule of thumb regarding the value of IV:

 < 0.02 unpredictive 0.02 to 0.1 weak 0.1 to 0.3 medium 0.3 to 0.5 strong 0.5+ “should be checked for over-predicting”

For an independent variable with an IV over 0.5, it might be somehow related to the dependent variable, and you might want to consider leaving it out. (If you build a scorecard that has a bureau score as one of your variables, then you’ll almost certainly see this.)

[See these two links for more about Information Value, and an example or two of its use: All about “Information Value” and Information Value (IV) and Weight of Evidence (WOE).]

### Upper Bound

The lower bound of the IV is fairly obviously zero: if $g_{i} \equiv b_{i}$ for all the categories, then the difference is zero, so their sum is zero times $\log_{e}(1)$, which is also zero. But what about the upper bound?

I’ve put together this small PDF document: Upper bound of the Information Value (IV), in which (I think!) I show that the upper bound is very close to $\log_{e}(N_{G}) + \log_{e}(N_{B})$, where $N_G$ is the total number of goods, and $N_B$ is the total number of bads.

Of course, it’s wise to at least check the result with some code — so in R, let’s create a million tables at random, and look at the actual figures that are produced:

Z <- 1000000; # number of iterations
IV <- rep(0, Z); # array of IVs
lGB <- rep(0, Z); # array of (log(n_g) + log(n_b))

for (i in 1:Z)
{
k <- sample(2:20, 1); # number of categories
g <- sample(1:100, k, replace=T); # good
b <- sample(1:100, k, replace=T); # bad
ng <- sum(g);
nb <- sum(b);
IV[i] <- sum( ((g/ng)-(b/nb)) * log((g/ng)/(b/nb)) );
lGB[i] <- log(ng) + log(nb);
}
plot(IV, lGB, xlab="IV", ylab="log(N_G)+log(N_B)",
main="IV vs log(N_G)+log(N_B)", pch=19,col="blue",cex=0.5);
abline(a=0,b=1,col="red",lwd=2); # draw the line x=y


As you can see, there are no points below the red ‘x=y’ line; in other words, the IV is always less than $\log_{e}(N_{G}) + \log_{e}(N_{B})$. There are a few points that are close; the closest is:

min(lGB-IV)
[1] 0.2161227


I know that $\log_{e}(N_{G}) + \log_{e}(N_{B})$ is not the best possible upper bound — a closer, but more complex answer is reasonably obvious from the document — but “log(number of goods) plus log(number of bads)” is (a) memorable, and (b) close enough for me!

### Floats may not look distinct

The temporary table #Data contains the following:


SELECT * FROM #Data
GO

value
-------
123.456
123.456
123.456

(3 row(s) affected)


Three copies of the same number, right? However:


SELECT DISTINCT value FROM #Data
GO

value
-------
123.456
123.456
123.456

(3 row(s) affected)


We have the exact same result set. How can this be?

It’s because what’s being displayed isn’t necessarily what’s stored internally. This should make it clearer:


SELECT remainder = (value - 123.456) FROM #Data
GO

remainder
----------------------
9.9475983006414E-14
1.4210854715202E-14
0

(3 row(s) affected)


The numbers aren’t all 123.456 exactly; the data is in floating-point format, and two of the values were ever-so-slightly larger. The lesson is: be very careful when using aggregate functions on columns declared as type float.

Some other observations:

• The above will probably be reminiscent to anyone who’s done much text wrangling in SQL. Strings look identical to the eye, but different to SQL Server’s processing engine; you end up having to examine every character, finding and eliminating extraneous tabs (ASCII code 9), carriage returns (ASCII code 13), line-feeds (ASCII code 10), or even weirder.
• If your requirement warrants it, I can thoroughly recommend the GNU Multiple Precision Arithmetic Library, which stores numbers to arbitrary precision. It’s available as libraries for C/C++, and as the R package gmp:

# In R:

> choose(200,50);  # This is 200! / (150! 50!)
[1] 4.538584e+47
> library(gmp);
Attaching package: ‘gmp’
> chooseZ(200,50);
Big Integer ('bigz') :
[1] 453858377923246061067441390280868162761998660528

# Dividing numbers:
> as.bigz(123456789012345678901234567890) / as.bigz(9876543210)
Big Rational ('bigq') :
[1] 61728394506172838938859798528 / 4938271605
# ^^ the result is stored as a rational, in canonical form.