# Archive for February, 2017

### Brute-force packing

Because I’m a huge music fan, I own quite a few CDs; rather more than I’d readily admit to! A few years ago, I started ripping my collection to FLAC, a lossless audio format, and backing everything up to DVD-R. In order to be efficient, I want to get as many albums as I can on each DVD-R disc, minimising the space wasted — this is known as a Packing Problem.

Now, I’d love to be able to present you with a highly-optimised algorithm that I wrote, but that’s not what I did: I brute-forced it. Processor cycles are very cheap, and if it’s going to be orders of magnitude quicker to iterate a few million times than it will be to research a whole new area of maths, then iterating it’ll be. My original code was a VB app (so I could drag folders onto a GUI), but here’s a similar version of the code in R:

set.seed(1234);
containerSize <- 4500; # roughly DVD-R size in MB
itemSize <- c(1641,1498,1747,751,1090,164,1602,1020,1126,553); # album sizes in MB
cat(sprintf("No. containers needed (no partitioning): %5.2f\n", sum(itemSize) / containerSize));

Z <- 1000; # Number of iterations

# To keep track of the best partition:
best.remainder <- 1.0;
best.partition <- NULL;

for (i in 1:Z) {

  working <- sample(itemSize); # randomly re-order our list of sizes
  partition <- list();
  k <- 1;
  # Using the order as per 'working', partition the items
  # such that the container size isn't exceeded:
  while (length(working) > 0) {
    this.partition.indexes <- which(cumsum(working) <= containerSize);
    partition[[k]] <- working[this.partition.indexes];
    working <- working[-(this.partition.indexes)];
    k <- k + 1;
  }
  npm1 <- length(partition) - 1; # Number of partitions, minus 1
  partition.totals <- unlist(lapply(partition, sum));
  remainder <- sum(rep(containerSize, npm1) - partition.totals[1:npm1]) /
    (npm1 * containerSize);

  if (remainder < best.remainder) {
    best.remainder <- remainder;
    best.partition <- partition;
    totals.str <- paste("(", paste(partition.totals, collapse=","), ")", sep="");
    partition.str <- paste(unlist(lapply(partition,
      function(x) paste("(", paste(x, collapse=","), ")", sep=""))), collapse=",");
    cat(sprintf("i = %3d, rem. = %5.2f%%; totals = %s; partition = %s\n", i,
      remainder * 100.0, totals.str, partition.str));
  }

} # end for loop


This code (1000 iterations) runs in the blink of an eye:

i =   1, rem. = 19.00%; totals = (3772,3518,3902);
partition = (1498,164,1090,1020),(1126,751,1641),(1602,553,1747)
i =   2, rem. = 13.56%; totals = (4439,3341,3412);
partition = (1602,1090,1747),(553,1498,1126,164),(1641,1020,751)
i =   4, rem. = 13.18%; totals = (3963,3851,3378);
partition = (1090,1747,1126),(751,1498,1602),(1641,553,164,1020)
i =   6, rem. =  4.78%; totals = (4303,4267,2622);
partition = (1641,1747,164,751),(553,1126,1498,1090),(1020,1602)
i =  13, rem. =  4.04%; totals = (4301,4335,2556);
partition = (1020,1126,553,1602),(1747,1498,1090),(751,1641,164)
i =  23, rem. =  0.26%; totals = (4478,4499,2215);
partition = (1090,1641,1747),(1020,1126,1602,751),(1498,553,164)
i = 524, rem. =  0.02%; totals = (4499,4499,2194);
partition = (1126,1602,751,1020),(1747,1498,1090,164),(553,1641)


The figure rem. is the percentage of space wasted on the discs that could be full — clearly, not all the discs can be 100% full. So in this case, I knew I was going to be burning three DVD-Rs, but there’s only 1 MB of unused space on each of the first two discs; for the third, I can either find some other files to backup, or keep those two albums to burn later — which is what I usually do; saving even more space, by repeatedly putting off burning the least-full disc.
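For illustration, here's a rough Python sketch of the same shuffle-and-fill heuristic (the function names are mine, not from the original VB or R code; it assumes every album fits on a single disc):

```python
import random

CONTAINER_SIZE = 4500  # roughly DVD-R capacity in MB
ITEM_SIZES = [1641, 1498, 1747, 751, 1090, 164, 1602, 1020, 1126, 553]

def fill_containers(sizes, capacity):
    """Fill containers left-to-right in the given order, starting a new
    container whenever the next item would exceed the capacity."""
    containers, current, total = [], [], 0
    for s in sizes:
        if current and total + s > capacity:
            containers.append(current)
            current, total = [], 0
        current.append(s)
        total += s
    if current:
        containers.append(current)
    return containers

def wasted_fraction(containers, capacity):
    """Fraction of space wasted across all containers except the last
    (the least-full disc is deliberately excluded, as in the R code)."""
    full = containers[:-1]
    return sum(capacity - sum(c) for c in full) / (len(full) * capacity)

rng = random.Random(1234)
best = None
for _ in range(1000):
    order = ITEM_SIZES[:]
    rng.shuffle(order)  # randomly re-order the list of sizes
    partition = fill_containers(order, CONTAINER_SIZE)
    rem = wasted_fraction(partition, CONTAINER_SIZE)
    if best is None or rem < best[0]:
        best = (rem, partition)

print(f"best remainder: {best[0]:.2%}")
```

Exactly as with the R version, a different seed will find a different (but usually similarly good) packing.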

### Exact p-value versus Information Value

As I think I’ve mentioned before, one of the ‘go-to’ stats in my scorecard-building toolkit is the p-value that results from performing Fisher’s Exact Test on contingency tables. It’s straightforward to generate (in most cases), and directly interpretable: it’s just a sum of the probabilities of ‘extreme’ tables. When I started building credit risk scorecards, and using the Information Value (IV) statistic, I had to satisfy myself that there was a sensible relationship between the two values. Now, my combinatoric skills are far too lacking to attempt a rigorous mathematical analysis, so naturally I turn to R and the far easier task of simulating lots of data!
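For a 2-by-2 table, that 'sum of the probabilities of extreme tables' is easy to code up from scratch. Here's a minimal pure-Python sketch (my own illustration, not part of the original analysis), using the hypergeometric distribution and the usual convention of summing every table whose probability doesn't exceed the observed one:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)

    def prob(x):
        # P(top-left cell = x), given the fixed row/column totals
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - (n - row1))  # smallest feasible top-left cell
    hi = min(row1, col1)            # largest feasible top-left cell
    # sum over all tables no more probable than the observed one;
    # the tiny tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

A perfectly balanced table like [[10, 10], [10, 10]] gives p = 1, as it should.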

I generated 10,000 2-by-2 tables at random, with cell counts between 5 and 100. Here’s a plot of the (base e) log of the resulting exact p-value, against the log of the IV:

(I’ve taken logs as the relationship is clearer.) As you can see, I’ve drawn in some lines for the typical levels of p-value that people care about (5%, 1% and 0.1%), and the same for the IV (0.02, 0.1, 0.3 and 0.5). In the main, it looks like you’d expect, no glaring outliers.

For fun, I’ll look at those that fall into the area (p_exact > 0.05) and (0.3 < IV < 0.5):

 6 29
10 15
p = 0.0751, IV = 0.332

 5 84
 7 37
p = 0.0613, IV = 0.321

In both cases, the exact p-value says there’s not much evidence that the row/column categories are related to each other — yet the IV tells us there’s “strong evidence”! Of course, the answer is that there’s no one single measure of independence that covers all situations; see, for instance, the famous Anscombe’s Quartet for a visual representation.

Practically, for the situations in which I’m using these measures, it doesn’t matter: if I have at least one indication of significance, I may as well add another candidate variable to the logistic regression that’ll form the basis of my scorecard. If the model selection process doesn’t end up using it, that’s fine.

Anyway, I end with a minor mystery. In my previous post, I came up with an upper bound for the IV, which means I can scale my IV to be between zero and one. I presumed that this new scaled version would be more correlated with the exact p-value; after all, how can a relationship with an IV of 0.25, but an upper bound of 5, be less significant than one with an IV of 0.375, but an upper bound of 15 (say)? Proportionally, the former is twice as strong as the latter, no?

What I found was that the scaled version was consistently less correlated! Why would this be? Surely, the scaling is providing more information? I have some suspicions, but nothing concrete at present — hopefully, I can clear this up in a future post.

### Information Value

Despite having worked with it for years, it has always irked me that I don’t know the derivation of the Information Value (IV) statistic.

It’s used liberally throughout credit risk work, but the background to its invention seems somewhat hazy. Clearly it’s related to Shannon Entropy, via the $\sum p \log(p)$ construct. In Naeem Siddiqi’s well-known book Credit Risk Scorecards, he writes “Information Value, […] comes from information theory” and references Kullback’s 1959 book Information Theory and Statistics, which I don’t have. Someone else suggested that it stems from the work of I.J. Good, but I can’t find an explicit definition in any of his papers I’ve managed to look at. (I bought his book Good Thinking, about the foundations of probability and statistical inference, but it’s waaaay too complex for me!)

The Information Value (IV) is defined as:

$\mathrm{IV} = \sum_{i=1}^{k} (g_{i} - b_{i}) \log_e (g_{i} / b_{i})$

, where $g_{i}$ is the proportion of ‘goods’ in category i (the count of goods in that category, divided by the total number of goods), and $b_{i}$ is the corresponding proportion of ‘bads’.
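In code form, the definition is just a few lines (a small Python illustration of mine, with the proportions computed from raw counts, as in the R simulation further down). Applied to the first 2-by-2 table from the previous section, with the goods counts in the first column, it reproduces the quoted IV of 0.332:

```python
from math import log

def information_value(goods, bads):
    """IV = sum of (g_i - b_i) * ln(g_i / b_i), where g_i and b_i are the
    proportions of goods and bads falling into category i."""
    n_g, n_b = sum(goods), sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        gp, bp = g / n_g, b / n_b
        iv += (gp - bp) * log(gp / bp)
    return iv

print(round(information_value([6, 10], [29, 15]), 3))  # 0.332
```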

In his book, Siddiqi gives the following rule of thumb regarding the value of IV:

< 0.02       unpredictive
0.02 to 0.1  weak
0.1 to 0.3   medium
0.3 to 0.5   strong
0.5+         “should be checked for over-predicting”

An independent variable with an IV over 0.5 might be too closely related to the dependent variable, and you might want to consider leaving it out. (If you build a scorecard that has a bureau score as one of your variables, then you’ll almost certainly see this.)

[See these two links for more about Information Value, and an example or two of its use: All about “Information Value” and Information Value (IV) and Weight of Evidence (WOE).]

### Upper Bound

The lower bound of the IV is fairly obviously zero: in each term, the difference $(g_{i} - b_{i})$ and the log-ratio $\log_{e}(g_{i}/b_{i})$ always have the same sign, so no term can be negative; and if $g_{i} \equiv b_{i}$ for all the categories, then every term is zero times $\log_{e}(1)$, which is zero. But what about the upper bound?

I’ve put together this small PDF document: Upper bound of the Information Value (IV), in which (I think!) I show that the upper bound is very close to $\log_{e}(N_{G}) + \log_{e}(N_{B})$, where $N_G$ is the total number of goods, and $N_B$ is the total number of bads.

Of course, it’s wise to at least check the result with some code — so in R, let’s create a million tables at random, and look at the actual figures that are produced:

Z <- 1000000; # number of iterations
IV <- rep(0, Z); # array of IVs
lGB <- rep(0, Z); # array of (log(n_g) + log(n_b))

for (i in 1:Z)
{
  k <- sample(2:20, 1); # number of categories
  g <- sample(1:100, k, replace=T); # goods
  b <- sample(1:100, k, replace=T); # bads
  ng <- sum(g);
  nb <- sum(b);
  IV[i] <- sum( ((g/ng)-(b/nb)) * log((g/ng)/(b/nb)) );
  lGB[i] <- log(ng) + log(nb);
}
plot(IV, lGB, xlab="IV", ylab="log(N_G)+log(N_B)",
  main="IV vs log(N_G)+log(N_B)", pch=19, col="blue", cex=0.5);
abline(a=0, b=1, col="red", lwd=2); # draw the line x=y


As you can see, there are no points below the red ‘x=y’ line; in other words, the IV is always less than $\log_{e}(N_{G}) + \log_{e}(N_{B})$. There are a few points that are close; the closest is:

min(lGB-IV)
[1] 0.2161227


I know that $\log_{e}(N_{G}) + \log_{e}(N_{B})$ is not the best possible upper bound — a closer, but more complex answer is reasonably obvious from the document — but “log(number of goods) plus log(number of bads)” is (a) memorable, and (b) close enough for me!

### Inexact date grouping using islands

A few years ago at SQLBits, I was fortunate enough to attend a fascinating lecture given by SQL Server MVP Itzik Ben-Gan on Gaps and islands:

Gaps and islands problems involve finding a range of missing values (gaps) or a range of consecutive values (islands) in a sequence of numbers or dates.

Recently, I had to produce a dataset which showed how our interest rates varied over time, by product; for example, products A to E started at 10% through to 50%, but have been adjusted periodically to where they are today, a few points different. Practically, most of the changes have been made at or around the same time — but not exactly. For technical reasons, the specified rates aren’t stored in the database, so there’s no InterestRate table, or a parent table called InterestRateSet that links them together. However, the final result of an application is stored, so we know that a product has been sold, and what the corresponding interest rate was on that day.

The challenge was to work out how many sets of interest rates we’ve had since day 1; but because not every product is purchased every day, if we group by product/day, then it looks like our rates change more often than they do. This is where the ‘gaps and islands’ concept comes in, and luckily I remembered the lecture from a few years before. I found and tweaked some of Itzik’s code from a 2012 article Solving Gaps and Islands with Enhanced Window Functions (SQL Server Pro website) to accept a UDT (in this case, a list of dates). [See previous post, Passing structured data to stored procedures.]

Here it is:

-- Drop/Create our UDT:
IF EXISTS (SELECT * FROM sys.types WHERE is_table_type = 1 AND name = 'DateListType')
	DROP TYPE DateListType
GO
CREATE TYPE DateListType AS TABLE ([Date] DATE)
GO

-- Drop/Create our function:
IF OBJECT_ID('dbo.fn_GetIslandsFromDateList') IS NOT NULL
	DROP FUNCTION dbo.fn_GetIslandsFromDateList
GO
CREATE FUNCTION dbo.fn_GetIslandsFromDateList
(
	@dateListType DateListType READONLY
	,@GapSize INT
)
RETURNS TABLE
AS
RETURN
(
	WITH cte_Distinct AS
	(
		SELECT
			[Date]
		FROM @dateListType
		GROUP BY [Date]

	), cte_Part1 AS
	(
		SELECT
			[Date]
			,CASE
				WHEN DATEDIFF(day, LAG([Date]) OVER(ORDER BY [Date]), [Date]) <= @GapSize
				THEN 0 ELSE 1 END AS IsStart
			,CASE
				WHEN DATEDIFF(day, [Date], LEAD([Date]) OVER(ORDER BY [Date])) <= @GapSize
				THEN 0 ELSE 1 END AS IsEnd
		FROM cte_Distinct
	)
	, cte_Part2 AS
	(
		SELECT
			[Date] AS RangeStart
			,CASE
				WHEN IsEnd = 1
				THEN [Date] ELSE LEAD([Date], 1) OVER(ORDER BY [Date]) END AS RangeEnd
			,IsStart
		FROM cte_Part1
		WHERE IsStart = 1 OR IsEnd = 1
	)
	SELECT
		ROW_NUMBER() OVER (ORDER BY RangeStart) AS ID
		,RangeStart
		,RangeEnd
	FROM cte_Part2
	WHERE IsStart = 1
)
GO


Some things to note:

• dbo.fn_GetIslandsFromDateList is an inline function, which you can almost think of as ‘a view that takes parameters’ (a parameterised view).
• You can use CTEs (Common Table Expressions) in inline functions. I love using CTEs, they can make the code very readable. Often, the parser turns them into standard sub-queries, so there’s no performance hit.
• The @GapSize parameter controls how far apart our islands can be — see the examples below.
• If you follow the code through, and break it down into its component parts, you can see how it works — like all the best code, it’s very neat and compact.
• To re-emphasise, this isn’t my algorithm, it’s Itzik Ben-Gan’s; I’ve done little more than re-format it for my own use.
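The islands logic itself is easy to prototype outside SQL; here’s a short Python sketch (mine, for illustration only) that applies the same ‘a gap bigger than @GapSize starts a new island’ rule to a list of dates:

```python
from datetime import date

def islands(dates, gap_size):
    """Group dates into (start, end) ranges; a new range starts whenever
    the gap to the previous date exceeds gap_size days."""
    ds = sorted(set(dates))  # de-duplicate, as per cte_Distinct
    if not ds:
        return []
    ranges = []
    start = prev = ds[0]
    for d in ds[1:]:
        if (d - prev).days > gap_size:
            ranges.append((start, prev))
            start = d
        prev = d
    ranges.append((start, prev))
    return ranges

dates = [date(2017, 2, 1), date(2017, 2, 2), date(2017, 2, 3),
         date(2017, 2, 6), date(2017, 3, 1), date(2017, 3, 2)]
print(len(islands(dates, 2)))  # 3 islands
print(len(islands(dates, 3)))  # 2 islands
```

With the same six dates as the SQL example below, a gap size of 2 gives three islands and a gap size of 3 gives two, matching the function’s results.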

Let’s feed some dates into our function:

DECLARE @myList DateListType
INSERT @myList([Date])
VALUES('2017Feb01'),('2017Feb02'),('2017Feb03')
,('2017Feb06'),('2017Mar01'),('2017Mar02')

SELECT * FROM dbo.fn_GetIslandsFromDateList(@myList, 2)
GO

ID   RangeStart RangeEnd
---- ---------- ----------
1    2017-02-01 2017-02-03
2    2017-02-06 2017-02-06
3    2017-03-01 2017-03-02


With a @GapSize of 2, we get 3 ranges (islands). With a @GapSize of 3:

SELECT * FROM dbo.fn_GetIslandsFromDateList(@myList, 3)
GO
ID   RangeStart RangeEnd
---- ---------- ----------
1    2017-02-01 2017-02-06
2    2017-03-01 2017-03-02


, we get 2 ranges, because the difference in days between 2017-02-03 and 2017-02-06 is exactly 3, which doesn’t exceed the @GapSize of 3.

So this code did the trick, and allowed us to work out exactly how many different sets of rates we’d actually had live. Yes, we could’ve worked it out by hand; but now we’ve got some reproducible code that can drive various different reports, that’ll show us exactly how our changes have affected the business.

A final thought: Quite often, solving a problem comes down to just knowing the right phrase to google!

### Passing structured data to stored procedures

As I wrote about here, I like to pass structured data around using XML (where applicable). While I think it’s the most flexible way, it’s understandable that people might not want to learn XML-parsing syntax, if they don’t have to. Another way of passing data into stored procedures (sprocs) is to use user-defined types (UDTs).

Here’s a simple example:

-- If the UDT already exists, delete it:
IF EXISTS (
	SELECT *
	FROM sys.types
	WHERE is_table_type = 1
	AND [name] = 'ListOfDates'
	AND [schema_id] = SCHEMA_ID('dbo')
)
BEGIN
	DROP TYPE dbo.ListOfDates
END
GO

-- Create our UDT:
CREATE TYPE dbo.ListOfDates AS TABLE
(
	DateTypeID TINYINT NOT NULL
	,[Date] DATE NOT NULL
)
GO

-- Let's test it out:

SET DATEFORMAT YMD

DECLARE @MyListOfDates AS dbo.ListOfDates

INSERT @MyListOfDates(DateTypeID, [Date])
VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03')
,(2, '2016Oct10'),(2, '2017Jan01')

SELECT * FROM @MyListOfDates
GO

DateTypeID Date
---------- ----------
1          2016-01-01
1          2016-01-02
1          2016-01-03
2          2016-10-10
2          2017-01-01



Hopefully, that looks straightforward; the way we’ve used it here is not dissimilar to a table variable (e.g. DECLARE @MyTable AS TABLE...), but we can’t pass a table variable to a sproc. We’ll create a test sproc and pass our new user-defined type to it:

-- If the sproc already exists, delete it:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
	DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create our sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
	@ListOfDates dbo.ListOfDates READONLY
)
AS
BEGIN

	SELECT
		DateTypeID
		,MIN([Date]) AS MinDate
		,MAX([Date]) AS MaxDate
	FROM @ListOfDates
	GROUP BY DateTypeID
	ORDER BY DateTypeID

END
GO

-- Test it out:
DECLARE @MyListOfDates AS dbo.ListOfDates
INSERT @MyListOfDates(DateTypeID, [Date])
VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03')
,(2, '2016Oct10'),(2, '2017Jan01')
,(3, '2017Feb01'),(3, '2017Feb03'),(3, '2017Feb28')

EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates
GO

DateTypeID MinDate    MaxDate
---------- ---------- ----------
1          2016-01-01 2016-01-03
2          2016-10-10 2017-01-01
3          2017-02-01 2017-02-28



See the READONLY in the CREATE PROCEDURE? Here’s what happens when you omit it:

The table-valued parameter "@ListOfDates" must be declared with the READONLY option.

This is one of the limitations of user-defined types: they can’t be used to pass data back. So we couldn’t make any changes to @ListOfDates in the sproc, and see those changes reflected outside of the sproc.

There’s another limitation: UDTs can’t be used cross-database. Here’s what happens:

The type name 'MyOtherDatabase.dbo.ListOfDates' contains more than the
maximum number of prefixes. The maximum is 1.

Even if you create the exact same type in another database, with the same name, it won’t work.

(Note that UDTs don’t have to be tables: go here [MSDN] for more info.)

Just for kicks, here’s how I’d replicate the above functionality, but using XML:

-- If the sproc already exists, delete it:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
	DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create the sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
	@ListOfDates XML
)
AS
BEGIN

	WITH cte_ExtractFromXML AS
	(
		SELECT
			DateTypeID	= d.i.value('(@type)[1]', 'INT')
			,[Date]		= d.i.value('.[1]', 'DATE')
		FROM @ListOfDates.nodes('//date') d(i)
	)
	SELECT
		DateTypeID
		,MIN([Date]) AS MinDate
		,MAX([Date]) AS MaxDate
	FROM cte_ExtractFromXML
	GROUP BY DateTypeID
	ORDER BY DateTypeID

END
GO

-- Test it with some XML data:
DECLARE @MyListOfDates XML
SET @MyListOfDates = '
<listOfDates>
<date type="1">2016Jan01</date>
<date type="1">2016Jan02</date>
<date type="1">2016Jan03</date>
<date type="2">2016Oct10</date>
<date type="2">2017Jan01</date>
<date type="3">2017Feb01</date>
<date type="3">2017Feb03</date>
<date type="3">2017Feb28</date>
</listOfDates>'

EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates
GO

DateTypeID  MinDate    MaxDate
----------- ---------- ----------
1           2016-01-01 2016-01-03
2           2016-10-10 2017-01-01
3           2017-02-01 2017-02-28



If we wanted to, we could declare our @ListOfDates parameter as OUTPUT, and make changes to it in the sproc, e.g.:

-- Drop the sproc if it exists:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create our (new and improved) sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
	@ListOfDates XML OUTPUT
)
AS
BEGIN

	DECLARE @Aggregated XML

	;WITH cte_ExtractFromXML AS
	(
		SELECT
			DateTypeID	= d.i.value('(@type)[1]', 'INT')
			,[Date]		= d.i.value('.[1]', 'DATE')
		FROM @ListOfDates.nodes('//date') d(i)
	)
	SELECT @Aggregated = (
		SELECT
			DateTypeID AS '@id'
			,MIN([Date]) AS '@MinDate'
			,MAX([Date]) AS '@MaxDate'
		FROM cte_ExtractFromXML
		GROUP BY DateTypeID
		ORDER BY DateTypeID
		FOR XML PATH('DataType'), ROOT('Aggregated')
	)

	SELECT @ListOfDates = (
		SELECT @ListOfDates, @Aggregated FOR XML PATH('Output'), TYPE
	)

END
GO

-- Run it:
DECLARE @MyListOfDates XML
SET @MyListOfDates = '
<listOfDates>
<date type="1">2016Jan01</date>
<date type="1">2016Jan02</date>
<date type="1">2016Jan03</date>
<date type="2">2016Oct10</date>
<date type="2">2017Jan01</date>
<date type="3">2017Feb01</date>
<date type="3">2017Feb03</date>
<date type="3">2017Feb28</date>
</listOfDates>'

EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates OUTPUT

SELECT @MyListOfDates AS MyListOfDates
GO

<Output>
<listOfDates>
<date type="1">2016Jan01</date>
<date type="1">2016Jan02</date>
<date type="1">2016Jan03</date>
<date type="2">2016Oct10</date>
<date type="2">2017Jan01</date>
<date type="3">2017Feb01</date>
<date type="3">2017Feb03</date>
<date type="3">2017Feb28</date>
</listOfDates>
<Aggregated>
<DataType id="1" MinDate="2016-01-01" MaxDate="2016-01-03" />
<DataType id="2" MinDate="2016-10-10" MaxDate="2017-01-01" />
<DataType id="3" MinDate="2017-02-01" MaxDate="2017-02-28" />
</Aggregated>
</Output>


As you can see, we’ve returned our original list, along with the aggregated data, both under a new parent node.