Re-creating complete records from audit data

Quite often, you’ll find that changes to data in a given table have been recorded in a separate table, with fields like this:

[field name], [timestamp], [old value], [new value]

Or maybe the data changes for ALL tables are recorded in the same place, in which case we’d have a column at the start telling us what object had been updated:

[object id], [field name], [timestamp], [old value], [new value]

Let’s say we have a Customer table, and the changes are recorded in a separate audit table. So if, say, the first name, date of birth, and postcode for a customer with ID 1234 were changed at the same time, we’d expect to see these records in our audit table:

CustomerID  FieldName   Timestamp         OldValue   NewValue
----------- ----------- ----------------- ---------- ----------
1234        FirstName   20200708 09:58:11 Rich       Richard
1234        DateOfBirth 20200708 09:58:11 1981-04-03 1981-04-02
1234        Postcode    20200708 09:58:11 SW1 1AA    SW1 2AA

It’s very common to want to re-format this data so that we get one row (resembling the original Customer record) per change, such that the row contains all the fields as they were at the time of the change. How do we do that? If we PIVOT just the new values, we get:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    1981-04-02  SW1 2AA

Looks ok. (We need to generate a record for the old values, but we’ll worry about that later.)

But what if the timestamps weren’t all the same – i.e. the three changes didn’t happen at the same time – and we did the PIVOT?

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    NULL        NULL
1234        20200709 09:58:12 NULL       1981-04-02  NULL
1234        20200710 09:58:13 NULL       NULL        SW1 2AA

This isn’t what we want; those NULLs shouldn’t be there. The above just shows the new value for whichever field changed. What we really want is this:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    1981-04-03  SW1 1AA
1234        20200709 09:58:12 Richard    1981-04-02  SW1 1AA
1234        20200710 09:58:13 Richard    1981-04-02  SW1 2AA

which shows ALL the fields as they were at the given date/time. And maybe an initial record, prior to any changes:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20010101 00:00:00 Rich       1981-04-03  SW1 1AA 
1234        20200708 09:58:11 Richard    1981-04-03  SW1 1AA
1234        20200709 09:58:12 Richard    1981-04-02  SW1 1AA
1234        20200710 09:58:13 Richard    1981-04-02  SW1 2AA

Obviously, in the data as shown, we have no idea what the original timestamp was; we need to use something that makes sense in context, like the date the original Customer record was written for the first time. (I’ve only used 1st Jan 2001 here to make the substitution obvious.)

How can we achieve this?

Worked example

Let’s work through a solution, using example data. Our data consists of two people who have an id, a height, a credit score, and a postcode.

IF OBJECT_ID('tempdb.dbo.#Changes','U') IS NOT NULL
BEGIN
	DROP TABLE #Changes
END
GO

SELECT
		PersonID	= CAST(PersonID AS INT)
		,[Date]		= CAST([Date] AS DATE)
		,[Field]
		,[From]
		,[To]
	INTO #Changes
	FROM (
		VALUES
			 (101, '2010Jan01', 'Height', '160', '161')
			,(101, '2010Jan01', 'CreditScore', '541', '542')
			,(101, '2010Feb02', 'Height', '161', '162')
			,(101, '2010Feb03', 'CreditScore', '542', '538')
			,(101, '2011Mar03', 'Height', '162', '163')
			,(101, '2011Mar03', 'CreditScore', '538', '536')
			,(101, '2012Dec30', 'Postcode', 'W1A 1AA', 'SW1A 2AA')
			,(102, '2015Jul15', 'Postcode', 'SW1A 1AA', 'SE1 9TG')
			,(102, '2015Aug21', 'Postcode', 'SE1 9TG', 'SE1 9DT')
			,(102, '2016Apr04', 'CreditScore', '602', '606')
			,(102, '2017May05', 'Height', '150', '148')
	) x(PersonID, [Date], [Field], [From], [To])
GO

SELECT * FROM #Changes
GO
PersonID    Date       Field       From     To
----------- ---------- ----------- -------- --------
101         2010-01-01 Height      160      161
101         2010-01-01 CreditScore 541      542
101         2010-02-02 Height      161      162
101         2010-02-03 CreditScore 542      538
101         2011-03-03 Height      162      163
101         2011-03-03 CreditScore 538      536
101         2012-12-30 Postcode    W1A 1AA  SW1A 2AA
102         2015-07-15 Postcode    SW1A 1AA SE1 9TG
102         2015-08-21 Postcode    SE1 9TG  SE1 9DT
102         2016-04-04 CreditScore 602      606
102         2017-05-05 Height      150      148

Note that the From/To fields are strings; that’s the only way to record this data for fields that could be of any type. (We’ll have to remember to convert the fields back to their proper datatypes when we’re done.)

How many records are we expecting in our final recordset?

SELECT PersonID, [Date]
FROM #Changes
GROUP BY PersonID, [Date]
PersonID    Date
----------- ----------
101         2010-01-01
101         2010-02-02
101         2010-02-03
101         2011-03-03
101         2012-12-30
102         2015-07-15
102         2015-08-21
102         2016-04-04
102         2017-05-05

For PersonID 101, our final result set will have 5 rows, plus 1 for our initial record (prior to any changes): a total of 6 rows. For PersonID 102, we’ll have 5 (4+1).

First, let’s expand the data out so we have a row per change per field (plus our initial ‘pre-change’ row):

IF OBJECT_ID('tempdb.dbo.#Expanded','U') IS NOT NULL
BEGIN
	DROP TABLE #Expanded
END
GO

;WITH cte_AllChanges AS
(
	SELECT
			x.PersonID
			,x.[Date]
			,y.[Field]
			,c.[To] AS [Value]
		FROM (
			-- All the possible PersonIDs and Dates:
			SELECT PersonID, [Date]
				FROM #Changes
				GROUP BY PersonID, [Date]
		) x
		-- All the possible fields:
		CROSS JOIN (
			SELECT [Field]
				FROM #Changes
				GROUP BY [Field]
		) y
		-- Join on the values we know about:
		LEFT JOIN #Changes c
			ON c.PersonID = x.PersonID
			AND c.[Date] = x.[Date]
			AND c.[Field] = y.[Field]

), cte_InitialRows AS
(
	-- The 'initial' rows for each of our PersonIDs
	SELECT
			PersonID
			,'2001Jan01' AS [Date]
			,[Field]
			,[From] AS [Value]
			,rn = ROW_NUMBER() OVER (PARTITION BY PersonID, [Field] ORDER BY [Date])
		FROM #Changes
)
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
		INTO #Expanded
		FROM cte_AllChanges
	
	UNION ALL

	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
		FROM cte_InitialRows
		WHERE rn = 1
GO

SELECT * FROM #Expanded
ORDER BY PersonID, [Date], Field
GO
PersonID    Date       Field       Value
----------- ---------- ----------- --------
101         2001-01-01 CreditScore 541
101         2001-01-01 Height      160
101         2001-01-01 Postcode    W1A 1AA
101         2010-01-01 CreditScore 542
101         2010-01-01 Height      161
101         2010-01-01 Postcode    NULL
101         2010-02-02 CreditScore NULL
101         2010-02-02 Height      162
101         2010-02-02 Postcode    NULL
101         2010-02-03 CreditScore 538
101         2010-02-03 Height      NULL
101         2010-02-03 Postcode    NULL
101         2011-03-03 CreditScore 536
101         2011-03-03 Height      163
101         2011-03-03 Postcode    NULL
101         2012-12-30 CreditScore NULL
101         2012-12-30 Height      NULL
101         2012-12-30 Postcode    SW1A 2AA
102         2001-01-01 CreditScore 602
102         2001-01-01 Height      150
102         2001-01-01 Postcode    SW1A 1AA
102         2015-07-15 CreditScore NULL
102         2015-07-15 Height      NULL
102         2015-07-15 Postcode    SE1 9TG
102         2015-08-21 CreditScore NULL
102         2015-08-21 Height      NULL
102         2015-08-21 Postcode    SE1 9DT
102         2016-04-04 CreditScore 606
102         2016-04-04 Height      NULL
102         2016-04-04 Postcode    NULL
102         2017-05-05 CreditScore NULL
102         2017-05-05 Height      148
102         2017-05-05 Postcode    NULL

(33 rows affected)

Now for the clever bit. For each of the NULLs in the recordset above, we want to fill in its last known value; that is, the most recent preceding value for that field that wasn’t a NULL. On the face of it, it’s not a trivial requirement: there’s no nice in-built window function to help us. Luckily, there’s a great blog post by Itzik Ben-Gan, The Last non NULL Puzzle, that gives us the answer. (I have it bookmarked; I use it a lot!)

Adapting Itzik Ben-Gan’s code from that website, we have:

IF OBJECT_ID('tempdb.dbo.#LastKnownNull','U') IS NOT NULL
BEGIN
	DROP TABLE #LastKnownNull
END
GO

;WITH cte_LastNonNull AS
(
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
			,grp = MAX(CASE WHEN [Value] IS NOT NULL THEN [Date] END)
					OVER (
						PARTITION BY PersonID, [Field]
						ORDER BY [Date]
						ROWS UNBOUNDED PRECEDING
					)
		FROM #Expanded
)
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
			,[Last non-NULL Value] = MAX([Value])
					OVER (
						PARTITION BY PersonID, [Field], grp
						ORDER BY [Date]
						ROWS UNBOUNDED PRECEDING
					)
	INTO #LastKnownNull
	FROM cte_LastNonNull
GO

SELECT * FROM #LastKnownNull
GO
PersonID    Date       Field       Value    Last non-NULL Value
----------- ---------- ----------- -------- -------------------
101         2001-01-01 CreditScore 541      541
101         2001-01-01 Height      160      160
101         2001-01-01 Postcode    W1A 1AA  W1A 1AA
101         2010-01-01 CreditScore 542      542
101         2010-01-01 Height      161      161
101         2010-01-01 Postcode    NULL     W1A 1AA
101         2010-02-02 CreditScore NULL     542
101         2010-02-02 Height      162      162
101         2010-02-02 Postcode    NULL     W1A 1AA
101         2010-02-03 CreditScore 538      538
101         2010-02-03 Height      NULL     162
101         2010-02-03 Postcode    NULL     W1A 1AA
101         2011-03-03 CreditScore 536      536
101         2011-03-03 Height      163      163
101         2011-03-03 Postcode    NULL     W1A 1AA
101         2012-12-30 CreditScore NULL     536
101         2012-12-30 Height      NULL     163
101         2012-12-30 Postcode    SW1A 2AA SW1A 2AA
102         2001-01-01 CreditScore 602      602
102         2001-01-01 Height      150      150
102         2001-01-01 Postcode    SW1A 1AA SW1A 1AA
102         2015-07-15 CreditScore NULL     602
102         2015-07-15 Height      NULL     150
102         2015-07-15 Postcode    SE1 9TG  SE1 9TG
102         2015-08-21 CreditScore NULL     602
102         2015-08-21 Height      NULL     150
102         2015-08-21 Postcode    SE1 9DT  SE1 9DT
102         2016-04-04 CreditScore 606      606
102         2016-04-04 Height      NULL     150
102         2016-04-04 Postcode    NULL     SE1 9DT
102         2017-05-05 CreditScore NULL     606
102         2017-05-05 Height      148      148
102         2017-05-05 Postcode    NULL     SE1 9DT

(33 rows affected)

All that’s left to do is PIVOT up these values into one row per PersonID/Date:

SELECT *
	FROM (
		SELECT
				PersonID, [Date], [Field], [Last non-NULL Value] AS [Value]
			FROM #LastKnownNull
	) x
	PIVOT
	(
		MIN([Value]) FOR [Field] IN ([CreditScore],[Height],[Postcode])
	) p
	ORDER BY PersonID, [Date]
GO
PersonID    Date       CreditScore Height   Postcode
----------- ---------- ----------- -------- --------
101         2001-01-01 541         160      W1A 1AA
101         2010-01-01 542         161      W1A 1AA
101         2010-02-02 542         162      W1A 1AA
101         2010-02-03 538         162      W1A 1AA
101         2011-03-03 536         163      W1A 1AA
101         2012-12-30 536         163      SW1A 2AA
102         2001-01-01 602         150      SW1A 1AA
102         2015-07-15 602         150      SE1 9TG
102         2015-08-21 602         150      SE1 9DT
102         2016-04-04 606         150      SE1 9DT
102         2017-05-05 606         148      SE1 9DT

And we’re done: from a list of changes, we’ve recreated the record as it was at the time of each change. This will allow us to more easily write queries concerned with how our data looked at any given point in time.
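
One loose end: as noted earlier, the values are still strings, because the audit table has to store every field in a single column. So the very last step is to cast the pivoted columns back to their proper datatypes. Here’s a minimal sketch, assuming CreditScore and Height are really integers (TRY_CAST will show up any values that don’t convert cleanly):

SELECT
		PersonID
		,[Date]
		,CreditScore	= TRY_CAST(p.CreditScore AS INT)
		,Height			= TRY_CAST(p.Height AS INT)
		,Postcode		-- already a string
	FROM (
		SELECT
				PersonID, [Date], [Field], [Last non-NULL Value] AS [Value]
			FROM #LastKnownNull
	) x
	PIVOT
	(
		MIN([Value]) FOR [Field] IN ([CreditScore],[Height],[Postcode])
	) p
	ORDER BY PersonID, [Date]
GO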

Regression: The analyst’s penknife

At work, I build regression models all the time; usually logistic regression, where the target variable is binary (0/1 or no/yes), e.g.:

  • Sales and marketing models (predicting which potential leads will convert)
  • Credit risk accept/reject scorecards (if we lend this person money, will they pay us back?)
  • Collections models (which of our customers is most likely to default in the next month?)
  • Fraud models (is this customer who they say they are?)

, but occasionally I build linear regression models too, where the target is a number (e.g. number of sales). The modelling process is basically the same for linear and logistic regression; it’s just that the model output requires a different interpretation.

These models are all probabilistic – that is, there is a component of randomness in the model, which means there can never be a mathematical formula that produces the true output for any given inputs. But regression can help us do more than build probabilistic models. I want to mention a couple of occasions where I’ve used linear regression to uncover deterministic equations; that is, an exact and precise mathematical description that links the inputs to the output.

1. Decrypting a secret number (sort of)

A company presented questionnaires to people, where each question had multiple-choice answers. For various reasons, the answers were required to be secret within the database – an average developer was able to see the data, but they weren’t supposed to be able to link answers to a person. (This wasn’t ‘top secret’ data; the secrecy was more for marketing purposes than anything.)

The database tables looked something like this (I’ve only shown the important columns and keys):

CREATE TABLE dbo.Person
(
	PersonID INT NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE dbo.Questionnaire
(
	QuestionnaireID INT NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE dbo.Questionnaire_Completed
(
	QuestionnaireID INT NOT NULL
	,PersonID INT NOT NULL
	,PRIMARY KEY CLUSTERED (QuestionnaireID, PersonID)
)

CREATE TABLE dbo.Question
(
	QuestionID INT NOT NULL PRIMARY KEY CLUSTERED
	,QuestionnaireID INT NOT NULL
)

CREATE TABLE dbo.Answer
(
	AnswerID INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
	,QuestionID INT NOT NULL
	,AnswerKey INT NOT NULL		-- mysterious!
)

For each answer, we knew what question and therefore questionnaire it related to, so we could do:

SELECT
		q.QuestionID
		,a.AnswerID
		,COUNT(1) AS N
	FROM dbo.Answer a
	JOIN dbo.Question q
		ON q.QuestionID = a.QuestionID
	WHERE q.QuestionnaireID = 123
	GROUP BY q.QuestionID, a.AnswerID

to get the results for a particular questionnaire. This whole process was built and administered by a single person. One day, that person was unavailable, and there was an urgent problem relating to a particular questionnaire; for legitimate reasons, a higher-up needed to know the answers a person had given.

Now, if the table Questionnaire_Completed had a field showing when the entry was inserted (a default Create date, see this post, for instance), then matching the data would’ve been trivial: the order in which the questionnaires were done would’ve almost certainly matched between the Questionnaire_Completed and Answer tables. However, we didn’t have that date; and it was probably not recorded for this very reason, to prevent correlating the pieces of information that would instantly reveal who answered what.
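
(To illustrate what I mean, something like the following on Questionnaire_Completed would have been enough; this is just a sketch, and the column and constraint names are made up.)

ALTER TABLE dbo.Questionnaire_Completed
	ADD CreatedDate DATETIME NOT NULL
		CONSTRAINT DF_QuestionnaireCompleted_CreatedDate DEFAULT (GETDATE())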

What we did have was the mysterious field AnswerKey in the Answer table. It wasn’t immediately obvious what it represented; it was just an integer that appeared random:

1763546779
-1375398177
406260249
-241234845

, so what was it? Clearly, it was very likely related to the PersonID (otherwise why obfuscate it?), but scrambled in a way to make it purposely difficult to join back. What to do? Well, linear regression tells you how numbers relate to other numbers, statistically, so that seemed a good place to start.

I created a rectangular dataset of what we had: QuestionnaireID, QuestionID, AnswerID, and AnswerKey (several thousand rows), and imported the data into R. Using linear regression (the glm function), I pretty quickly worked out that AnswerKey was of the form:

AnswerKey = k + a * QuestionnaireID + b * QuestionID + {SOMETHING ELSE}

How did I know this? Easy: the constant k and the parameters a and b were suspiciously close to being whole numbers, something that’s unlikely to happen by chance.

But what was the {SOMETHING ELSE}? It didn’t take very long to figure out that it was a simple linear function of PersonID and a fixed integer, one for each questionnaire. Given that we knew which PersonIDs had completed which questionnaire, we had a reduced set of possibilities — only certain values made sense.

The final formula for AnswerKey was:

k + a * QuestionnaireID + b * QuestionID + c * {Questionnaire Secret Number} + d * PersonID

, hence we could work out the PersonID for every record in the Answer table because we knew all the other pieces of information.

Overall, it probably took less than an hour to ‘crack’ the formula for the AnswerKey. As I said at the beginning, this wasn’t information that needed industrial-strength security, so the method used to hide it wasn’t inappropriate. However, this method of obscuring values isn’t something anyone should consider for sensitive information, e.g. bank details. For that sort of information, SQL Server provides some far better tools: see here. (Or even better: don’t store the data yourself; get a third-party company to do it!)

2. Payments on loans with existing charges

NB: I’m working on a post and document which explains this piece of work in much greater detail – watch this space!

Let’s say a customer takes out a loan for £500 over 3 months, and the interest rate is 1% per day (this is a very high interest rate, illustrative purposes only!). If the months are January-March in 2019, then there are 31 days in month 1, 28 days in month 2, 31 days in month 3. How much should the customer pay off each month if (a) the loan is to be completely paid off after the 3rd payment, and (b) the payments are equal in amount?

This isn’t a very straightforward calculation, as the lengths of the months are unequal, and the amount of outstanding principal balance (OPB) that accrues interest will be decreasing each month. The answer doesn’t really matter here – but it’s £275.49 for the first 2 payments, and £275.47 as the final payment (2 pence less, due to rounding). In total, they will pay back £826.45, of which £326.45 is interest(!)
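
For the curious, here’s one way of getting to that figure. This is my own sketch, and it assumes the 1% per day is charged on the outstanding principal only, with each month-end payment P clearing that month’s interest first; writing the balance after each of the three payments:

B_1 = 500(1 + 0.01 \times 31) - P, \quad B_2 = B_1(1 + 0.01 \times 28) - P, \quad B_3 = B_2(1 + 0.01 \times 31) - P = 0

\Rightarrow \quad P = \frac{500 \times 1.31 \times 1.28 \times 1.31}{1 + 1.31 + (1.28 \times 1.31)} = \frac{1098.30}{3.9868} \approx 275.49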

Now, I inherited some code that calculated these payment amounts, but it was very messy, and there was even a separate function for each number of months. Starting from first principles, I worked out a single equation that took the number of months as a parameter. This worked fine for calculating initial payment schedules, i.e. those for new customers.

However, the business also needed an equation for when payments were being ‘rescheduled’, and there were charges present – e.g. our customer had paid off some of the loan, but had incurred a late fee (£10 say), and was now asking to spread the repayments over 4 months instead of 3. This is where it gets trickier: charges don’t (legally, can’t) accrue interest, so our original equation no longer works. And it was far from clear how to adapt what we had to include this separate balance that didn’t accrue interest.

It’s actually possible to numerically calculate the payments for any schedule; all you need is code to step through the process of accrual and payment. You put your guess in at the top of the code, and see if the balance is zero at the end. If not, adjust the guess slightly and run it through again – keep going until you get a balance of zero. This is standard stuff; see, e.g., Newton’s method for an optimised way of getting to the answer faster.
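
To give a flavour of that iterative approach, here’s a minimal sketch in T-SQL for the January-March example above. It uses bisection rather than Newton’s method, and assumes (as before) that interest only ever accrues on outstanding principal, i.e. each payment clears that month’s interest, so the balance carried forward is pure principal.

DECLARE @lo DECIMAL(18,6) = 0.0
		,@hi DECIMAL(18,6) = 1000.0
		,@guess DECIMAL(18,6)
		,@balance DECIMAL(18,6)

WHILE (@hi - @lo > 0.00001)
BEGIN
	SET @guess = (@lo + @hi) / 2.0

	-- Step through the schedule: accrue a month's interest, then take a payment.
	SET @balance = 500.0
	SET @balance = @balance * (1.0 + 0.01 * 31) - @guess	-- January
	SET @balance = @balance * (1.0 + 0.01 * 28) - @guess	-- February
	SET @balance = @balance * (1.0 + 0.01 * 31) - @guess	-- March

	-- Money left over means the guess was too small; overpaid means too big.
	IF (@balance > 0.0) SET @lo = @guess ELSE SET @hi = @guess
END

SELECT MonthlyPayment = CAST(ROUND(@guess, 2) AS DECIMAL(18,2))		-- 275.49
GO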

So for any input we care to put in, we can actually get the correct answer. Doesn’t this automatically solve our problem? Why do we need the precise equations? Three reasons: (1) an equation will give us exact answers, and we’ll understand the precision of the answers we’re getting; (2) an equation will (well, should) take fewer steps to work through, hence less server processing; (3) the satisfaction of working it out!

At this point, I could generate the output (monthly payment) for any input (loan amount, interest rate, number of periods + their lengths, outstanding charges balance), but I just didn’t know how they were related exactly.

I randomly generated values for tens of thousands of inputs, calculated the output using the iterative procedure described above, and made a rectangular dataset. In R, as before, I used glm to start fitting models that helped me identify the relationship between the input and output values; in fact, I came up with code to auto-generate thousands of model formulas, and let the computer do even more of the work for me. For instance, if we have inputs A, B, C, D and an output E, then we can let the computer come up with new formulas at random:

E ~ A + D + (1/B)
E ~ A + B + (C/D) + ((A-B)/(C+D))
E ~ (C-B+A)/(D+A) + (B*C*D)/(A-3*D)
E ~ (A-3*B-C^2)/(sqrt(D+3)-sqrt(A*B))

etc. (The algorithm was a bit smarter than this; I’m just giving a flavour here.) It took a few evenings of trial and error playing with the code, but eventually this approach worked, and I homed in on the exact equations I needed. Unfortunately, it turns out there’s no single equation, but multiple equations, one for each number of periods – however, they’re very ‘constructable’, and the resulting function in code is very straightforward. I didn’t yet understand how to derive these equations from first principles – that was the next step – but I knew what I was aiming for, which made that task much easier.

Aggregating all the categorical data in a table

One of the measures used to compare datasets for similarity is the Population Stability Index (PSI) value. (See here or here for details on how to calculate it.) It’s used for comparing, e.g., one financial quarter with another, or a set of training data to a set of test data. You generate the PSI value for each variable that exists in both datasets; very simply, a low value implies the two distributions are the same, while a high value implies they are different. If the PSI values for every variable are below a certain threshold, then you can infer that the two datasets follow the same overall distribution, and that models built over one dataset will work satisfactorily on the other.

We often need to calculate the PSI value for all categorical variables in a dataset; in SQL, this translates to finding the COUNTs for all the possible values in a column, for all columns, where the column is usually of datatype (N)VARCHAR.

Our example dataset, stored as a table MyCategories in SQL Server, has 10,000 rows and 6 columns: one integer primary key column (ID), and five VARCHAR:

SELECT TOP 100 * FROM dbo.MyCategories
ID          Cat1       Cat2       Cat3       Cat4       Cat5
----------- ---------- ---------- ---------- ---------- ----------
1           Alan       Emma       South      Start      Up
2           Bobby      Donna      North      Middle     Down
3           Bobby      Belinda    North      NULL       NULL
4           David      Donna      NULL       Start      NULL
5           Charlie    Anna       West       Start      Up

To perform our PSI calculation across all variables at once, we require a dataset consisting of all the different category values, and their counts. Naively, we could just do this:

SELECT Category = 1, Cat1 AS [Value], COUNT(1) AS N FROM dbo.MyCategories GROUP BY Cat1
UNION ALL SELECT Category = 2, Cat2, COUNT(1) FROM dbo.MyCategories GROUP BY Cat2
UNION ALL SELECT Category = 3, Cat3, COUNT(1) FROM dbo.MyCategories GROUP BY Cat3
UNION ALL SELECT Category = 4, Cat4, COUNT(1) FROM dbo.MyCategories GROUP BY Cat4
UNION ALL SELECT Category = 5, Cat5, COUNT(1) FROM dbo.MyCategories GROUP BY Cat5

which gives us the required result set:

Category Value      N
-------- ---------- -----
1        NULL       832
1        Alan       1650
1        Bobby      1644
1        Charlie    1604
1        David      1612
1        Eric       1779
1        Frank      879
2        NULL       1003
2        Anna       2042
...

This is fine for a few columns. But what about when you have hundreds?

We could generate our list of UNION ALLs, one for each of our columns, by various mechanisms — I often use Excel for things like this. But we’ll do it differently here, by using a loop. Firstly, let’s get our columns of interest and store them in a temp table:

IF OBJECT_ID('tempdb.dbo.#Columns','U') IS NOT NULL
BEGIN
	DROP TABLE #Columns
END
GO

CREATE TABLE #Columns
(
	colid INT NOT NULL PRIMARY KEY CLUSTERED
	,colname VARCHAR(100) NOT NULL
)
GO

;WITH cte_Columns AS
(
	SELECT
			ORDINAL_POSITION
			,COLUMN_NAME
		FROM INFORMATION_SCHEMA.COLUMNS
		WHERE TABLE_SCHEMA = 'dbo'
		AND TABLE_NAME = 'MyCategories'
		-- Your selection criteria here;
		-- in my case, it was all columns of type VARCHAR
		AND DATA_TYPE = 'varchar'
)
INSERT #Columns
(
	colid
	,colname
)
	SELECT
			colid	 = ROW_NUMBER() OVER (ORDER BY ORDINAL_POSITION)
			,colname = COLUMN_NAME
	FROM cte_Columns
GO

SELECT * FROM #Columns
GO

colid colname
----- -------
1     Cat1
2     Cat2
3     Cat3
4     Cat4
5     Cat5

Next, we’ll use a WHILE loop to do the aggregation for each column:

IF OBJECT_ID('tempdb.dbo.##Results','U') IS NOT NULL
BEGIN
	DROP TABLE ##Results
END
GO

CREATE TABLE ##Results
(
	[var] VARCHAR(100) NOT NULL
	,i INT NOT NULL
	,[value] VARCHAR(100) NULL
	,N INT NOT NULL
	,Total INT NOT NULL
)
GO

DECLARE @id INT, @maxid INT, @colname VARCHAR(100), @sql NVARCHAR(MAX)
SELECT @id = 1, @maxid = MAX(colid) FROM #Columns

WHILE (@id <= @maxid)
BEGIN

	SET @colname = NULL
	SET @sql = NULL

	SELECT @colname = colname FROM #Columns WHERE colid = @id

	SET @sql = N'
		INSERT ##Results
		(
			[var]
			,i
			,[value]
			,N
			,Total
		)
		SELECT
			''' + @colname + ''' as [var]
			,ROW_NUMBER() OVER (ORDER BY ' + @colname + ') AS i
			,' + @colname + ' AS [value]
			,COUNT(1) AS N
			,SUM(COUNT(1)) OVER () AS Total
			FROM dbo.MyCategories
			GROUP BY ' + @colname + '
			ORDER BY ' + @colname 

	EXEC sp_executesql @sql

	SET @id = @id + 1

END
GO

SELECT * FROM ##Results ORDER BY [var], i
GO
var  i    value      N    Total
---- ---- ---------- ---- -----
Cat1 1    NULL       832  10000
Cat1 2    Alan       1650 10000
Cat1 3    Bobby      1644 10000
Cat1 4    Charlie    1604 10000
Cat1 5    David      1612 10000
Cat1 6    Eric       1779 10000
Cat1 7    Frank      879  10000
Cat2 1    NULL       1003 10000
Cat2 2    Anna       2042 10000
...

The results are the same, ignoring a couple of handy extra columns that we’ll need later in our calculation.

Some notes:

  • Everything between ‘SET @sql = N’ and the EXEC is inside a string, and sp_executesql executes (runs) the SQL contained within the string variable @sql. This technique of programmatically building SQL within a string and then executing it is referred to as ‘dynamic SQL’. (If you’ve not encountered it before, putting an N directly before the first quote means the string is ‘wide’, of type NVARCHAR.)
  • Why do I bother setting the variables @colname and @sql to NULL, before immediately populating them in the next two statements? Bitter experience! If the code that follows doesn’t work somehow, your local working variables might end up with values you’re not expecting, hence it’s just good practice to set them to NULL at the start of the loop. (Believe me, it can save a lot of head-scratching and debugging.)
  • This might look like a lot of code, but it only takes a few minutes to write; I tend to follow the same pattern for all my SQL loops.

You’re probably thinking “Is there a shorter way of doing this?”. The answer is: yes, but with caveats. Here’s one approach – UNPIVOT our columns of interest, then aggregate:

;WITH cte_UnpivotCats AS
(
	SELECT
			ID
			,[var]
			,[value]
		FROM dbo.MyCategories
		UNPIVOT
		(
			[value] FOR [var] IN ([Cat1],[Cat2],[Cat3],[Cat4],[Cat5])
		) x
)
	SELECT
			[var]
			,[value]
			,COUNT(1) AS N
		FROM cte_UnpivotCats
		GROUP BY [var]
			,[value]
GO

var  value      N
---- ---------- ----
Cat1 Alan       1650
Cat1 Bobby      1644
Cat1 Charlie    1604
Cat1 David      1612
Cat1 Eric       1779
Cat1 Frank      879
Cat2 Anna       2042
Cat2 Belinda    1995
...

Spotted the problem? UNPIVOT doesn’t return any data when the [value] is NULL. If your data absolutely doesn’t have any NULLs, then yes, this works. But if you’ve got hundreds of columns, then you’ll still have to build up the list of columns within the UNPIVOT programmatically, in order to execute it as dynamic SQL. (Oh, and you’ll have to ensure that the columns are all of the exact same type and length; UNPIVOT is very fussy about that!)

Here’s a modification that uses CROSS JOIN instead of UNPIVOT (adapted from a post on stackoverflow), which preserves the NULLs; but as above, you still have to work out how you’re going to code this if you have hundreds of columns.

;WITH cte_UnpivotCats AS
(
	SELECT
			ID
			,[var] = x.colname
			,[value] = CASE x.colname
						WHEN 'Cat1' THEN m.Cat1
						WHEN 'Cat2' THEN m.Cat2
						WHEN 'Cat3' THEN m.Cat3
						WHEN 'Cat4' THEN m.Cat4
						WHEN 'Cat5' THEN m.Cat5
					END
		FROM dbo.MyCategories m
		CROSS JOIN (
			VALUES
				('Cat1'),('Cat2'),('Cat3'),('Cat4'),('Cat5')
		) x(colname)

)
	SELECT
			[var]
			,[value]
			,COUNT(1) AS N
		FROM cte_UnpivotCats
		GROUP BY [var]
			,[value]
GO
var  value      N
---- ---------- ----
Cat1 NULL       832
Cat1 Alan       1650
Cat1 Bobby      1644
Cat1 Charlie    1604
Cat1 David      1612
Cat1 Eric       1779
Cat1 Frank      879
Cat2 NULL       1003
Cat2 Anna       2042
...

And we have our NULLs back. Again, we’ll have to use dynamic SQL if we want to build our query programmatically due to having many columns. But we now have our table of aggregate values and counts; if we have this data for both datasets, then calculating the PSI value is easy.
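
To show that last step, here’s a sketch of the PSI calculation itself, assuming we’ve built one such aggregate table per dataset (##Results_A and ##Results_B are made-up names), and that every value occurs in both datasets (zero counts need handling separately). The PSI for a variable is the sum, over its categories, of (pctA - pctB) x ln(pctA / pctB):

SELECT
		a.[var]
		,PSI = SUM(
				(CAST(a.N AS FLOAT) / a.Total - CAST(b.N AS FLOAT) / b.Total)
				* LOG((CAST(a.N AS FLOAT) / a.Total) / (CAST(b.N AS FLOAT) / b.Total))
			)
	FROM ##Results_A a
	JOIN ##Results_B b
		ON b.[var] = a.[var]
		-- treat NULL as a category in its own right:
		AND (b.[value] = a.[value] OR (a.[value] IS NULL AND b.[value] IS NULL))
	GROUP BY a.[var]
	ORDER BY a.[var]
GO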

As for efficiency: in terms of ‘logical reads’ alone, the CROSS JOIN method is the worst and the UNPIVOT the best; but of course, the UNPIVOT doesn’t preserve NULLs. The loop method is in between the two; if there are appropriate indexes, then this will of course help. Having said that, (a) these queries aren’t generally something you run very often, and (b) all of the queries completed in < 10 milliseconds on my under-powered desktop PC.

The main reason I use the loop method, rather than CROSS JOIN, is that it enables me to ‘get in there’, and make small adjustments to the code as I find it necessary – e.g., there might be valid reasons for excluding particular values from an aggregate if they don’t exist in one or other of the datasets. It’s more difficult to do that when using UNPIVOT or CROSS JOIN.

Don’t use functions on the left-hand side of a condition in SQL

I’ve seen this pattern in production code several times within the past month, so thought I’d explain here why it’s a bad idea. Simply put, don’t write queries that look like this:

SELECT Field1, Field2, ...
FROM MyTable
WHERE SomeFunction(someColumn) = someValue

, i.e. with a function applied to a column on the left-hand side of the condition in the WHERE clause. In the recent examples I’ve seen, the WHERE clause usually contains something date-related like this:

WHERE YEAR(someDate) = 2020

or

WHERE YEAR(someDate) = 2020
AND MONTH(someDate) = 5

These queries always run worse than if the condition had been re-written so that the functions were on the right-hand side (or removed altogether). The reasoning is this:

  • When putting together an execution plan for the query, the SQL Engine wants to use appropriate indexes to get the best performance;
  • These indexes have been previously created using columns in the table;
  • BUT functions of the columns used in the index might not be ordered in the same way;
  • Because of this, the optimizer won’t use the index in the way it’s supposed to, and often the whole table is scanned — which could be very costly in terms of performance.

[There’s so much that could be written here about statistics, parsing, etc., but I’m trying to keep it simple]

This sort of condition (where there’s a function on the left-hand side) has a name: it’s called non-sargable. Conversely, “a condition (or predicate) in a query is said to be sargable if the DBMS engine can take advantage of an index to speed up the execution of the query.” (See Wikipedia)

Here’s a demonstration. I’ve got a table where each row just contains an ID and a date:

CREATE TABLE dbo.DateTest
(
	ID INT NOT NULL PRIMARY KEY CLUSTERED
	,[OurDate] DATETIME NOT NULL
)
GO

I’ve filled it with one million rows: ID = 1 to 1000000, and OurDate is a random date (down to a resolution of 1 second) between the years 2013 and 2020.
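
The population code isn’t really the point here, but for completeness, here’s roughly how it can be done (a sketch; ABS(CHECKSUM(NEWID())) gives us a cheap random integer per row):

;WITH cte_Numbers AS
(
	SELECT TOP (1000000)
			n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
		FROM sys.all_objects a
		CROSS JOIN sys.all_objects b
)
INSERT dbo.DateTest
(
	ID
	,OurDate
)
	SELECT
			n
			-- A random number of seconds (roughly 8 years' worth) added to the start date:
			,DATEADD(SECOND, ABS(CHECKSUM(NEWID())) % (8 * 365 * 24 * 60 * 60), '2013Jan01')
		FROM cte_Numbers
GO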

Of course, we want an index on our datetime column:

CREATE INDEX IX_DateTest_OurDate ON dbo.DateTest(OurDate)

Now let’s run a couple of queries at the same time (‘in the same batch’). The first is ‘non-sargable’, because it uses the YEAR function on the left-hand side of the equals. In the second query, we’ve written it to be ‘sargable’ — but it’s the exact same logic, and should return the same answer.

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE YEAR(OurDate) >= 2019

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE OurDate >= '2019Jan01'

They do both return the same answer: 180348. But their execution plans are different (look at the bits highlighted in yellow):

Execution plan showing the second query is better

The first query scanned (i.e. read through) the whole index, and the query cost was 83% of the batch, compared to the second query which used the index to ‘seek’ the correct data, and was only 17% of the batch.

[Wait, I hear you say — the first query didn’t scan the whole table, it used the index, so that’s ok? Sadly not; it only used the index because our example is so simple that it can calculate the COUNT(1) result from the index alone. In most real-world examples, it’ll scan the table (or the clustered index, if the table has one).]

Look at the two sets of query statistics:

Table 'DateTest'. Scan count 1, logical reads 2236, ...

SQL Server Execution Times:
   CPU time = 78 ms,  elapsed time = 126 ms.

Table 'DateTest'. Scan count 1, logical reads 407, ...

SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 148 ms.

The first (non-sargable) query did more than 5 times as many logical reads as the second (sargable) query, and took nearly 5 times as much CPU time. (We’ll ignore the elapsed time, it’s not relevant here.)

The difference is even greater if we restrict our query to a single month:

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE YEAR(OurDate) = 2018
	AND MONTH(OurDate) = 5

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE OurDate >= '2018May01'
	AND OurDate < '2018Jun01'

We get the same execution plans as before, except this time the query costs were even further apart: 98% and 2%! (Compared to 83% and 17% previously.) What about the statistics?

Table 'DateTest'. Scan count 1, logical reads 2236, physical reads 0, ...

SQL Server Execution Times:
   CPU time = 109 ms,  elapsed time = 113 ms.

Table 'DateTest'. Scan count 1, logical reads 34, physical reads 0, ... 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 46 ms.

This time, the first (non-sargable) query did 65 times as many logical reads as the second (sargable) query, and took a lot more CPU time, comparatively. Clearly, it’s worthwhile to ensure all your queries are ‘sargable’.

A final note: In the version of SQL Server I’m using (2017), the optimizer can’t even cope with conditions that look easily fixable; e.g. a WHERE clause like this:

WHERE ID >= 100

uses an index seek, but this:

WHERE ID+1 >= 101

uses a table scan! Hopefully, future versions of SQL will be able to spot and re-arrange obvious cases like this.

Creating extra value from string data

In the previous post, I referred to a sales dataset that I’m currently working on. It has plenty of records (hundreds of thousands of rows), but not many variables (columns / fields). In order to build a good predictive model over a dataset that has a limited number of variables, sometimes you have to get creative; here, I’m going to describe just one of the ways I’ve “added value” (as they say) to the data.

In our dataset, we have addresses: Lines 1-5 and Postcode. Usually, what we do with address data is match it to other richer datasets by postcode — for example, you can buy sociodemographic data that will tell you about the composition of the population in that postcode: what the average earnings are, what proportion of people read a newspaper, all manner of interesting data points. In our case, there are no other datasets, so we have to make do with what we have.

With a bit of work (and cribbing from wikipedia), we can easily map the postcodes to regions and countries, which are always useful variables. And, as per the previous post, we can calculate distances to (say) major towns or distribution hubs. But what else?

Well, we can actually extract a fair bit of value from the address string data itself. I know from previous experience that the number of the house/building in the address can be predictive; from a consumer lending dataset a few years ago, it turned out that lower house numbers had better payback rates. (Maybe streets in well-off areas have fewer houses, which means that a house number of ‘765’ has to be in a less well-off area.) But what about the words themselves? As you’d expect, there are many common words in the first lines of addresses: Road, Street, Flat, Avenue, Cottage, Close, Estate, Farm, etc. We took the top 500 most popular words, and added a series of binary variables to our dataset, corresponding to whether the address contained that particular word or not. We can then use standard modelling techniques to refine the variables, and work out which words (or groupings of words) are significant for our model.
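
As a sketch of that first step (the table and column names here are made up, and STRING_SPLIT needs SQL Server 2016 or later), getting the candidate words is just a split and a count:

SELECT TOP (500)
		Word	= w.[value]
		,N		= COUNT(1)
	FROM dbo.Addresses a
	CROSS APPLY STRING_SPLIT(a.AddressLine1, ' ') w
	WHERE w.[value] <> ''
	GROUP BY w.[value]
	ORDER BY COUNT(1) DESC
GO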

One thing though: we want to capture as much information from the address words as we can, and we’re in the fortunate situation of having lots of data, vertically (that is: lots of rows). So we can split our data not just into the standard train/test, but we can further split the training data, giving us a validation dataset. [Actually, we’re using the following split: 60% training, 20% validation, 20% test.] With this extra dataset, we can check that a word (variable) is predictive in both the train and validation datasets before we add it to our pool of variables available to the model.

How do we test whether the variable is predictive in both datasets? We could look at the IV (Information Value), or we could model the variable and look at the increase in Gini. Both are perfectly valid, but I found that I ended up with a larger pool of variables than I was expecting. (There’s a hard time constraint on this piece of work, so I can’t let the modelling procedure take all the time it likes when it comes to selecting variables.) So in order to cut the pool down, I did this:

  1. In the training dataset, cross-tab the variable of interest with the dependent (outcome) variable, and calculate the odds ratio (O.R.) and its 95% confidence interval (C.I.).
  2. Do exactly the same, but using the validation dataset.
  3. If, for each variable:
    • (a) Neither the 95% confidence interval in the training dataset nor the one in the validation dataset contains 1 (i.e. we think there is a significant association in both)
    • And (b) the training O.R. is contained within the validation C.I., and the validation O.R. is contained within the training C.I. (i.e. the variable is not significantly different between datasets)

    , then keep the variable. Otherwise, discard it.

We end up with a far smaller, but significant, selection of variables to let the step-wise modelling process choose from.

Here’s the R function I wrote to obtain the odds ratio and confidence interval from a 2×2 table:

table_OddsRatio <- function(tbl, ci_pct=0.95)
{
	odds_ratio <- (tbl[1,1]*tbl[2,2])/(tbl[1,2]*tbl[2,1]);
	odds_ratio_SE <- sqrt(sum(1/as.vector(tbl))); # Careful, this is SE(log(OR))
	
	value <- qnorm(1.0 - ((1.0-ci_pct)/2));
	
	return(c( OR=odds_ratio,
		CI_lower=exp(log(odds_ratio)-value*odds_ratio_SE),
		CI_upper=exp(log(odds_ratio)+value*odds_ratio_SE))
	);
}

Simple example of its use (numbers taken from sphweb.bumc.bu.edu):

HTN_CVD_tbl <- as.table(matrix(c(992,2260,165,1017),byrow=T,nrow=2));
table_OddsRatio(HTN_CVD_tbl);
      OR CI_lower CI_upper 
2.705455 2.258339 3.241093 

For our data, we just used this function, in conjunction with sapply(), over all the appropriate binary variables in our two data frames (datasets).

In fact, without giving too much away, we didn’t just use this technique for the address data, we had other strings in our dataset we could usefully apply the same process to. One of the reasons I enjoy building models and scorecards so much is that it’s quite a creative endeavour, and you learn something new each time you do it!

Distances in SQL with geography datatypes

I’m currently building a model over some sales data that has plenty of records, but doesn’t have very many variables (or ‘data points’) — in fact, fewer than 20. Compare that with a credit risk model, which might have hundreds or even thousands of variables. One of the variables we have for each record is the postcode; so to try and augment the dataset, I’m adding in distances from the postcode to some key locations (e.g. distribution centres).

It’s a slightly wishy-washy term, but some data science folk would refer to this process as feature engineering: basically, working with the data that you have in order to ‘add value’; the term also means manipulating the data to make it more palatable to whatever system you’re using to model the data.

Getting hold of comprehensive and reliable address-related data in the UK is, to be honest, a challenge. Bits of it are free, bits of it aren’t; there are dozens of companies that will provide various cuts of the data, but I’ve yet to find one company that provides ALL of it.

Luckily, for what we want to do, there’s a free dataset from FreeMapTools that gives us the latitude and longitude for each postcode. So we downloaded the file, imported the CSV file from ukpostcodes.zip into SQL Server, and now we can look at getting our distances.

In order to demonstrate how we calculate our distances, here’s some SQL that populates a table with 7 records, each with lat/long co-ordinates. The first 3 were taken from the FreeMapTools dataset above; the other 4 were found on google, hence the difference in the number of digits. Then, we’re going to CROSS JOIN the table onto itself, to find out the distances between the 7 points.

IF OBJECT_ID('tempdb.dbo.#PlacesOfInterest', 'U') IS NOT NULL
BEGIN
	DROP TABLE #PlacesOfInterest
END
GO

CREATE TABLE #PlacesOfInterest
(
	ID INT NOT NULL
	,Place VARCHAR(50) NOT NULL
	,Postcode VARCHAR(10) NULL
	,Latitude DECIMAL(20,15) NOT NULL
	,Longitude DECIMAL(20,15) NOT NULL
	,Point AS geography::Point(Latitude,Longitude, 4326)
)
GO

INSERT #PlacesOfInterest
(
	ID 
	,Place
	,Postcode
	,Latitude
	,Longitude
)
	VALUES
	(1, 'John O''Groats', 'KW1 4YT', 58.635681924450900, -3.061496401309780)
	,(2, 'Houses Of Parliament', 'SW1A 0AA', 51.499838984969400, -0.124662727958281)
	,(3, 'Land''s End', 'TR19 7AA', 50.065854402527900, -5.713094775087470)
	,(4, 'Centre of London', 'WC2N 4JJ',51.509172, -0.126773)
	,(5, 'Centre of UK', NULL, 52.561928, -1.464854)
	,(6, 'Edinburgh Castle', NULL, 55.948612, -3.200833)
	,(7, 'Cardiff', NULL, 51.481583, -3.179090)
GO

SELECT
		[Place #1] = p1.Place
		,[Place #2] = p2.Place
		,[Distance in km] = CAST(ROUND(p1.Point.STDistance(p2.Point) / 1000.0, 2) AS DECIMAL(18,2))
	FROM #PlacesOfInterest p1
	CROSS JOIN #PlacesOfInterest p2
	WHERE p1.id < p2.id
	ORDER BY p1.id, p2.id
GO

If we run the code, we get:

Place #1              Place #2              Distance in km
--------------------- --------------------- ---------------
John O'Groats         Houses Of Parliament  816.05
John O'Groats         Land's End            969.17
John O'Groats         Centre of London      815.00
John O'Groats         Centre of UK          683.61
John O'Groats         Edinburgh Castle      299.36
John O'Groats         Cardiff               796.46
Houses Of Parliament  Land's End            425.04
Houses Of Parliament  Centre of London      1.05
Houses Of Parliament  Centre of UK          149.75
Houses Of Parliament  Edinburgh Castle      535.03
Houses Of Parliament  Cardiff               212.14
Land's End            Centre of London      425.26
Land's End            Centre of UK          405.89
Land's End            Edinburgh Castle      675.92
Land's End            Cardiff               238.20
Centre of London      Centre of UK          148.83
Centre of London      Edinburgh Castle      534.01
Centre of London      Cardiff               211.98
Centre of UK          Edinburgh Castle      393.55
Centre of UK          Cardiff               168.21
Edinburgh Castle      Cardiff               497.18

(Of course, these distances are ‘as the crow flies’; the road distance on google maps should be greater!)

Hopefully you can see where the magic happens:

  • In the CREATE TABLE code, I’ve declared Point to be a computed column of type geography::Point that automatically creates a value from the latitude and longitude. See here for more information. (The parameter 4326 is the SRID, the Spatial Reference ID, that tells SQL Server we’re using a standard globe, and not some other fancy projection.)
  • In the final query, the function STDistance calculates the distance in metres from the first point to the second. Note that STDistance isn’t a standard-looking SQL function – in object-oriented terminology, it’s a method (or member function) of the Point object.

Prior to the introduction of geography datatypes in SQL Server, we’d have had to calculate the distances from the lat/long using the Haversine formula, but as you can see, it’s much easier now!
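
For comparison, here’s roughly what that would have looked like: a sketch of the Haversine (great-circle) version in plain T-SQL, using 6371km as the mean radius of the Earth. The answers won’t match STDistance to the metre, because the geography type uses a more accurate model of the Earth than a simple sphere.

SELECT
		[Place #1] = p1.Place
		,[Place #2] = p2.Place
		,[Haversine km] = CAST(ROUND(
			2 * 6371 * ASIN(SQRT(
				POWER(SIN(RADIANS(p2.Latitude - p1.Latitude) / 2), 2)
				+ COS(RADIANS(p1.Latitude)) * COS(RADIANS(p2.Latitude))
					* POWER(SIN(RADIANS(p2.Longitude - p1.Longitude) / 2), 2)
			)), 2) AS DECIMAL(18,2))
	FROM #PlacesOfInterest p1
	CROSS JOIN #PlacesOfInterest p2
	WHERE p1.ID < p2.ID
	ORDER BY p1.ID, p2.ID
GO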

A few points about the modelling side of things:

  • The dataset we’re using above – it’s free, so we have to consider that it might have errors in, might never be updated, and the website could disappear at any time.
  • Having said that, when it comes to building models, consistency is more important than accuracy. This might sound counter-intuitive, but if data* is consistent (even consistently wrong) between build and production, the modelling process will still work to an extent, and the model will be useful. (Although, hopefully it’s obvious that more accurate data will always be better.)
  • Pro-tip: in consumer lending, distances between locations can be nicely predictive; if you can, look at distances between home addresses (current and previous), work, and bank account branch.

* We’re talking about the independent variables, the X’s. Not the dependent variable, the Y, that has to be accurate.

SQL String Fun(ctions)

One of my favourite parts of SQL coding is messing around with parsing strings; there’s usually a fun puzzle to solve at the root of the problem. T-SQL (that is, Microsoft’s version of SQL) doesn’t have any particularly whizzy functions built-in — certainly nothing as powerful as regular expressions — which means that sometimes you have to work that bit harder to find the solution.

Below, I’ve listed a few of the functions and tricks I use.

Does a string contain only 0-9?

For a given string, does it contain only the digits 0,1,2,3,4,5,6,7,8 or 9? Notice we’re not asking “is the string a number?”. (There’s a small but important difference, as we’ll see.)

Now, SQL has a built-in function for determining whether a given string is a number: ISNUMERIC(). As you might expect, ISNUMERIC(‘1729’) is 1; conversely, ISNUMERIC(‘eggs’) is 0. Pretty straightforward.

If we just want to match the individual characters within a string, then we can use the LIKE operator; usually, with LIKE, we’re matching some text in a string (using the ‘%’ character as a wildcard), but you can also match character ranges: e.g. ‘[AEIOU]’ matches any vowel. Conversely, ‘[^AEIOU]’ will match any character that isn’t a vowel. (This syntax will be familiar to people who use regular expressions.)

Take a look at this SQL:

SELECT
	  id
	  ,Word
	  ,[Contains Only 0-9] = CASE WHEN x.Word NOT LIKE '%[^0-9]%' THEN 'Yes' ELSE 'No' END
	  ,[ISNUMERIC(Word)] =  CASE WHEN ISNUMERIC(x.Word) = 1 THEN 'Yes' ELSE 'No' END
	FROM (
	   VALUES
		 (1, 'ABCDEFG')
		,(2, 'abc123')
		,(3, '456DEF')
		,(4, '02345')
		,(5, '1234E56')
		,(6, '789D01')
	) x(id,Word)
	ORDER BY x.id
GO

Before running it, what do you think the output will be?

Here’s the actual output:

id   Word    Contains Only 0-9 ISNUMERIC(Word)
---- ------- ----------------- ---------------
1    ABCDEFG No                No
2    abc123  No                No
3    456DEF  No                No
4    02345   Yes               Yes
5    1234E56 No                Yes
6    789D01  No                Yes

For ids 1, 2 and 3 — they’re clearly not numeric (ok, ‘abc123’ and ‘456DEF’ are valid hexadecimal numbers, but we’re ignoring other bases here). id 4 is numeric: it contains only digits, but is zero-prefixed. But what about id 5? ‘1234E56’ is another way of writing 1234 x 10^56, so IS a valid number. And id 6? ‘789D01’ is the same idea: the ‘D’ is also accepted as an exponent marker (a hangover from double-precision notation), so ISNUMERIC() says yes.

So: ISNUMERIC() will identify valid numbers, but if we just want to match the digits 0-9, we need the construction NOT LIKE ‘%[^0-9]%’ – essentially, we’re matching strings that do not contain any non-digits. The ‘NOT LIKE inverse range‘ construction is useful in many situations.

De-duping multiple characters

(This is a trick I’ve seen several times over the years; here, I’ve appropriated the version from this stackoverflow post.)

In this example, we have text that has repeated spaces, and we want to reduce those dupe spaces down to one single space. Here’s a very neat way of doing it:

SELECT 
		x.Sentence
		,[Sentence w/o dupe spaces] = REPLACE(REPLACE(REPLACE(x.Sentence, ' ', '<>'), '><', ''), '<>', ' ')
	FROM (
		VALUES
			 ('This    sentence    has     repeated spaces')
			,('This   sentence also  contains         duplicate       spaces')
			,('This sentence is ok')
			,('Not-a-sentence')
	) x(Sentence)
GO

with output:

Sentence                                                      Sentence w/o dupe spaces
------------------------------------------------------------- ---------------------------------------------
This    sentence    has     repeated spaces                   This sentence has repeated spaces
This   sentence also  contains         duplicate       spaces This sentence also contains duplicate spaces
This sentence is ok                                           This sentence is ok
Not-a-sentence                                                Not-a-sentence

It’s a beautifully simple and clever trick, invaluable when it comes to cleaning data such as addresses.

Replacing multiple characters with TRANSLATE

We don’t have to do multiple calls to the REPLACE() function to replace multiple characters any more; since SQL Server 2017, we can use the TRANSLATE() function instead:

SELECT 
		x.SomeString
		,[Multi REPLACE] = REPLACE(REPLACE(REPLACE(x.SomeString, '.', '-'), '!', ' '), '_', '-')
		,[TRANSLATE] = TRANSLATE(x.SomeString, '.!_', '- -')
	FROM (
		VALUES
			('abc_def!ghi.jkl')
			,('.mno_pqr!stu.vwx')
			,('...!!!...x..!._')
	) x(SomeString)
GO

Here’s the output:

SomeString       Multi REPLACE    TRANSLATE
---------------- ---------------- ----------------
abc_def!ghi.jkl  abc-def ghi-jkl  abc-def ghi-jkl
.mno_pqr!stu.vwx -mno-pqr stu-vwx -mno-pqr stu-vwx
...!!!...x..!._  ---   ---x-- --  ---   ---x-- --

Hopefully, you can see how each character in the second parameter has been swapped for the appropriate character in the third parameter.

It’s not as flexible as REPLACE() in that we can only change a single character for another. But it’s great for things like changing brackets, e.g. SELECT TRANSLATE('[408] 555 6789', '[]', '()') changes the square brackets to parentheses to give (408) 555 6789.

Or how about this use for TRANSLATE — how many vowels are there in a string?

DECLARE @name VARCHAR(255)
SET @name = 'Alexander Boris de Pfeffel Johnson'
SELECT LEN(@name) - LEN(REPLACE(TRANSLATE(@name, 'aeiou', '#####'), '#', ''))

which returns the correct number, 11.

Zero-prefixing numbers

There’s always a need to pad numbers out to a fixed number of digits, e.g. SIC codes, UDPRNs. Here’s an example of the way we used to do it:

SELECT 
		x.number
		,[Prefixed number] = REPLICATE('0', 6-LOG10(x.number)) + CAST(x.number AS VARCHAR)
	FROM (
		VALUES
			(7),(45),(142),(9002),(61023),(987654)
	) x(number)
GO

which gives us:

number      Prefixed number
----------- ---------------
7           000007
45          000045
142         000142
9002        009002
61023       061023
987654      987654

It’s cute: LOG10() of our number (effectively) tells us the number of digits, and REPLICATE() gives us as many zeros as we need.

However, there’s a better way now: in the above query, FORMAT(x.number, ‘000000’) gives us the exact same result. It’s well worth having a play with the FORMAT() function; it’s very powerful — see here for more details.

(The date formatting is so useful: e.g. SELECT FORMAT(GETDATE(), ‘yyyy-MM-dd HH:mm:ss.fff tt’) gives us 2020-03-22 16:02:38.920 PM… if we really need to mix the 12 and 24 hour clock!)

Concatenating strings column-wise

There are multiple (well-known) ways of concatenating strings row-wise, e.g. now we even have a built-in function, STRING_AGG(); but what about column-wise?
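
(In case you’ve not come across it, here’s a quick row-wise example using STRING_AGG, which is available from SQL Server 2017, before we move on to the column-wise functions:)

SELECT Shopping = STRING_AGG(x.Word, ', ') WITHIN GROUP (ORDER BY x.Word)
	FROM (
		VALUES ('Eggs'),('Bread'),('Cheese')
	) x(Word)
GO

which returns the single string ‘Bread, Cheese, Eggs’.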

There are two functions you should know about: CONCAT() and CONCAT_WS():

SELECT
		x.word1
		,x.word2
		,x.word3
		,[Concatenate] = CONCAT(x.word1, x.word2, x.word3)
		,[Concatenate w/ sep.] = CONCAT_WS(' + ', x.word1, x.word2, x.word3)
	FROM (
		VALUES
			('First','Second','Third')
			,('Eggs','Bread','Cheese')
			,('Cat',NULL,'Dog')
	) x(word1,word2,word3)
GO

which gives the output:

word1 word2  word3  Concatenate       Concatenate w/ sep.
----- ------ ------ ----------------- -----------------------
First Second Third  FirstSecondThird  First + Second + Third
Eggs  Bread  Cheese EggsBreadCheese   Eggs + Bread + Cheese
Cat   NULL   Dog    CatDog            Cat + Dog

(The _WS means ‘with separator’.)

Two things to note: (1) the NULL in the third row – it’s just ignored, which is often exactly what you want, and why these functions exist, rather than just manually appending ISNULL(…) calls; and (2) the resulting string lengths are only as long as you need, not VARCHAR(MAX).

Last word in a string

Getting the first word from a string is relatively easy – what about the last word? For that, we can use the REVERSE function, which does exactly what you’d expect:

SELECT
		x.Sentence
		,FirstWord = LEFT(x.Sentence, CHARINDEX(' ', x.Sentence) - 1)
		,LastWord = RIGHT(x.Sentence, CHARINDEX(' ', REVERSE(x.Sentence)) - 1)
	FROM (
	VALUES
		('Lorem ipsum dolor sit amet, consectetur adipiscing elit')
		,('Nullam et sem et nisl fringilla mollis')
		,('Donec et commodo magna, nec mattis sapien')
		,('Maecenas in velit urna')
		,('Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae')
		,('First Second')
	) x(Sentence)
GO

which returns:

Sentence                                                                               FirstWord  LastWord
-------------------------------------------------------------------------------------- ---------- ----------
Lorem ipsum dolor sit amet, consectetur adipiscing elit                                Lorem      elit
Nullam et sem et nisl fringilla mollis                                                 Nullam     mollis
Donec et commodo magna, nec mattis sapien                                              Donec      sapien
Maecenas in velit urna                                                                 Maecenas   urna
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae Vestibulum Curae
First Second                                                                           First      Second
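
One caveat with the query above: if a string contains no spaces at all, CHARINDEX returns 0, and the LEFT/RIGHT calls will fail with an invalid length error. A simple workaround (a sketch, not part of the original query) is to append a space before searching:

SELECT
		x.Sentence
		,FirstWord = LEFT(x.Sentence, CHARINDEX(' ', x.Sentence + ' ') - 1)
		,LastWord = RIGHT(x.Sentence, CHARINDEX(' ', REVERSE(x.Sentence) + ' ') - 1)
	FROM (
		VALUES
			('Lorem ipsum dolor sit amet')
			,('SingleWord')	-- no spaces: both FirstWord and LastWord come back as 'SingleWord'
	) x(Sentence)
GO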

In summary

It’s something of a point of pride with me: I will try very hard to solve a problem in SQL first, as long as it’s not going to get silly*. Most common string-parsing problems can be solved in SQL with a bit of thought; if you know of any non-trivial string-parsing problems, please let me know!

* I’m sure I once saw that someone had written a SQL parser in SQL, which is very clever, but not something I’d ever consider attempting.



Bad architecture costs money

It’s pretty obvious: if your database is architected badly to begin with, it costs money to put right. If you’re lucky, missing data can be recreated; in many cases, it cannot. In this example, we were ok — we could infer the missing data in nearly every case, and the cost only amounted to a few weeks of my salary. But still, it was an expenditure that could’ve been avoided by having a five-minute conversation with the developer responsible for writing the code in the first place.

The scope: decisioning

In consumer lending, customers make applications, and their data is run through (one or more) decisioning mechanisms to determine whether to lend the customer money or not. I worked at a lender where the decisioning was initially done by a ruleset of some 20 or so rules. The rules were things like:

  • Has total outstanding loan balance of less than £5000
  • Has 3 or fewer defaults in the last 2 years
  • Has no defaults on any mortgage in the last 3 years
  • Is not currently insolvent
  • Is between the ages of 18 and 75

(The exact rules aren’t important, these are just examples.)

The customer’s data was processed (by some C# code) through each rule, and the output was stored like this:

CustomerID  RuleID RuleResult           RuleResultDate
      1234       1       Pass  2014-03-20 11:27:01.102
      1234       2       Pass  2014-03-20 11:27:01.231
      1234       3       Fail  2014-03-20 11:27:02.614
      ....
      5678       1       Pass  2014-03-20 13:19:58.788
      5678       2      Refer  2014-03-20 13:19:58.941
      5678       3       Pass  2014-03-20 13:20:00.023
      ....

We'll refer to this table of data as RuleResults.

After all the rules had been processed, a final record was put into another table, e.g.:

ApplicationID  CheckTypeID CheckResult
       876543            3        Fail
       876544            3       Refer
       ......

which we'll refer to as CheckResults. The rules around credit history were just one check among many -- other checks included bank account and card validation, fraud checks, and at least one affordability assessment.

If any CheckResult for a customer was a 'fail', we couldn't lend them any money. If no CheckResult was a fail, but there were one or more 'refers', then their application had to be sent to an underwriter who would decide what to do next. (Usually, we requested more information - proof of income, bank statements, etc.)

As you might expect, there was an Applications table that mapped ApplicationID to CustomerID:

ApplicationID CustomerID      ApplicationDate
       876543       1234  2014-03-20 11:23:37
       876544       5678  2014-03-20 11:24:19
       ......

This was the most important table in the database: everything to do with an application could be traced back to it.

For this particular lender, the same customers came back repeatedly (because they liked the product!), so most customers had many application records in the database.

The problem

The problem we had was that this architecture worked perfectly well for the 'live' decisioning of applications... but when it came to reporting and analysis, it wasn't good enough: how, further down the line, were we to connect the individual rule results to the overall performance of the loan? Most analyses were done at the application level, not the customer level - not least because a customer could only have one live loan with us at any time.

But our RuleResults table contains the CustomerID, not the ApplicationID. So presumably, to get the correct ApplicationID, we just use the one that was closest in time for that CustomerID, at the time the rules were run? In 90-95% of cases, yes, that works.
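
In code, that naive 'nearest application in time' match looks something like this (a sketch using the table names above, not the system's actual code):

SELECT
		rr.CustomerID
		,rr.RuleID
		,rr.RuleResult
		,rr.RuleResultDate
		,a.ApplicationID
	FROM RuleResults rr
	CROSS APPLY (
		-- For each rule result, pick the customer's application closest in time:
		SELECT TOP (1)
				app.ApplicationID
			FROM Applications app
			WHERE app.CustomerID = rr.CustomerID
			ORDER BY ABS(DATEDIFF(SECOND, app.ApplicationDate, rr.RuleResultDate))
	) a
GO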

However, it's a variation on the old 80/20 rule, i.e. “80% of the work takes 20% of the time, the remaining 20% of the work takes 80% of the time”, but in this case, more like 90/10. Always choosing the nearest ApplicationID doesn't work, because of issues like:

  1. The system or network is busy, so people hit refresh on the page, causing the rules engine to be run twice.
  2. Connections and/or processes dying for any number of reasons. (The website was in use 24 hours a day, but intensive processes were run between midnight and 7am, occasionally causing timeouts and locks.)
  3. Developers manually altering data to push applications through (legitimately: credit files often contain out-of-date information, e.g. claiming someone is still insolvent when the insolvency had already been discharged.)
  4. Developers doing testing and not removing test data.
  5. Application dates for future applications being back-dated (again, for legitimate process reasons).

(There were other problems; very occasionally, it was unclear what the cause was.)

Extra fun

And, as you might expect, there were more 'wrinkles' we had to consider:

  1. There often wasn't one set of rule results per application - applications could make multiple passes through the rules engine
  2. An application could pass through one of several sets of rules, depending on criteria that wasn't necessarily recorded at the time
  3. The rule sets changed over time - rules were added and removed
  4. Rules themselves were changed, e.g. "Has total outstanding loan balance of less than £5000" might become "... less than £4000". The definitions of the rules were stored in the database initially, but rarely updated to reflect changes. [We're going to ignore this from now on; matching rule results to the version of the rule that was run was a separate problem, outside the scope of this work]
  5. (and finally) The rules weren't always processed in the same order - largely they were, but it wasn't something that could be relied upon.

How to make sense out of all that?

The Solution

First, let's say what should have happened in the first place. It's fairly clear with hindsight: we needed an ID to group together the passes through the rules - one ID per pass, per application. That ID also needed to be linked to a RuleSet version, which in turn kept a history of the individual rule changes. So those IDs are what we aimed to recreate: 'how it should've been in the first place'.
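
As a rough illustration - the table and column names here are my own invention, not what was eventually implemented - the missing piece was something like this:

-- One row per pass through the rules engine, tied to the application and
-- to the version of the ruleset that was used for that pass:
CREATE TABLE dbo.RuleEnginePass
(
	RuleEnginePassID	INT IDENTITY(1,1) NOT NULL PRIMARY KEY
	,ApplicationID		INT NOT NULL	-- FK to Applications
	,RuleSetID			INT NOT NULL	-- FK to a versioned RuleSets table
	,PassStartDate		DATETIME NOT NULL
)
GO
-- ...with RuleResults then carrying RuleEnginePassID, rather than just CustomerID.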

With data-matching pieces of work like this, I find the best thing to do is go for the 'low-hanging fruit' first. Don't try and match everything in a single mega-query, but break the work up into steps: start with the easiest matches first, remove them from the dataset once matched; go to the next easiest, remove them once matched, and repeat until done. Often, the hardest matches are too fiddly to be done programmatically, but if you're lucky, they are so few that they can be done manually.

(Oh, and yes: keep a record of which particular step is responsible for finding the match - it's invaluable for debugging.)

The first thing we had to do was work out which rules were part of which rulesets, and for what time periods those rulesets were in use. This was a semi-manual process, as there was lots of overlap, e.g. in one time period, it was possible for an application to be run through 12 rules, 13 rules, or 25 rules - so if you saw an application had 25 rule results over 3 minutes, the count alone couldn't tell you if that was (a) two passes, 12 rules then 13 rules, or (b) one pass, 25 rules.

Our RuleSets table looked something like this:

RuleSetID NumRules  CheckSum   StartDate    EndDate
        1       12  34f97e35  2012-01-01 2015-12-31
        2       13  50ab383c  2012-01-01 2015-12-31
        3       25  7c2ffd41  2013-07-09 2015-12-30
      ... 

The CheckSum field was a function of the RuleIDs that formed the RuleSet; e.g.:
    CheckSum = x_1 * RuleID_1 + x_2 * RuleID_2 + ... + x_n * RuleID_n.
If the checksum for a set of rule results doesn't match a checksum in our RuleSets table, then it's not a match.
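
As an aside, here's one way a comparable checksum could be computed over a candidate group of rule results - a sketch using the built-in CHECKSUM_AGG rather than the weighted sum described above. Being an aggregate, it doesn't care what order the rules were processed in, which matters given wrinkle 5 earlier:

-- In practice, the GROUP BY would be per candidate set of results
-- (e.g. per customer per burst), not just per customer:
SELECT
		rr.CustomerID
		,NumRules = COUNT(DISTINCT rr.RuleID)
		,CandidateCheckSum = CHECKSUM_AGG(DISTINCT rr.RuleID)
	FROM RuleResults rr
	GROUP BY rr.CustomerID
GO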

The matching

Starting with the easiest first:

(1) A sizeable fraction of customers only ever had one application, and hence one pass through decisioning; easy to spot.

(2) Many customers had at most one application per day, and the dates of the rule results tallied with this.

Yes, (1) is a subset of (2) - but (1) is a less expensive query, and removes lots of data from the 'pool of things to match'; so it was quicker to do (1) then (2), than just (2) alone.

(3) The next part is an iterative loop:

  • For each CustomerID*, partition all the rule results using a Gaps and Islands procedure (see the sketch after this list). If:
    1. the previous set of results ended more than S seconds ago
    2. the time taken from the first to last rule result was less than T seconds
    3. the next set starts more than V seconds from now
    4. and the CheckSum for this set matches one of our RuleSets

    then treat this as a match.

  • Repeat, modifying S, T and V as required until no more matches are found.

*Actually, we had some efficiency gains by partitioning 'by CustomerID by day' first, and then making another pass 'by CustomerID' alone. Some sets of results straddled midnight, so they had to be taken into account.
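
Here's the flavour of that gaps-and-islands partitioning - a simplified sketch with assumed thresholds, applying only the 'gap before' test; the real procedure also applied the duration and 'gap after' checks, plus the checksum test, to each island:

DECLARE @GapSeconds INT = 30	-- example threshold (the 'S' above)

;WITH cte_Flagged AS
(
	-- Flag the start of each 'island': the first result for a customer, or any
	-- result arriving more than @GapSeconds after the previous one:
	SELECT
			rr.CustomerID
			,rr.RuleID
			,rr.RuleResultDate
			,IsNewIsland = CASE
				WHEN LAG(rr.RuleResultDate) OVER (PARTITION BY rr.CustomerID ORDER BY rr.RuleResultDate) IS NULL
					OR DATEDIFF(SECOND,
							LAG(rr.RuleResultDate) OVER (PARTITION BY rr.CustomerID ORDER BY rr.RuleResultDate),
							rr.RuleResultDate) > @GapSeconds
					THEN 1 ELSE 0
				END
		FROM RuleResults rr
)
	-- A running total of the flags gives each island its own ID per customer:
	SELECT
			f.CustomerID
			,f.RuleID
			,f.RuleResultDate
			,IslandID = SUM(f.IsNewIsland) OVER (PARTITION BY f.CustomerID
													ORDER BY f.RuleResultDate
													ROWS UNBOUNDED PRECEDING)
		FROM cte_Flagged f
GO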

(4) Penultimately, there were those sets of results where the code had been run more than once at the same time - the results for each pass were interleaved. We partitioned the data by the order in which each rule was seen, and hoped the partitions provided a match - which, by and large, they did.
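
The idea, roughly (again a sketch rather than the real code): if each rule appears N times within one burst of results, occurrence 1 of every rule is taken to belong to pass 1, occurrence 2 to pass 2, and so on:

-- In practice, this would be applied within a single burst/island of results,
-- not across a customer's whole history:
SELECT
		rr.CustomerID
		,rr.RuleID
		,rr.RuleResultDate
		,PassNumber = ROW_NUMBER() OVER (PARTITION BY rr.CustomerID, rr.RuleID
											ORDER BY rr.RuleResultDate)
	FROM RuleResults rr
GO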

(5) Finally, what was left at the end of the process was a small handful of records that were then either (a) matched up by eye, or (b) written off as unknown system errors.

In the end, we were extremely happy with the results, and our analysis of the effectiveness of the rules could begin!



De-duping within column data

This was some work I had to do recently: the fields (columns) within a dataset were themselves delimiter-separated lists of individual pieces of data, but due to the way the data was gathered, there could be duplicates within each field. We needed to return a dataset that had each piece of data only once per field. Thanks to the STRING_SPLIT and STRING_AGG functions (available natively from SQL Server 2016 and 2017 respectively), this is pretty straightforward to do.

(Of course, we could’ve still done this work before these functions were natively available in T-SQL. There are many, many examples on, e.g., stackoverflow of SQL functions to split strings, and the converse, to concatenate them back together again.)

An example

First, some dummy data:

IF OBJECT_ID('tempdb.dbo.#Data','U') IS NOT NULL
BEGIN
	DROP TABLE #Data
END
GO

CREATE TABLE #Data
(
	OrganisationID INT NOT NULL PRIMARY KEY CLUSTERED
	,SitePostcodes VARCHAR(255) NOT NULL
	,PhoneNumbers VARCHAR(255) NOT NULL
	,Managers VARCHAR(255) NOT NULL
)
GO

INSERT #Data
(
	OrganisationID
	,SitePostcodes
	,PhoneNumbers
	,Managers
)
VALUES
	(1, 'AB1 2CD,BC3 4DE,CD5 6EF', '0401 3200001,0555 012345','Alan Box,Carrie Dale,Erin Fordham,Graham Hall') -- no repeats
	,(2, 'DE7 8FG,EF9 0GH,DE7 8FG', '0333 123456,0333 123456,0333 123456','Ian Jenkins,Ken Liss,Mavis Norman,Ian Jenkins') -- some repeats
	,(3, 'XX1 2XX,XX1 2XX,XX1 2XX,XX1 2XX', '07111345678,07111345678,07111345678','Orlando Prescott,Orlando Prescott,Orlando Prescott,Orlando Prescott') -- all repeats
GO
-- Let's have a look at our dummy data:
SELECT * FROM #Data
GO

OrganisationID SitePostcodes                    PhoneNumbers                        Managers
-------------- -------------------------------- ----------------------------------- -------------------------------------------------------------------
1              AB1 2CD,BC3 4DE,CD5 6EF          0401 3200001,0555 012345            Alan Box,Carrie Dale,Erin Fordham,Graham Hall
2              DE7 8FG,EF9 0GH,DE7 8FG          0333 123456,0333 123456,0333 123456 Ian Jenkins,Ken Liss,Mavis Norman,Ian Jenkins
3              XX1 2XX,XX1 2XX,XX1 2XX,XX1 2XX  07111345678,07111345678,07111345678 Orlando Prescott,Orlando Prescott,Orlando Prescott,Orlando Prescott

Organisation with ID 1 has no dupes; ID 2 has some, and ID 3 has only dupes in each field.

Here’s my solution to de-duping the fields (it might look like a lot of code, but it’s really straightforward, and only took a few minutes to write):

;WITH cte_Data AS
(
	-- Unpivot our columns into rows:
	SELECT
			OrganisationID
			,[Field]
			,[FieldValue]
		FROM #Data
		UNPIVOT 
		(
			[FieldValue] FOR [Field] IN ([SitePostcodes],[PhoneNumbers],[Managers])
		) u

), cte_Split AS
(
	-- Split the [FieldValue] string by commas:
	SELECT
			d.OrganisationID
			,d.[Field]
			,ss.[value] AS FieldPart
		FROM cte_Data d
		CROSS APPLY STRING_SPLIT(d.[FieldValue], ',') ss

), cte_Dedupe AS
(
	-- Aggregate the data; the GROUP BY effectively
	-- does the de-dupe, removing repeated elements:
	SELECT
			OrganisationID
			,Field
			,FieldPart
		FROM cte_Split 
		GROUP BY OrganisationID, Field, FieldPart

), cte_ReaggregateFieldValues AS
(
	-- Re-aggregate the individual parts of each field,
	-- concatenating the strings with a comma:
	SELECT
			OrganisationID
			,Field
			,STRING_AGG(FieldPart, ',') AS FieldValue_Deduped
		FROM cte_Dedupe
		GROUP BY OrganisationID, Field
)
	-- PIVOT the fields up from rows to columns, and we're done:
	SELECT
			OrganisationID
			,[SitePostcodes]
			,[PhoneNumbers]
			,[Managers]
		FROM cte_ReaggregateFieldValues
		PIVOT
		(
			MIN(FieldValue_Deduped) FOR [Field] IN ([SitePostcodes],[PhoneNumbers],[Managers])
		) p
GO

When we run the above code, we get:

OrganisationID SitePostcodes           PhoneNumbers             Managers
-------------- ----------------------- ------------------------ ---------------------------------------------
1              AB1 2CD,BC3 4DE,CD5 6EF 0401 3200001,0555 012345 Alan Box,Carrie Dale,Erin Fordham,Graham Hall
2              DE7 8FG,EF9 0GH         0333 123456              Ian Jenkins,Ken Liss,Mavis Norman
3              XX1 2XX                 07111345678              Orlando Prescott

If you compare with the original dataset, you can see that the de-duplication has worked.

(In our real-life piece of work, we were running this code over ~2 million rows of data, with 7 different columns that needed deduping. The code ran in about a minute, which was fine — if it had been very slow, I’d have broken the 5 chunks of code up and used temp tables with appropriate indexes.)

As you can see, I’ve organised the code using CTEs, so that it reads from top-to-bottom, rather than using the typical nested SQL that is read from the inside out. It’s a matter of taste, but I find it much easier to read code when it’s written in order like this, plus it allows you to break the code naturally into logical chunks.

Lastly, a minor grumble: STRING_SPLIT only returns the pieces of the string between the delimiters, and doesn’t give you the ordinal position; I often need to know the order within the string, so I’ll have to keep my custom string-split functions to hand.
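
As an aside, if you’re on SQL Server 2022 or Azure SQL Database, STRING_SPLIT now accepts an optional third argument, enable_ordinal, which adds an ordinal column to the output - for example:

SELECT [value], ordinal
	FROM STRING_SPLIT('AB1 2CD,BC3 4DE,CD5 6EF', ',', 1)
GO

On older versions, though, a custom splitter is still the way to go.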
