Bad numbers days

Here's a scenario that most BI people will be familiar with: it's 10:30am, and your boss comes up to you: "Usually we've had X applications* by this time in the morning, but today we've only had Y! What's going on?", where of course Y is quite a bit smaller than X. What to do? Why are the figures as low as they're being reported?

* For ‘applications’, read ‘events’, whatever’s applicable to you

Okay, some part of 'the process' could be broken — but you've probably already got monitoring in place to catch things like services falling over, or external providers going down. There are a couple of statistical possibilities — for one, maybe it's just a matter of regression toward the mean. In other words, you'll see the number of applications (events) you were expecting; it's just that randomness has put fewer of them in the morning, and more in the afternoon. (This has happened to me too often: I spend my lunch break looking for the problem, only for the numbers to gradually return to normal by early afternoon, and over half the day has been wasted.)

The other possibility: it’s a ‘bad numbers day’, statistically speaking. But how often do they actually occur?

The next bad day

We're really interested in answering the question, "When can we expect our next bad day?", but obviously you can never say with any certainty when it's going to occur. However, we can simulate a large number of hypothetical histories and see where, on average, those bad days fall.

Say the number of applications ('events' from here on) we see in a day is normally distributed with a mean of 100 and a standard deviation of 10, rounded to the nearest integer. Here's a histogram of 10,000 random samples from N(\mu=100,\sigma=10). The distribution looks pretty reasonable:

All the samples are independent — that is, the number of events on a particular day is not correlated with the number of events on the next day, or any other day.

(If I were a proper statistician, I'd probably be using a Poisson distribution, or even a non-homogeneous Poisson process, as the events don't occur with the same probability throughout the day. For illustrative purposes, a rounded Normal is fine.)

Let v_i represent the number of events on the ith day, let m_k be the kth rolling minimum ('instance'), with m_0 our starting minimum, and let d_k be the day on which the kth minimum occurs. For example, if our initial minimum is m_0 = 100 (our mean value) and our first 10 values of v_i are {90, 91, 85, 87, 89, 90, 95, 84, 84, 86}, then we have k = 3 minima {90, 85, 84}, occurring on days d_1 = 1, d_2 = 3, d_3 = 8. In other words, day 1 is our worst ever day; the next worst ever day occurs on day 3; and the one after that occurs on day 8.

We're going to repeat this process 10,000 times, and look at a few different schemes, not just 'worst ever' (i.e. 'worst so far'). Starting with m_0 = 100, we'll record the days d_k on which the number of events is:

  • A : less than the current minimum (d_k = i and m_k = v_i if v_i < m_{k-1})
  • B : less than or equal to the current minimum (d_k = i and m_k = v_i if v_i \leq m_{k-1})
  • C : less than the current minimum plus 1% (d_k = i and m_k = v_i if v_i < m_{k-1} \times 1.01)
  • D : less than the current minimum plus 10% (d_k = i and m_k = v_i if v_i < m_{k-1} \times 1.10)
  • E : less than the current minimum plus 50% (d_k = i and m_k = v_i if v_i < m_{k-1} \times 1.50)

We'll restrict ourselves to the first 8 instances (the days on which these minima occur); it'll save on processing, plus that's all we need to show the trend.

I’ve uploaded my R code here, but unfortunately it’s a .docx file, as WordPress won’t allow you to upload .R or .txt files, for ‘security’ reasons.
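For anyone who'd rather experiment directly in T-SQL, here's a rough sketch of scheme A for a single simulated history (a quick illustration, not a translation of the R code): generate 1,000 days of rounded N(100, 10) counts via the Box-Muller transform, then flag each day that falls strictly below the running minimum.

DROP TABLE IF EXISTS #Days
GO

CREATE TABLE #Days (DayNo INT NOT NULL PRIMARY KEY, Events INT NOT NULL)
GO

DECLARE @i INT = 1, @u1 FLOAT, @u2 FLOAT
WHILE @i <= 1000
BEGIN
	SET @u1 = RAND(CHECKSUM(NEWID()))
	SET @u2 = RAND(CHECKSUM(NEWID()))
	-- Box-Muller: two uniforms -> one standard normal, scaled and shifted to N(100, 10)
	INSERT #Days (DayNo, Events)
		VALUES (@i, ROUND(100 + 10 * SQRT(-2 * LOG(1.0 - @u1)) * COS(2 * PI() * @u2), 0))
	SET @i += 1
END
GO

-- Scheme A: day i is a new 'worst ever' if v_i is strictly below the running
-- minimum of all previous days, seeded with m_0 = 100:
SELECT DayNo, Events
	FROM (
		SELECT
			DayNo
			,Events
			,PrevMin = MIN(Events) OVER (ORDER BY DayNo ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
		FROM #Days
	) x
	WHERE Events < IIF(PrevMin < 100, PrevMin, 100)	-- IIF returns 100 when PrevMin is NULL (day 1) or >= 100
	ORDER BY DayNo
GO

Repeating that batch many times and recording the first few flagged DayNo values per run is all the simulation does.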


Results

NB: Because the distributions are asymmetrical and have long upper tails, we’ll report the median, rather than the mean.

Scheme A

The results for our first scheme (A) are below:

Scheme   1st day   2nd day   3rd day   4th day   5th day   6th day   7th day   8th day
A              2         6        17        51       159       470      1309      3148

This says that, for scheme A, the median day of the 1st 'worst day ever' across the 10,000 simulations was day 2; the median for the next (2nd) 'worst day ever' was day 6, for the 3rd it was day 17, etc. Early on, you encounter lots of 'worst days ever', but then they stretch out – there's nearly a year between the 5th and 6th 'worst ever' days, and then over two years until the next one. I'd say this feels about right. (NB: Except for the last one, the ratio between one 'worst day' and the next is approximately 3.) Below are graphs showing distributions for each 'kth worst day':

NB: For k > 1, to make the shape clearer, the graph only shows the bottom 80% of the distribution.

Schemes B, C and D

As you would expect, with the 'looser' schemes, the median bad days happen sooner:

Scheme   1st day   2nd day   3rd day   4th day   5th day   6th day   7th day   8th day
B              1         5        13        34        85       200       458       971
C              1         3         7        12        20        32        50        79
D              1         2         4         6         9        13        18        23

Clearly, the ratios between one instance and the previous get smaller, as the criteria are loosened.

Scheme B
Scheme C
Scheme D

Scheme E

The results for scheme E are:

Scheme   1st day   2nd day   3rd day   4th day   5th day   6th day   7th day   8th day
E              1         2         3         4         5         6         7         8

These figures really just show that the range within which the next sample counts as a 'minimum' is so large that it's unsurprising we keep flagging days as bad. For the first instance (with my particular simulated results), 100% are on day 1; for the 8th instance, 94.2% are on day 8.


End notes

This blog post is the result of an idle thought, after experiencing a 'bad numbers day' at work; my gut feeling was that these days were probably quite frequent, but it looks like they're even more frequent than common sense might lead you to believe.

(If I had more time to spend on this, I'd look at what the long-term ratios between successive bad days (d_k / d_{k-1}) are for these schemes and others — as mentioned above, for Scheme A it's about 3, and it's clearly lower for schemes B to D.)

Of course, the numbers above are purely hypothetical — the real numbers for your situation will likely fluctuate far more on a day-to-day, week-to-week, and month-to-month basis (not to mention intra-day).

In a nutshell: if your numbers are looking low compared to yesterday, don’t panic and assume something’s broken — it might just be probability doing its thing.


DATETIMEOFFSET in T-SQL

In my experience, most developers will shudder when you mention DATETIMEOFFSETs to them. I’ve just started working on an application where (most of) the date/time data is stored as DATETIMEOFFSET, and it’s been so long since I’ve had to deal with them, I’ve forgotten most of what I knew! Hence, this blog post to remind myself of the key points, and hopefully it’ll be of use to someone else. (This isn’t supposed to be an exhaustive look at everything to do with DATETIMEOFFSET, just some pointers to the issues I’ve encountered.)

The basics: T-SQL (the variant of SQL used by Microsoft SQL Server) has an extra date/time data type, DATETIMEOFFSET, that stores an offset relating to a timezone. For example, you could represent the time 07:30 (using the 24-hour clock) in Algeria on the 26th July as:

2021-07-26 07:30:00.000 +01:00

And 12:00 (12pm midday) in India as:

2021-07-26 12:00:00.000 +05:30

The datatype stores an offset in minutes; in the string representations shown above, that’s the +01:00 for Algeria, and +05:30 for India.

Coordinated Universal Time, known as UTC, is the ‘base’ timezone, for which the offset is zero minutes, +00:00. In some senses, it’s a successor to Greenwich Mean Time, aka GMT. For our purposes, they’re identical, and interchangeable.

We can see what a DATETIMEOFFSET looks like converted to UTC, using the AT TIME ZONE syntax:

DECLARE @dto_Algeria DATETIMEOFFSET = '2021-07-26 07:30:00.000 +01:00'
DECLARE @dto_India DATETIMEOFFSET   = '2021-07-26 12:00:00.000 +05:30'

SELECT @dto_Algeria AT TIME ZONE 'UTC'
SELECT @dto_India AT TIME ZONE 'UTC'

2021-07-26 06:30:00.0000000 +00:00
2021-07-26 06:30:00.0000000 +00:00

(For all the possible timezones, see e.g. Get a List of Supported Time Zones in SQL Server (T-SQL).)

The times are the same in UTC — if something happens at 7:30am local time in Algeria, and 12pm local time in India, it’s happening simultaneously. We can use the standard DATENAME and DATEPART functions to get the offsets, both the string representation and the value in minutes:

SELECT DATENAME(tzoffset, @dto_Algeria), DATEPART(tzoffset, @dto_Algeria)
SELECT DATENAME(tzoffset, @dto_India), DATEPART(tzoffset, @dto_India)

+01:00    60
+05:30   330

All the usual date/time functions (DATENAME, DATEPART, DATEADD, DATEDIFF) work with DATETIMEOFFSET in the same way they do with the DATETIME and DATETIME2 datatypes, although you might need to be careful when doing a DATEDIFF between DATETIMEOFFSETs with different offsets. So far, so straightforward — why would anyone have issues working with DATETIMEOFFSETs?
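For instance, DATEDIFF works on the underlying instants rather than the local clock faces, so the two values above (re-declared here so the snippet stands alone) are zero minutes apart:

DECLARE @dto_Algeria DATETIMEOFFSET = '2021-07-26 07:30:00.000 +01:00'
DECLARE @dto_India DATETIMEOFFSET   = '2021-07-26 12:00:00.000 +05:30'

-- Both values are 06:30 UTC, so the difference is 0 minutes,
-- even though the local clock times are 4.5 hours apart:
SELECT DATEDIFF(minute, @dto_Algeria, @dto_India)	-- 0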

The problems

1. Nomenclature

People mix up offset and timezone all the time, but they aren't synonyms.

SQL Server uses the timezone — e.g. ‘UTC’, ‘Central European Standard Time’, ‘India Standard Time’ — to determine the offset: +00:00, +01:00, +05:30. You can’t get back to the timezone by just knowing the offset. The DATETIMEOFFSET datatype doesn’t store the timezone, just the offset.

Annoyingly, to get the offset using DATENAME/DATEPART, SQL Server lets you abbreviate tzoffset to tz, which leads people to refer to timezone and offset interchangeably.
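For example, both of these return the same value; 'tz' is just shorthand for 'tzoffset':

DECLARE @dto_India DATETIMEOFFSET = '2021-07-26 12:00:00.000 +05:30'

SELECT DATEPART(tzoffset, @dto_India)	-- 330
SELECT DATEPART(tz, @dto_India)		-- 330, same datepart, abbreviated name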

How do we get the current timezone that SQL Server is using? If you're on Azure SQL Database (or Managed Instance), it's simply:

SELECT CURRENT_TIMEZONE()

but for an on-premises SQL Server, you have to retrieve it from the registry:

DECLARE @TimeZone VARCHAR(50)
EXEC MASTER.dbo.xp_regread
	'HKEY_LOCAL_MACHINE'
	,'SYSTEM\CurrentControlSet\Control\TimeZoneInformation'
	,'TimeZoneKeyName',@TimeZone OUT
SELECT @TimeZone
GMT Standard Time

(taken from here: Get the Current Time Zone of the Server in SQL Server (T-SQL))

2. DST – Daylight Saving Time

The timezone isn't the only thing that determines the offset — whether DST (Daylight Saving Time) is currently being applied matters too. I chose India and Algeria in the examples above because neither country uses DST. Here in the UK, we 'gain' an hour in March and 'lose' an hour in October, but not on fixed dates. For example, you can look here When do the clocks change? to see when the next relevant date is. (This period in the UK summer is called British Summer Time, BST.) Whether the DATETIMEOFFSET was determined with DST in effect is not stored in the datatype; if this is relevant to you, you'll have to keep track of it yourself.

NB: Most European countries observe DST (all EU members do; Iceland, Russia, Belarus and Turkey don't).
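You can see DST changing the offset quite easily (a small illustration using the UK timezone; more on its name in the next section): the same wall-clock time gets a different offset in January and July:

SELECT CAST('2021-01-15 12:00' AS DATETIME) AT TIME ZONE 'GMT Standard Time'	-- ... +00:00 (winter)
SELECT CAST('2021-07-15 12:00' AS DATETIME) AT TIME ZONE 'GMT Standard Time'	-- ... +01:00 (BST)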

3. SQL Server names the timezones incorrectly

For me, this is the biggest irritant: in SQL Server, the timezone ‘GMT Standard Time’ is not GMT, it is the timezone of the British Isles, i.e. it includes BST. We can show this quite easily:

;WITH cte_Numbers AS
(
	-- Alternative to using a 'Numbers' table:
	SELECT
		TOP 10
		N = ROW_NUMBER() OVER (ORDER BY v1.number)
	FROM master..spt_values v1
	CROSS JOIN master..spt_values v2

), cte_Date AS
(
	SELECT
		N
		,dt = DATEADD(second, N-1, '2021-03-28 00:59:55')
	FROM cte_Numbers

), cte_AtTimeZone AS
(
	SELECT
		*
		,dto = dt AT TIME ZONE 'GMT Standard Time'
	FROM cte_Date
)
SELECT
	N
	,dt
	,dto
	,[tzoffset] = DATEPART(tzoffset, dto)
FROM cte_AtTimeZone
GO
N    dt                      dto                                tzoffset
---- ----------------------- ---------------------------------- -----------
1    2021-03-28 00:59:55.000 2021-03-28 00:59:55.000 +00:00     0
2    2021-03-28 00:59:56.000 2021-03-28 00:59:56.000 +00:00     0
3    2021-03-28 00:59:57.000 2021-03-28 00:59:57.000 +00:00     0
4    2021-03-28 00:59:58.000 2021-03-28 00:59:58.000 +00:00     0
5    2021-03-28 00:59:59.000 2021-03-28 00:59:59.000 +00:00     0
6    2021-03-28 01:00:00.000 2021-03-28 02:00:00.000 +01:00     60
7    2021-03-28 01:00:01.000 2021-03-28 02:00:01.000 +01:00     60
8    2021-03-28 01:00:02.000 2021-03-28 02:00:02.000 +01:00     60
9    2021-03-28 01:00:03.000 2021-03-28 02:00:03.000 +01:00     60
10   2021-03-28 01:00:04.000 2021-03-28 02:00:04.000 +01:00     60

(March 28th was when we entered BST this year — it’s always a Sunday at 1am.) As you can clearly see, GMT in SQL Server terms isn’t proper GMT, because it has taken BST into account. If you want real GMT, you have to convert to the timezone ‘UTC’.

Note that you can convert a DATETIME to a DATETIMEOFFSET using 'AT TIME ZONE', but the result keeps the DATETIME's millisecond resolution, i.e. 3 decimal places, whereas a DATETIMEOFFSET can store up to 7 (see the documentation here: datetimeoffset (Transact-SQL) for more information).

It's not just 'GMT Standard Time'; other timezones are affected too. For example, "Central European Time (CET) is a standard time which is 1 hour ahead of Coordinated Universal Time (UTC)" (as detailed here), but in fact, SQL Server incorporates 'Central European Summer Time' into this timezone, putting it a further hour ahead (+02:00) in the summer. I wouldn't be surprised if other timezones are treated in the same way — please check before you use them!
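The same check as before shows it (assuming your server recognises the 'Central European Standard Time' name): a summer date comes back with a +02:00 offset, i.e. CEST rather than CET:

SELECT CAST('2021-01-15 12:00' AS DATETIME) AT TIME ZONE 'Central European Standard Time'	-- ... +01:00
SELECT CAST('2021-07-15 12:00' AS DATETIME) AT TIME ZONE 'Central European Standard Time'	-- ... +02:00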

NB: I can see that someone has raised a similar issue here: W. Europe Standard time appears to be incorrect by 1 hour

4. Comparing with DATETIME / DATETIME2

Here's the biggest issue I've come across: comparing DATETIME and DATETIME2 with DATETIMEOFFSET. You have to be very careful: you cannot just mix OFFSET data with non-OFFSET and hope everything'll be alright – it won't. This catches people out:

SELECT GETDATE()
SELECT SYSDATETIMEOFFSET()
SELECT CAST(GETDATE() AS DATETIMEOFFSET)
SELECT GETDATE() AT TIME ZONE 'GMT Standard Time'
2021-07-31 18:27:24.187
2021-07-31 18:27:24.1871686 +01:00
2021-07-31 18:27:24.1866667 +00:00
2021-07-31 18:27:24.187 +01:00

Look at the third result: we’ve CAST the current date/time to a DATETIMEOFFSET, but the offset it’s returning us is ‘+00:00’, not the ‘+01:00’ we were expecting (seeing as the server is set to ‘GMT Standard Time’, which is currently an hour ahead of UTC). If we want the correct offset, we either use the syntax from the fourth SELECT statement, ‘AT TIME ZONE…’ or we can add the offset ourselves manually:

SELECT TODATETIMEOFFSET(CAST(GETDATE() AS DATETIMEOFFSET), 60)
2021-07-31 18:27:24.1866667 +01:00

If you see SQL code comparing a DATETIMEOFFSET with a function of GETDATE() or SYSDATETIME(), then it needs thoroughly checking – there’s a good chance it’s wrong!
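Here's a small self-contained illustration of why: in a mixed comparison, the DATETIME side is implicitly converted to a DATETIMEOFFSET with a +00:00 offset, so two values that represent the same local instant don't compare as equal:

DECLARE @dto DATETIMEOFFSET = '2021-07-31 18:30:00.000 +01:00'	-- i.e. 17:30 UTC
DECLARE @dt  DATETIME       = '2021-07-31 18:30:00.000'		-- the same local (BST) clock time

-- @dt is treated as '2021-07-31 18:30:00.000 +00:00' (18:30 UTC),
-- an hour later than @dto, so this returns 'not equal':
SELECT IIF(@dto = @dt, 'equal', 'not equal')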

Summary

Using offsets and timezones takes some getting used to, but I've found that as long as I remember not to mix OFFSET and non-OFFSET data and types, then everything's generally ok. There's a blog post to be written about the wisdom of using the DATETIMEOFFSET datatype in the first place — in the few places I've seen it used, it was definitely a case of 'premature globalisation', assuming the same database was going to be used if/when the company expanded overseas. In fact, for the application I'm working on, the UK and US systems have completely separate databases and code, so it's a non-issue.


Speeding up a FULL OUTER JOIN

Note: I am fully expecting someone to come along and comment that the reason why this works is obvious. That’s fine, at least I’ll learn something!

Usually in my blog posts, I present something and have a reasonable understanding of why that something is happening. Not so here – I’m going to show how I (massively) sped up a query, but I really don’t know why it works.

Some background: I was given a fairly lengthy query to improve, and was quietly confident I could make it more efficient and much faster. Of course, ‘pride comes before a fall’*, and for all my tweaking**, it didn’t make a great deal of difference. Eventually I realised that there was a FULL OUTER JOIN near the bottom of the code, and that was taking up the majority of the processing.

* Turns out that’s from Proverbs 16:18 — it’s always the Bible or Shakespeare, isn’t it.

** A lot of my ‘tweaks’ involve splitting out tables involved in JOINs into temp tables first, which I can then index. Quite often, the performance gains are impressive. Occasionally, they aren’t; c’est la vie!

By the time I finished with the rest of the query, the problematic piece of the code looked like this:

SELECT 
	{...columns ...}
FROM #IndexedTable1 t1
FULL OUTER JOIN #IndexedTable2 t2
	ON t2.IndexedKey = t1.IndexedKey

with about 12,000 records in one table, and 4,000 in the other. Here’s the important bit: the IndexedKey on each table was NULL-able, and a good proportion of the values were NULL. But what if they weren’t NULL, would that speed things up? Turns out the answer is yes…

Worked example

First, we need a couple of tables to join. I’ve manufactured them so that there is some pre-defined overlap by a UNIQUEIDENTIFIER (GUID/UID) key, named KeyID.

DROP TABLE IF EXISTS #MatchingKeys
GO

-- This table contains the UIDs that will be common to both tables
CREATE TABLE #MatchingKeys
(
	MatchingKeyID UNIQUEIDENTIFIER NOT NULL
)
GO

INSERT #MatchingKeys(MatchingKeyID)
	SELECT 
			MatchingKeyID = NEWID()
		FROM dbo.Numbers -- A standard Numbers table
		WHERE Number <= 3000
GO

-- Table One:
DROP TABLE IF EXISTS #TableXOne
GO

CREATE TABLE #TableXOne
(
	TableXOneID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyInt INT NOT NULL
)
GO

INSERT #TableXOne(TableXOneID,KeyID,MyInt)
SELECT
		TableXOneID = ROW_NUMBER() OVER (ORDER BY NEWID())
		,KeyID
		,MyInt = CAST(CAST(NEWID() AS VARBINARY(16)) AS INT)
	FROM
	(
		SELECT
				KeyID = MatchingKeyID
			FROM #MatchingKeys
		UNION ALL
		SELECT
				KeyID = CASE WHEN Number <= 3000 THEN NEWID() ELSE NULL END
			FROM dbo.NumbersTest 
			WHERE Number <= 57000
	) x
GO

-- Table Two:
DROP TABLE IF EXISTS #TableXTwo
GO

CREATE TABLE #TableXTwo
(
	TableXTwoID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyDate DATETIME NOT NULL
)
GO

INSERT #TableXTwo(TableXTwoID,KeyID,MyDate)
SELECT
		TableXTwoID = ROW_NUMBER() OVER (ORDER BY NEWID())
		,KeyID
		,MyDate = DATEADD(ms, CAST(CAST(NEWID() AS VARBINARY(16)) AS INT), GETDATE())
	FROM
	(
		SELECT
				KeyID = MatchingKeyID
			FROM #MatchingKeys
		UNION ALL
		SELECT
				KeyID = CASE WHEN Number <= 3000 THEN NEWID() ELSE NULL END
			FROM dbo.NumbersTest 
			WHERE Number <= 57000
	) x
GO

A quick look at our data:

SELECT COUNT(1) AS N, SUM(IIF(KeyID IS NOT NULL, 1, 0)) AS NumKeys FROM #TableXOne
SELECT COUNT(1) AS N, SUM(IIF(KeyID IS NOT NULL, 1, 0)) AS NumKeys FROM #TableXTwo
GO

N           NumKeys
----------- -----------
60000       6000

N           NumKeys
----------- -----------
60000       6000
SELECT TOP 100 * FROM #TableXOne ORDER BY TableXOneID
SELECT TOP 100 * FROM #TableXTwo ORDER BY TableXTwoID
GO
TableXOneID KeyID                                MyInt
----------- ------------------------------------ -----------
1           NULL                                 1865912662
2           NULL                                 1357143369
3           NULL                                 1464599252
4           NULL                                 1142728281
5           NULL                                 -629295448
6           NULL                                 779829015
7           BC02CED1-59E5-4F53-B3F0-5BAD4C92E9AE -291591389
8           NULL                                 -864066744
9           NULL                                 1026552142
10          NULL                                 1896807348
...
TableXTwoID KeyID                                MyDate
----------- ------------------------------------ -----------------------
1           NULL                                 2021-06-21 07:45:30.263
2           NULL                                 2021-06-14 18:55:52.193
3           NULL                                 2021-06-17 06:02:06.110
4           NULL                                 2021-06-30 06:46:41.193
5           NULL                                 2021-07-08 21:06:37.897
6           NULL                                 2021-06-30 22:44:31.940
7           NULL                                 2021-07-04 19:09:34.523
8           NULL                                 2021-06-16 10:58:35.430
9           DD67103D-A849-47BE-B04F-F96D3852E02F 2021-06-17 15:46:03.147
10          NULL                                 2021-06-24 18:13:44.010
...

(MyInt and MyDate are irrelevant, they’re just there to take up space.) Let’s look at some counts:


SELECT
	[# same] = (
		SELECT COUNT(1)
			FROM #TableXOne t1
			JOIN #TableXTwo t2
				ON t2.KeyID = t1.KeyID
	
	)
	,
	[# FOJ] = (
		SELECT COUNT(1)
			FROM #TableXOne t1
			FULL OUTER JOIN #TableXTwo t2
				ON t2.KeyID = t1.KeyID
	)
GO

# same      # FOJ
----------- -----------
3000        117000

As per our design, there are 3,000 records where the same KeyID (UID) exists in both tables, and 117,000 results if we did a FULL OUTER JOIN (that is, 60000 + 60000 – 3000 = 117000).

Ok, so now we’ll look at some performance stats:

DROP TABLE IF EXISTS #ResultsA
GO

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	INTO #ResultsA
	FROM #TableXOne t1
	FULL OUTER JOIN #TableXTwo t2
		ON t2.KeyID = t1.KeyID
GO
Table '#TableXTwo___...'. Scan count 1, logical reads 277, physical reads 0 [...]
Table '#TableXOne___...'. Scan count 1, logical reads 246, physical reads 0 [...]

SQL Server Execution Times:
  CPU time = 60390 ms,  elapsed time = 65435 ms.

Ouch, over a minute. (This is on a PC with zero load.) Let’s try an alternative formulation of the query that should give us identical results:

DROP TABLE IF EXISTS #ResultsA2
GO

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	INTO #ResultsA2
	FROM #TableXOne t1
	JOIN #TableXTwo t2
		ON t2.KeyID = t1.KeyID

UNION ALL

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	FROM #TableXOne t1
	LEFT JOIN #TableXTwo t2
		ON t2.KeyID = t1.KeyID
	WHERE t2.KeyID IS NULL

UNION ALL

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	FROM #TableXTwo t2
	LEFT JOIN #TableXOne t1
		ON t1.KeyID = t2.KeyID
	WHERE t1.KeyID IS NULL
GO

Table '#TableXTwo___...'. Scan count 3, logical reads 831, physical reads 0 [...]
Table '#TableXOne___...'. Scan count 3, logical reads 738, physical reads 0 [...]

SQL Server Execution Times:
  CPU time = 62047 ms,  elapsed time = 68365 ms.

Effectively the same result, just the expected increase in scans/logical reads. We’ll quickly check the results are identical:

SELECT * FROM #ResultsA EXCEPT SELECT * FROM #ResultsA2
GO
SELECT * FROM #ResultsA2 EXCEPT SELECT * FROM #ResultsA
GO
-- Empty recordsets, the tables are the same.

This is a very slow query — how can we speed it up?

The hack

Firstly, we’ll take copies of our two tables (naming them Y instead of X); and we’ll (a) replace NULL entries in KeyID with a ‘fake’ GUID, and (b) keep track of this with a new flag, IsFakeID.

DROP TABLE IF EXISTS #TableYOne
GO

CREATE TABLE #TableYOne
(
	TableYOneID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyInt INT NOT NULL
	,IsFakeID BIT NOT NULL
)
GO

DROP TABLE IF EXISTS #TableYTwo
GO

CREATE TABLE #TableYTwo
(
	TableYTwoID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyDate DATETIME NOT NULL
	,IsFakeID BIT NOT NULL
)
GO

INSERT #TableYOne(TableYOneID, KeyID, MyInt, IsFakeID)
SELECT
		TableXOneID
		,KeyID = ISNULL(x.KeyID, NEWID())
		,x.MyInt
		,IsFakeID = CASE WHEN x.KeyID IS NOT NULL THEN 0 ELSE 1 END
	FROM #TableXOne x
GO

INSERT #TableYTwo(TableYTwoID, KeyID, MyDate, IsFakeID)
SELECT
		TableXTwoID
		,KeyID = ISNULL(x.KeyID, NEWID())
		,x.MyDate
		,IsFakeID = CASE WHEN x.KeyID IS NOT NULL THEN 0 ELSE 1 END
	FROM #TableXTwo x
GO

Now let’s do the FULL OUTER JOIN on these tables, and save the result. Note that in the SELECT, if our KeyID is ‘fake’, we put the NULL back as it was originally.

DROP TABLE IF EXISTS #ResultsB
GO

SELECT
		t1.TableYOneID
		,IIF(t1.[IsFakeID] = 1, NULL, t1.KeyID) AS KeyID1
		,t1.MyInt
		,t2.TableYTwoID
		,IIF(t2.[IsFakeID] = 1, NULL, t2.KeyID) AS KeyID2
		,t2.MyDate
	INTO #ResultsB
	FROM #TableYOne t1
	FULL OUTER JOIN #TableYTwo t2
		ON t2.KeyID = t1.KeyID
GO
Table '#TableYTwo___...'. Scan count 1, logical reads 142, physical reads 0 [...]
Table '#TableYOne___...'. Scan count 1, logical reads 128, physical reads 0 [...]

SQL Server Execution Times:
 CPU time = 47 ms,  elapsed time = 58 ms.

Quite the speed up: from over a minute to 58 milliseconds! Let’s check the results are the same:

SELECT * FROM #ResultsA EXCEPT SELECT * FROM #ResultsB
GO
SELECT * FROM #ResultsB EXCEPT SELECT * FROM #ResultsA
GO
-- Empty recordsets, the tables are the same.

Unfortunately, my knowledge of Execution Plans isn't good enough to determine why the huge difference exists; as far as I can tell, the plans are identical (save an extra 'Compute Scalar' step in the 'hack' version, due to putting the 'real' NULLs back). The estimated number of rows for the Hash Match in the 'hack' version is about half the final number, which may be a clue.

Even more curiously..?

In my original work, the common keys were (already) indexed; in the example above, they’re not. If I add indexes on the key to all the tables, then the ‘non-hack’ version doesn’t use them anyway — maybe that’s not surprising, it’s sparse data. If I force the query to use the indexes, using WITH(INDEX(…)), then the query takes longer than I have patience for (it didn’t finish within 5 minutes). If I force indexes on the ‘hack’ version, then it takes about a second, but there are a lot more reads.

Summary

It’s pretty simple: this ‘hack’ works, I can show it works (it gives identical results to the ‘non-hack’ version), and it sped up the query greatly. But I don’t know why it works — moreover, why couldn’t SQL Server be doing this under the hood? (There must be a reason!) When I find out, I’ll put an update here.


Searching a database for a specific value

I never thought I’d ever need to write this code, but seeing as I did, I may as well share it! I hope you never have to resort to using it…

Recently, I've been working on a project that involved Microsoft SQL Server databases from a couple of third-party vendors (one of whom is well-known in the industries I work in). Somewhat shockingly, neither of these vendors provides any documentation for their databases — we were given database backups, and had to work out the structure and 'what column names represent' for ourselves. One of the databases wasn't too bad, the table and column names made some sense; the other was highly normalised (over-normalised, in my opinion) and many of the objects were cryptically named.

Both databases contained data that we (as a company) had submitted, so we knew that certain values had to be present in the databases somewhere. The databases were too large to just run SELECT * on all the tables and manually look for our data, so I wrote SQL code to find the values we were looking for. In this way, we managed to piece together the queries we required. (Having documentation would’ve been preferable, but needs must…)

So, here's the code I wrote; it's not a stored procedure, you just paste it into a query window, put your requirements in at the top, then run it.

I’ll demonstrate using Microsoft’s AdventureWorks database (AdventureWorks2017 on my PC).

IMPORTANT CAVEATS:

  1. Although this code has run very quickly every time I’ve used it, I was running it on an unimportant server, where more often than not, I was the only user. Be careful not to bring your server down! If in doubt, talk to your DBA.
  2. The code wasn’t written to be particularly friendly, so if you have specific requirements, you’ll have to tweak it.
  3. There’s nothing clever about this code, it’s just brute forcing the problem.

The code

-- You have to give values for the following 3 variables:
DECLARE @TypesOfInterest NVARCHAR(MAX) = 'int,smallint,bigint,tinyint'
DECLARE @ValueOfInterest NVARCHAR(MAX) = '19237'
DECLARE @StopWhenFound BIT = 0

-- ****************************************************************************
-- PART ONE: Get the tables/columns that match our datatypes of interest:
-- ****************************************************************************
DROP TABLE IF EXISTS #MatchingCols

;WITH cte_AllColumns AS
(
    SELECT
            TableName        = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']'
            ,ColumnName      = '[' + c.COLUMN_NAME + ']'
            ,ColSortOrder    = c.ORDINAL_POSITION
            ,PK              = IIF(pk.COLUMN_NAME IS NOT NULL, 1, 0)
            ,MatchingType    = IIF(x.DATA_TYPE IS NOT NULL, 1, 0)
        FROM INFORMATION_SCHEMA.COLUMNS c
        JOIN INFORMATION_SCHEMA.TABLES t
            ON t.TABLE_CATALOG = c.TABLE_CATALOG
            AND t.TABLE_SCHEMA = c.TABLE_SCHEMA
            AND t.TABLE_NAME = c.TABLE_NAME
            AND t.TABLE_TYPE = 'BASE TABLE' -- Note: i.e. not views, just tables
        LEFT JOIN (
            SELECT DATA_TYPE = s.[value]
                FROM STRING_SPLIT(@TypesOfInterest, ',') s
        ) x
            ON x.DATA_TYPE = c.DATA_TYPE
        LEFT JOIN (
            SELECT
                    ku.TABLE_CATALOG
                    ,ku.TABLE_SCHEMA
                    ,ku.TABLE_NAME
                    ,ku.COLUMN_NAME
                FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
                JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS ku
                    ON ku.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
                WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY' 
        ) pk
            ON pk.TABLE_CATALOG  = c.TABLE_CATALOG
            AND pk.TABLE_SCHEMA  = c.TABLE_SCHEMA
            AND pk.TABLE_NAME    = c.TABLE_NAME
            AND pk.COLUMN_NAME   = c.COLUMN_NAME

), cte_StringAgg AS
(
    SELECT 
            TableName
            ,SELECT_Cols = STRING_AGG(IIF((PK = 1) OR (MatchingType = 1 AND PK = 0), ColumnName, NULL),',')
                                WITHIN GROUP (ORDER BY ColSortOrder)
            ,WHERE_Cols  = STRING_AGG(IIF(MatchingType = 1, ColumnName + ' = ##', NULL), ' OR ')
                                WITHIN GROUP (ORDER BY ColSortOrder)
        FROM cte_AllColumns
        GROUP BY TableName
)
    SELECT
            ID = ROW_NUMBER() OVER (ORDER BY TableName)
            ,TableName
            ,SELECT_Cols
            ,WHERE_Cols
        INTO #MatchingCols
        FROM cte_StringAgg
        WHERE WHERE_Cols IS NOT NULL

-- ****************************************************************************
-- PART TWO: Loop through the tables:
-- ****************************************************************************

DROP TABLE IF EXISTS ##Results
-- Note: Global temp table so I can look at the results
-- as they're inserted, in a different query window.

CREATE TABLE ##Results
(
    TableName NVARCHAR(256) NOT NULL PRIMARY KEY CLUSTERED
    ,[Query] NVARCHAR(MAX) NOT NULL
    ,CreatedOn DATETIME NOT NULL DEFAULT(GETDATE())
)

DECLARE @id INT = 1, @maxid INT, @SELECT_sql NVARCHAR(MAX), @INSERT_sql NVARCHAR(MAX), @TableName NVARCHAR(256), @rc INT
SELECT @maxid = MAX(id) FROM #MatchingCols
WHILE (@id <= @maxid)
BEGIN

    SET @TableName  = NULL
    SET @SELECT_sql = NULL
    SET @INSERT_sql = NULL
    SET @rc         = NULL

    SELECT
            @TableName    = m.TableName
            ,@SELECT_sql  = N'SELECT TOP 1 ' + m.SELECT_Cols + N' FROM ' + m.TableName + N' WHERE (' + REPLACE(m.WHERE_Cols, '##', @ValueOfInterest) + N')'
        FROM #MatchingCols m
        WHERE id = @id
    PRINT @SELECT_sql

    SELECT @INSERT_sql = N'IF EXISTS(' + @SELECT_sql + N') INSERT ##Results(TableName, [Query]) VALUES(''' + @TableName + N''', ''' + @SELECT_sql + N''')'
    EXEC sp_executesql @stmt = @INSERT_sql
    SET @rc = @@ROWCOUNT

    IF @StopWhenFound = 1
    BEGIN
        IF @rc > 0 BREAK
    END

    SET @id = @id + 1
END
GO
-- And finally, the results:
SELECT * FROM ##Results
GO

(NB: The part of the query that indicates which columns are involved in the primary key, I took from this StackOverflow post.)

The three pieces of information you have to supply (as seen at the top of the code) are:

  • @TypesOfInterest : A comma-separated list of valid SQL Server datatypes (as found in the table sys.types)
  • @ValueOfInterest : The piece of information you're looking for. If you're searching for a string, you'll have to wrap it in quotes within the quotes, e.g.
    DECLARE @ValueOfInterest NVARCHAR(MAX) = '''Dobney''' (3 single quotes each side of the word/phrase); there's a worked setup just below this list
  • @StopWhenFound : If 1, the code stops when the value is found for the first time. If 0, all tables are searched.
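For example, a search of all the character columns for the string 'Dobney' (stopping at the first hit) would be set up like this:

-- Search string columns for the word 'Dobney', stop as soon as it's found:
DECLARE @TypesOfInterest NVARCHAR(MAX) = 'char,varchar,nchar,nvarchar'
DECLARE @ValueOfInterest NVARCHAR(MAX) = '''Dobney'''	-- 3 single quotes each side
DECLARE @StopWhenFound BIT = 1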

When I run this in the AdventureWorks2017 database, I get

If I select the 'Query' field in the 8th row and copy it out to another query window (removing the TOP 1):

SELECT [TransactionID],[ProductID],[ReferenceOrderID],[ReferenceOrderLineID],[Quantity]
FROM [Production].[TransactionHistoryArchive]
WHERE ([TransactionID] = 19237 OR [ProductID] = 19237 OR [ReferenceOrderID] = 19237
	OR [ReferenceOrderLineID] = 19237 OR [Quantity] = 19237)

and then run it, we get:

and we’ve found a couple of instances of our search value, 19237.

Suggested improvements

This is supposed to be a piece of code of limited use, so I’m not planning to spend any more time making it more user-friendly. However, if I was going to do so:

  1. I’d record the primary key values in the SELECT statement (the Query column in the table ##Results), so SQL Server could go straight to the record, rather than (potentially) doing another table scan — but it doesn’t really matter, as the query results will probably be cached anyway.
  2. I’d generate a field that shows exactly which columns matched the value we’re looking for.
  3. Maybe the input could be friendlier, automatically adding quotes to string types.

Anyway, hopefully this code will be useful to someone — do let me know if you’ve been driven to use it!


CREATE the definition of a temp table (in SQL)

At time of writing, the main MS SQL database I work with day-to-day has hundreds of tables, and many of those tables have hundreds of columns. Alongside these tables are hundreds of views, sprocs, and assorted functions. Writing code therefore has this extra overhead — working out where data comes from, and how it’s modified — that has to be factored in when it comes to estimating how long pieces of work will take.

Hence, I find that I’m always ‘writing code to write code’ – sometimes dynamic SQL, sometimes pasting column names in and out of Excel, but often just writing ‘helper’ queries that return recordsets that contain, e.g. lists of columns, that I paste back into my SQL query window. Below, I’ll demonstrate the kind of code I use to create table definitions from existing temp tables.

Why do we need this? It’s useful because you can do SELECT {columns} INTO #MyTempTable FROM {tables} without having to specify any datatypes – the types of the columns in the temporary table #MyTempTable are derived behind the scenes. Using the code below, I can turn my temp tables into real ones.

To begin with, we’ll dummy up a temp table, #MyTempTable, full of random data:

DROP TABLE IF EXISTS #MyTempTable
GO

;WITH cte_Random AS
(
	SELECT
			ID = [Number]
			,Var_Int08 = CAST(CAST(NEWID() AS BINARY(16)) AS TINYINT) 
			,Var_Int16 = CAST(CAST(NEWID() AS BINARY(16)) AS SMALLINT) 
			,Var_Int32 = CAST(CAST(NEWID() AS BINARY(16)) AS INT) 
			,Var_Int64 = CAST(CAST(NEWID() AS BINARY(16)) AS BIGINT) 
	FROM (
	SELECT TOP 1000
		[Number] = CAST(ROW_NUMBER() OVER (ORDER BY object_id, column_id) AS INT)
		FROM sys.columns
	) x
)
SELECT
	*
	,Var_Float24		= CAST(1.0 * Var_Int32 / Var_Int16 AS FLOAT(24))
	,Var_Float53		= CAST(1.0 * Var_Int64 / Var_Int32 AS FLOAT(53))
	,Var_Decimal1801	= CAST(1.0 * Var_Int32 / Var_Int16 AS DECIMAL(18,1))
	,Var_Decimal3810	= CAST(1.0 * Var_Int64 / Var_Int32 AS DECIMAL(38,10))
	,Var_Date		= CAST(DATEADD(day, Var_Int16, '2021Mar09') AS DATE)
	,Var_DateTime		= DATEADD(s,  Var_Int32, CAST('2021Mar09' AS DATETIME))
	,Var_DateTime2		= DATEADD(ns, Var_Int32, CAST('2021Mar09' AS DATETIME2))
	,Var_Char10		= CAST(CAST(NEWID() AS BINARY(16)) AS CHAR(10))
	,Var_Varchar20		= CAST(CAST(NEWID() AS BINARY(16)) AS VARCHAR(20))
	,Var_Nvarchar40		= CAST(CAST(NEWID() AS BINARY(16)) AS NVARCHAR(40))
	,Var_SmallMoney 	= CAST(Var_Int32 / 10000.0 AS SMALLMONEY)
	,Var_Money 		= CAST(Var_Int64 / 10000.0 AS MONEY)
INTO #MyTempTable
FROM cte_Random
GO
-- Let's say we require the ID column to be non-NULL:
ALTER TABLE #MyTempTable ALTER COLUMN ID INT NOT NULL
GO

I’ve included several of the different datatypes, but not all of them; there are quite a few recognised by MS SQL (you can read about them here: Data types (Transact-SQL) and An overview of SQL Server data types), but these ones above are the data types I deal with on a typical day.

(NB: It works exactly the same for global temp tables, so we could’ve called our table ##MyTempTable instead.)

Let’s look at some of the data:

SELECT TOP 10 * FROM #MyTempTable

ID   Var_Int08 Var_Int16 Var_Int32   Var_Int64             Var_Float24
---- --------- --------- ----------- --------------------- ------------
1    14        -30998    658979450   -9167272261900257581  -21258.8
2    137       -27919    -341309392  -6857223736286779664  12225
3    133       -30579    -1481699359 -4780055374266335512  48454.8
4    84        26124     2985519     -7160508429147909805  114.283
5    61        14039     842379332   -7969292711582563318  60002.8
6    89        7825      -32357595   -7605166744987937108  -4135.16
7    153       29284     1888870407  -8917976039370151771  64501.8
8    40        -31383    -1906986840 -9210204572955484902  60765
9    149       -7316     -1955688049 -8633019267239556802  267317
10   202       -12795    901983213   -7712186477348472978  -70495

(I’ve only shown the first few columns)

Looks ok, random enough… apart from Var_Int64, which is always negative? The contents of the table don’t matter at all for our current purposes, but we’ll come back to this at the end, because it’s quite interesting!

First, we need to know how MS SQL refers to our temp table. You may know that tables in the tempdb database (where temp tables live) have slightly modified names:

SELECT object_id, name FROM tempdb.sys.objects WHERE name LIKE N'#MyTempTable%'

object_id	name
------------    -------------------------------------------------------------
-1514606060	#MyTempTable____{....lots of underscores...}_____000000000003

The object_id and name will be different on your server. The reason for the underscores / digits after #MyTempTable is because more than one user can create a temp table with the same name, so under the hood, SQL Server gives the table a unique name to keep track of it — however, we don’t need to know this; as long as we stay in the same session, we can keep referring to the table as #MyTempTable.

ANNOYING NOTE: I use Azure SQL Server, which drops temp tables after a few minutes if the session hasn’t seen any activity. I’ve yet to determine whether the amount of time is something our administrators can control (they don’t think so). I’ve added this to my ever-increasing list of ‘Things About Azure I Don’t Like’!

Anyway, we know where our table lives, so we can use the standard view INFORMATION_SCHEMA.COLUMNS to show us the column definitions. We’ll pull out the information we need to be able to create the SQL datatype fragments for each column, and store it all in a temp table:

DROP TABLE IF EXISTS #ColDefs
GO

DECLARE @MyTableName NVARCHAR(128) = '#MyTempTable'

;WITH cte_Columns AS
(
	SELECT
		COLUMN_NAME
		,ORDINAL_POSITION
		,IS_NULLABLE
		,DATA_TYPE
		,CHARACTER_MAXIMUM_LENGTH
		,NUMERIC_PRECISION
		,NUMERIC_SCALE
	FROM tempdb.INFORMATION_SCHEMA.COLUMNS c
	WHERE c.TABLE_NAME = (
		SELECT [name]
			FROM tempdb.sys.objects
			WHERE OBJECT_ID = OBJECT_ID('tempdb.dbo.' + @MyTableName)
	)
)
SELECT 
	ColName = '[' + COLUMN_NAME + ']'
	,ColType =
		CASE WHEN DATA_TYPE = 'real' THEN 'FLOAT' ELSE UPPER(DATA_TYPE) END
		+ CASE
		WHEN DATA_TYPE IN ('char','varchar','nvarchar') THEN '(' + CAST(CHARACTER_MAXIMUM_LENGTH AS VARCHAR) + ')'
		WHEN DATA_TYPE IN ('decimal','numeric') THEN '(' + CAST(NUMERIC_PRECISION AS VARCHAR) + ',' + CAST(NUMERIC_SCALE AS VARCHAR) + ')'
		WHEN DATA_TYPE IN ('float','real') THEN '(' + CAST(NUMERIC_PRECISION AS VARCHAR) + ')'
		ELSE '' END
	,IsNullable = IIF(IS_NULLABLE = 'YES', 1, 0)
	,SortOrder = ORDINAL_POSITION
INTO #ColDefs
FROM cte_Columns
ORDER BY ORDINAL_POSITION
GO

Note that numeric and decimal are synonyms, and real is equivalent to float(24). If you’re going to add in extra data types, you’ll have to modify the code that sets ColType.
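For example (a quick sketch I haven't wired into the script above), datetime2, time and datetimeoffset columns carry their fractional-seconds precision in the DATETIME_PRECISION column of INFORMATION_SCHEMA.COLUMNS, so the extra branch would look something like this:

-- Standalone check of the extra mapping, using the temp table we created earlier:
SELECT
	ColName = '[' + COLUMN_NAME + ']'
	,ColType = UPPER(DATA_TYPE)
		+ CASE WHEN DATA_TYPE IN ('datetime2','time','datetimeoffset')
			THEN '(' + CAST(DATETIME_PRECISION AS VARCHAR) + ')'
			ELSE '' END
	FROM tempdb.INFORMATION_SCHEMA.COLUMNS
	WHERE TABLE_NAME LIKE '#MyTempTable%'
	AND DATA_TYPE IN ('datetime2','time','datetimeoffset')

which should return [Var_DateTime2] DATETIME2(7) for our table, since DATETIME2 defaults to 7 decimal places.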

So, what do we have in our #ColDefs table?

SELECT * FROM #ColDefs

ColName           ColType         IsNullable  SortOrder
----------------- --------------- ----------- -----------
[ID]              INT             0           1
[Var_Int08]       TINYINT         1           2
[Var_Int16]       SMALLINT        1           3
[Var_Int32]       INT             1           4
[Var_Int64]       BIGINT          1           5
[Var_Float24]     FLOAT(24)       1           6
[Var_Float53]     FLOAT(53)       1           7
[Var_Decimal1801] DECIMAL(18,1)   1           8
[Var_Decimal3810] DECIMAL(38,10)  1           9
[Var_Date]        DATE            1           10
[Var_DateTime]    DATETIME        1           11
[Var_DateTime2]   DATETIME2       1           12
[Var_Char10]      CHAR(10)        1           13
[Var_Varchar20]   VARCHAR(20)     1           14
[Var_Nvarchar40]  NVARCHAR(40)    1           15
[Var_SmallMoney]  SMALLMONEY      1           16
[Var_Money]       MONEY           1           17

Now we have enough information to be able to recreate our CREATE, INSERT and SELECT statements:

DECLARE @CreateSQLCols VARCHAR(MAX)
	,@InsertSQLCols VARCHAR(MAX)
	,@SelectSQLCols VARCHAR(MAX)

;WITH cte_SQL AS
(
SELECT
		sep = IIF(SortOrder > 1, ',', '')
		,ColName
		,ColType
		,NullSQL = IIF(IsNullable = 1, 'NULL', 'NOT NULL')
		,SortOrder
		,MaxColNameLen = MAX(LEN(ColName)) OVER () -- For our nicely-tabbed SELECT statement
	FROM #ColDefs
)
SELECT
	@CreateSQLCols	= STRING_AGG(sep + ColName + ' ' + ColType + ' ' + NullSQL, CHAR(13) + CHAR(10)) WITHIN GROUP (ORDER BY SortOrder)
	,@InsertSQLCols	= STRING_AGG(sep + ColName, CHAR(13) + CHAR(10)) WITHIN GROUP (ORDER BY SortOrder)
	,@SelectSQLCols	= STRING_AGG(sep + ColName + REPLICATE(CHAR(9), 1 + (MaxColNameLen - LEN(ColName)+1)/4) + '= x.' + ColName, CHAR(13) + CHAR(10)) WITHIN GROUP (ORDER BY SortOrder)
FROM cte_SQL

SELECT CreateSQLCols = @CreateSQLCols, InsertSQLCols = @InsertSQLCols, SelectSQLCols = @SelectSQLCols, ColListSQL = REPLACE(@InsertSQLCols, CHAR(13) + CHAR(10), '')

which looks like this in the Results pane:

We can now just cut’n’paste into our query window. CreateSQLCols is:

[ID] INT NOT NULL
,[Var_Int08] TINYINT NULL
,[Var_Int16] SMALLINT NULL
,[Var_Int32] INT NULL
,[Var_Int64] BIGINT NULL
,[Var_Float24] FLOAT(24) NULL
,[Var_Float53] FLOAT(53) NULL
,[Var_Decimal1801] DECIMAL(18,1) NULL
,[Var_Decimal3810] DECIMAL(38,10) NULL
,[Var_Date] DATE NULL
,[Var_DateTime] DATETIME NULL
,[Var_DateTime2] DATETIME2 NULL
,[Var_Char10] CHAR(10) NULL
,[Var_Varchar20] VARCHAR(20) NULL
,[Var_Nvarchar40] NVARCHAR(40) NULL
,[Var_SmallMoney] SMALLMONEY NULL
,[Var_Money] MONEY NULL

InsertSQLCols is the same list but just the column names. SelectSQLCols is the same, but with an assignment (e.g. [Var_Int32] = x.[Var_Int32]) so we can make changes to individual columns if we need to — note also we’ve calculated the number of tabs needed, so the columns line up, which makes it easier to read. ColListSQL is the same as InsertSQLCols, but all the columns are on one line (still comma-separated).

If we want to, we can build the whole CREATE statement (it’s officially called DDL, Data Definition Language):

SELECT CreateSQL = 'CREATE TABLE dbo.MyTable' + CHAR(13) + CHAR(10) + '(' + CHAR(13) + CHAR(10) + CHAR(9) + REPLACE(@CreateSQLCols, CHAR(13) + CHAR(10), CHAR(13) + CHAR(10) +  CHAR(9)) + CHAR(13) + CHAR(10) + ')' + CHAR(13) + CHAR(10)

NB: CHAR(13) + CHAR(10) is a carriage return + line feed, it moves us on to a new line. CHAR(9) is a tab.

As you can hopefully see, when you have hundreds (and hundreds and hundreds…) of columns to deal with, it’s much easier to wrangle them in code, rather than type SQL out by hand and risk mistakes. If you’re not used to this way of writing SQL, it can seem daunting, but you quickly get the hang of it — and it’s a huge time-saver.


What’s going on with Var_Int64?

Finally, we come back to the issue with Var_Int64: it doesn't look as random as the other columns.

Using NEWID() as a basis to generate other random data is pretty common; most of the time, we do something like this:

SELECT ABS(CAST(CAST(NEWID() AS BINARY(16)) AS INT)) % 20
-- or even:
SELECT ABS(CHECKSUM(NEWID())) % 20

which gives us a random integer between 0 and 19. (See CHECKSUM (Transact-SQL).) Now, while NEWID might give us a GUID that is random ‘overall’, not every part of it is wholly random.

Below is part of a recordset consisting of some randomly-generated GUIDs (using NEWID), the GUID converted to binary, then that binary converted to an integer:

guid                                 CAST(guid AS BINARY(16))           CAST([2] AS BIGINT)
------------------------------------ ---------------------------------- --------------------
1F392DA2-9DEE-40CE-ACF5-0F70A1FB76E3 0xA22D391FEE9DCE40ACF50F70A1FB76E3 -5983859553463470365
30225F93-0C00-4A51-A63B-F84A0EE84177 0x935F2230000C514AA63BF84A0EE84177 -6468303442826215049
B7ED94A5-1274-4E96-B736-5E28C69B9BFF 0xA594EDB77412964EB7365E28C69B9BFF -5244901186802574337
753E907A-C500-4D83-879F-A76872407566 0x7A903E7500C5834D879FA76872407566 -8674030290257021594
D4C732E8-59DD-49D7-AF70-090F74A32DC3 0xE832C7D4DD59D749AF70090F74A32DC3 -5805129957694558781
D2855FC4-637B-45BA-BCB0-7BC16EA37648 0xC45F85D27B63BA45BCB07BC16EA37648 -4850240727962913208
5057A49D-0275-467A-A44D-81C3552D51A2 0x9DA4575075027A46A44D81C3552D51A2 -6607482402335010398
7916A696-F878-4245-A192-76CDC032306F 0x96A6167978F84542A19276CDC032306F -6804245460938510225
2D528B2C-574B-459A-8456-75B2A6B3CA03 0x2C8B522D4B579A45845675B2A6B3CA03 -8910805402544518653
130A63EE-9060-4434-8B3C-D05E3BFE1805 0xEE630A13609034448B3CD05E3BFE1805 -8413620900682917883

First, notice how the GUID is converted into binary: the last 16 hex digits keep their order, but within each of the first three blocks, the bytes are reversed. Now look at the 3rd block of digits in the GUIDs — they all begin with a '4'. And the first digit in the 4th block doesn't range much either. In fact, I generated a million GUIDs, and for all of them, the 13th digit was '4', and the 17th digit only took the values '8', '9', 'A' and 'B'. From the wiki entry (Universally unique identifier), the 13th digit is the UUID 'version' field (always 4 here, meaning the GUID is randomly generated), and the 17th digit holds the 'variant' bits, so it's not surprising that these digits have restricted ranges.

Let’s see what happens when we convert some BIGINT values to a BINARY(16):

BIGINT                BINARY(16)
--------------------  ----------------------------------
-9223372036854775808  0x00000000000000008000000000000000  -- smallest possible value of BIGINT
         -2147483648  0x0000000000000000FFFFFFFF80000000  -- smallest possible value of INT 
              -32768  0x0000000000000000FFFFFFFFFFFF8000  -- smallest possible value of SMALLINT
                  -2  0x0000000000000000FFFFFFFFFFFFFFFE
                  -1  0x0000000000000000FFFFFFFFFFFFFFFF
                   0  0x00000000000000000000000000000000
                   1  0x00000000000000000000000000000001
                   2  0x00000000000000000000000000000002
               32767  0x00000000000000000000000000007FFF  -- largest possible value of SMALLINT
          2147483647  0x0000000000000000000000007FFFFFFF  -- largest possible value of INT
 9223372036854775807  0x00000000000000007FFFFFFFFFFFFFFF  -- largest possible value of BIGINT

A BIGINT is only 8 bytes, so when we convert from BINARY(16), the first 8 bytes (16 hex digits) get ignored; our final number is taken from the 17th to 32nd digits of the GUID only. As we've seen above, the 17th digit only ranges from '8' to 'B' — and because of the way signed integers are stored (the top bit is the sign bit), this means the number will always be negative.
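You can see the truncation directly, using the first GUID from the table above; only the rightmost 8 bytes survive:

-- BINARY(16) -> BIGINT keeps only the last 8 bytes (0xACF50F70A1FB76E3 here):
SELECT CAST(0xA22D391FEE9DCE40ACF50F70A1FB76E3 AS BIGINT)	-- -5983859553463470365
SELECT CAST(0x0000000000000000ACF50F70A1FB76E3 AS BIGINT)	-- the same value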

The lesson here is: however you create your random data in SQL, check that it really is random, especially if you’re using it for statistical purposes!


Scorecards versus Neural Networks

First, some explanatory points:

  • More often than not, scorecards are just scaled logistic regression models, if that makes this post easier to follow.
  • By ‘ML’, I mean the family of modern machine learning algorithms – neural networks (NNs), random forests, support vector machines, etc.
  • What I’ve written here has a big caveat: This happens in my experience. If you’ve experienced differently, please do let me know!

Introduction

I must apologise: the title, ‘Scorecards versus Neural Networks’, is sort of clickbait-y; these two things are related, but not the same — it’s a bit like saying ‘Bicycles versus Prams’, or ‘Pasta versus Bread’. The point of this blog post is that in my experience the people who build these models have different mindsets, which ⚠ spoiler ⚠ leads to scorecards having better results (in certain domains) than ML models.

Of course, neural networks can do things that a scorecard could never do — image recognition, for example — but there are good model-building practices that should apply whatever flavour of model you’re creating.

Scorecards

I’ve worked for a number of consumer lending businesses, and all of them used (one or more) scorecards as their primary lending decision mechanism. This isn’t just because I worked there, and scorecards are my ‘thing’; all the heads of Credit Risk I’ve worked with wanted to use scorecards. Scorecards are an old-fashioned technology, but they’re very easy to understand; they’re not a ‘black box’ like more modern ML techniques. Building a good scorecard is somewhere between an art and a science, but understanding it, what it’s doing and why it arrives at its final score, couldn’t be simpler: it’s just a case of adding up points for positive characteristics, taking away points for negative. An increase in a variable relating to a positive characteristic, leaving all other variables alone, gives you an increase in score, and vice-versa*. Due to this simple nature, scorecards are trivial to code ‘in production’.

However, neural networks and other ML models tend to be highly non-linear, and there’s absolutely no guarantee that changing one input variable will give you the simple increase or decrease in score you were expecting. And they are definitely not trivial to implement.

* This isn’t necessarily true if you build bad scorecards, but I try not to build bad scorecards 😉


Machine learning getting it wrong

Here are a few instances of machine learning models not behaving as expected; they're not in the credit risk domain, but that doesn't matter: the underlying principles are the same.

1. “‘Typographic attack’: pen and paper fool AI into thinking apple is an iPod” (2021)

But even the cleverest AI can be fooled with the simplest of hacks. If you write out the word "iPod" on a sticky label and paste it over the apple, Clip does something odd: it decides, with near certainty, that it is looking at a mid-00s piece of consumer electronics [rather than an apple]. In another test, pasting dollar signs over a picture of a dog caused it to be recognised as a piggy bank.

https://www.theguardian.com/technology/2021/mar/08/typographic-attack-pen-paper-fool-ai-thinking-apple-ipod-clip

2. “Amazon scraps secret AI recruiting tool that showed bias against women” (2018)

But by 2015, the company realized its new system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way. That is because Amazon’s computer models were trained to vet applicants by observing patterns in resumes submitted to the company over a 10-year period. Most came from men, a reflection of male dominance across the tech industry. In effect, Amazon’s system taught itself that male candidates were preferable.

https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

3. “A bookshelf in your job screening video makes you more hirable to AI” (2021)

The addition of art or a bookshelf in the background made an Asian test subject seem much more conscientious and significantly less neurotic compared to the same faux applicant in front of a plain background.

https://www.inputmag.com/culture/a-bookshelf-in-your-job-screening-video-makes-you-more-hirable-to-ai

4. “AI image recognition fooled by single pixel change” (2017)

Computers can be fooled into thinking a picture of a taxi is a dog just by changing one pixel, suggests research.

https://www.bbc.com/news/technology-41845878

5. “Twitter apologises for ‘racist’ image-cropping algorithm” (2020)

Twitter has apologised for a “racist” image cropping algorithm, after users discovered the feature was automatically focusing on white faces over black ones. The company says it had tested the service for bias before it started using it, but now accepts that it didn’t go far enough.

https://www.theguardian.com/technology/2020/sep/21/twitter-apologises-for-racist-image-cropping-algorithm


From these stories, you could be forgiven for concluding that ML models aren’t all that they’re cracked up to be. Actually, I rather think the stories point to one or more of the following issues:

  • A misunderstanding of the problem domain
  • Inadequate or incorrect training data
  • A lack of testing, post-model build
  • Inadequate monitoring of models in production
  • Confusion about the limitations of models

Scorecards getting it wrong?

The above stories are all high-profile, because ‘AI’ is news. Scorecards aren’t news, they’re niche; and if a company lends to the wrong people due to a fault with the scorecards, the general public doesn’t get to hear about it.

Scorecards can suffer from each of the five issues I itemised above, in exactly the same way as ML. But scorecard builders will likely spend a lot more time ensuring their scorecard is valid.

I don’t really know of any instances of scorecards going wrong, other than data being incorrectly coded; it’s unfortunately quite common for missing data (blanks, NULLs) or extreme (‘out-of-bounds’) values to get coded as the ‘default’ modal value. There’s an apocryphal story about a meaningless random variable making it all the way through to production, but I think the story exists to scare junior scorecard builders. 😉

But really, a scorecard shouldn’t go live without being thoroughly (and provably) tested. It’s a serious business, and there’s often a great deal of money at stake.


What are neural networks actually doing?

The channel 3Blue1Brown (run by mathematician Grant Sanderson) has a series of ‘Deep learning’ videos that explain neural networks from the ground up. The first video, But what is a Neural Network?, describes a dataset of handwritten digits, scanned and discretised, that are used as example data to build a neural network. In fact, the classification rate is very impressive: the neural network classifies 98% of digits correctly.

In the second video, Gradient descent, how neural networks learn, we’re shown pictorially (at around the 14 minute mark) what the network is doing – it’s not picking out lines and loops and shapes common to numbers, as you might expect, but:

“[I]t would seem that in the unfathomably large 13,000 dimensional space of possible weights and biases, our network found itself a happy little local minimum that, despite successfully classifying most images, doesn’t exactly pick up on the patterns that we might have hoped for…

To human eyes, the patterns don’t look like anything at all, certainly nothing resembling numbers. And yet the NN is working at an accuracy rate of 98%! Even more disturbing:

… and to really drive this point home, watch what happens when you input a random image; if the system is smart, you might expect it to either feel uncertain, maybe not really activating any of those 10 output neurons, or activating them all evenly, but instead it confidently gives you some nonsense answer

So, with regards to this particular neural network:

  • You can’t tell how it works — you only know that it does.
  • If you put nonsense data in, you get a real (but totally irrelevant) answer out

As is pointed out in the videos, the network used for this model is somewhat old-fashioned; there are more modern alternatives.


A couple of anecdotes (not quite ‘case studies’) from my own personal experience:

Anecdote #1: Scorecard vs. neural networks in consumer lending

At one of the businesses I worked for, we trialled a neural network model to see if it could out-perform our recently-built scorecards. A well-known company (in the decision science arena) took our data and used their flagship tool to build a neural net model. Their whole process was web-based, end to end, and the interface (GUI/UX) was beyond impressive. They presented their model, which had a gini value 5 percentage points higher than our scorecard; that is, more predictive, and enough so that the savings made by better predictions were greater than the cost of using their software. [NB: The gini is just a classification score, from 0% (no better than randomly guessing) to 100% (perfect answer every time).]

However: this initial model used hundreds of variables (if I recall correctly, around 300), as opposed to our scorecard’s 20 or so. A very important part of managing risk is understanding the variables that make up a model, how they might vary over time, and their impact on the result. Because of their approach of using all the data available without question, their model had included a variable that was actually removed half-way through the lifetime of the build population, and wasn’t available in production. Once this variable was removed, and the number of variables reduced to under 100, their model gini was only marginally higher than our scorecard’s. Taking into account the operating cost of the model, there was no financial gain to be had from using the more complicated model.

I subsequently worked with another lender (in this case, B2B) who used the same decision science company’s product to provide a credit-worthiness score – it was one of several scores available to underwriters, and all the loans were manually assessed. The lead underwriter said that, in her opinion, the score was very good, but it would occasionally produce scores that didn’t make sense alongside their other data. This meant the underwriters began distrusting the neural network score, and because there was no way of understanding WHY the score was what it was, it ended up unused in the overall decision-making process, even though they were still paying for it on every application.

For me, the lesson here is not that the model (or software) was in any way faulty – it did exactly what was being asked of it. But there was a lack of attention to detail with regard to (a) the input data, and (b) how the models were going to be used in practice.

Anecdote #2: Scorecard vs. ML models in B2B marketing

I was very recently involved in a marketing project that pitted a scorecard against a random forest model, built by a third-party company (not the same company as in the previous section). The dataset comprised several thousand companies, with some basic data (e.g. location, size, industry classification code), and an outcome indicating whether previous marketing had been successful. For new companies, could we predict whether marketing to them would be worth the cost?

Apart from the different domain, it was a very similar exercise to the above. The main difference was in the lack of ‘per company’ data; for lending models, you typically have access to thousands of variables from a credit file. Instead, we had to get creative, and derive our own variables from the limited amount of information we had.

The third-party had gone through several iterations of models — including neural networks, support vector machines (SVMs) and logistic regression, among others — and settled upon a random forest as giving the best results.

In the end, despite taking more than four times as many person-days to build, when used in production, their random forest model had a slightly lower gini than our scorecard — it wasn’t as good. In my post-project documentation, I put this down to a lack of creativity at the variable creation stage; during our build, we’d managed to generate dozens of new variables from our existing data, a few of which were very predictive.

For full disclosure, I should mention that we subsequently ran both models over a dataset that the models weren’t built on (that is: known companies we wanted to re-approach, not ‘first-time’ marketing), just to see what would happen: the random forest did pretty well; the scorecard was awful! Unfortunately, there was no time allotted to investigate why we saw these results, but it shows that there was a robustness to the ML model that we weren’t expecting.


End notes

I think my central point is fairly obvious, but to state it explicitly: a simple model built by a domain expert will be (probably much) better than a complicated model built by an expert in building complicated models. To illustrate the difference: for consumer lending scorecards I’ve built, the timeline was divided up approximately like this:

Stage                                               Time taken
--------------------------------------------------  ----------
Data gather (including cleaning)                        20%
Variable assessment (including creating new ones)       25%
Scorecard build                                         15%
Checking, validating and documenting the model          40%

But when third-party companies have built models, their timeline was more like this:

Stage                                               Time taken
--------------------------------------------------  ----------
Data gather (including cleaning)                         5%
Variable assessment (including creating new ones)        5%
Model build                                             80%
Checking, validating and documenting the model          10%

Nearly all the time is put into the model build — e.g. trying different algorithms (neural net vs random forest vs logistic regression …), tuning hyper-parameters, etc., in order to get a good classification rate.

But building a good model is about so much more than the overall classification rate, or the algorithm you choose. There’s a huge chunk of work to be done, pre- and post- model build. If that gets ignored, then the selected model could be a poor one. I’m sure the third-party companies were perfectly capable of doing the extra work; but it has to be communicated, costed, and agreed upon.


A final couple of points:


Turning integers into rational factorial expressions

Caution! Maths up ahead!

I have a long-term side project that involves dealing with rational factorial expressions that reduce to (often fairly large) integers, e.g.:

\frac{20! 18! 11!}{15! 12! 9!} = 2735405697024000

Typically, I’ll start with an integer (the result of a sum of combinatorial expressions), and need to turn it back into a purely factorial expression. My initial approach was to create a database of these integers: loop over all the possible permutations of factorials, and store the results. Here’s a snippet of the data:

(17!.17!.9!.7!)/(13!.7!.1!) = 7372584295195056537600000
(17!.17!.9!.7!)/(13!.7!.2!) = 3686292147597528268800000
(17!.17!.9!.7!)/(13!.7!.3!) = 1228764049199176089600000
(17!.17!.9!.7!)/(13!.7!.4!) = 307191012299794022400000
(17!.17!.9!.7!)/(13!.7!.5!) = 61438202459958804480000
(17!.17!.9!.7!)/(13!.7!.6!) = 10239700409993134080000

As you’d expect, this database got huge very quickly; this one particular formulation (a product of four factorials over a product of three factorials, using values up to 20) results in a table of 13 million rows, 643 MB in size. Although it works in principle, it’s not terribly practical.

Now, from our O-level Maths days, we know that all integers can be factorised into powers of primes, e.g.:

2735405697024000 = 2^{14} \times 3^{5} \times 5^{3} \times 7^{1} \times 11^{1} \times 13^{1} \times 17^{2} \times 19^{1}

The statement that every integer greater than 1 is either a prime, or can be written (uniquely, up to ordering) as a product of primes, is known as the Fundamental theorem of arithmetic; the first complete proof is usually credited to Gauss. For the above integer, I calculated the exponents in R using the factorize function from the gmp package, which is used for manipulating big integers and rationals of any size.

table(as.integer(factorize(2735405697024000)))
 2  3  5  7 11 13 17 19 
14  5  3  1  1  1  2  1 

If we factorise our six factorials in the first identity similarly, and tabulate them nicely, we get:

            (power of)
N!      2  3  5  7 11 13 17 19 
------------------------------
20! =  18  8  4  2  1  1  1  1 
18! =  16  8  3  2  1  1  1  0
11! =   8  4  2  1  1  0  0  0
------------------------------
15! =  11  6  3  2  1  1  0  0
12! =  10  5  2  1  1  0  0  0
 9! =   7  4  1  1  0  0  0  0
------------------------------
ans.=  14  5  3  1  1  1  2  1

For each power, the sums of the columns in the top block (the numerator of the LHS of the identity), minus those in the bottom block (denominator), give the power in the answer, e.g. in the 2 column, (18+16+8) – (11+10+7) = 14, which is the power of 2 in the result.
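As a quick sanity check of the column arithmetic, here’s a short sketch in base R; factExp is just a throwaway helper of mine (not from any package) that computes the exponent of a prime p in n! via Legendre’s formula:

factExp <- function(n, p) {
	e <- 0; q <- p
	while (q <= n) { e <- e + floor(n / q); q <- q * p }
	e
}

pm  <- c(2, 3, 5, 7, 11, 13, 17, 19)
num <- sapply(pm, function(p) sum(sapply(c(20, 18, 11), factExp, p = p)))	# numerator exponents
den <- sapply(pm, function(p) sum(sapply(c(15, 12,  9), factExp, p = p)))	# denominator exponents
num - den
[1] 14  5  3  1  1  1  2  1

which matches the exponents of 2735405697024000 given above.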

So if we want to turn an integer into a rational factorial expression, we just need a way of determining which linear combinations of the powers of our factorized factorials will give us the same powers as our target integer. For this, we need to turn our prime exponents into a matrix, and compute the Reduced row echelon form; as ever, there’s an existing R package that contains a function we can use!

Worked example

In order to demonstrate the process, we’ll keep the numbers small. Let’s say we have an integer, 120960, and we want to know if it can be turned into a rational factorial expression — it may not be possible.

120960 = 2^{7} \times 3^{3} \times 5^{1} \times 7^{1}
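We can check these exponents with the same factorize call as before:

table(as.integer(factorize(120960)))
2 3 5 7 
7 3 1 1 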

Here’s our table of factorial factors (which I’ve called factpp), and above it, the code I used to generate it:

require(gmp); # gmp library
maxprime <- 7;
pm <- primes(maxprime); # 2,3,5,7
npm <- length(pm);
factpp <- matrix(rep(0,(maxprime-1)*npm),nrow=(maxprime-1),ncol=npm); # factorial prime powers
for(i in 1:(maxprime-1))
{
	tv <- table(as.integer(factorize(factorial(i+1))));
	for(j in 1:npm) {
		snpm <- sprintf("%d",pm[j]);
		if (snpm %in% names(tv)) {
			factpp[i,j] <- tv[snpm];
		}
	}
}

factpp
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    1    1    0    0
[3,]    3    1    0    0
[4,]    3    1    1    0
[5,]    4    2    1    0
[6,]    4    2    1    1

(Note how the exponents are non-decreasing – a proof of this can be found here: Prime Factors of Factorial Numbers)

My code is pretty horrible (matching by converting ints to strings, yuck), and it can definitely be hugely improved – but I only need it once; when I’ve generated the prime factors for all the factorials I’m interested in, I don’t need to run it again.

We can check that the exponents make sense:

apply(factpp, 1, function(v) { (2^v[1])*(3^v[2])*(5^v[3])*(7^v[4])});
[1]    2    6   24  120  720 5040

These are clearly the values of 2! up to 7!. (We can ignore 1! = 1, we don’t need it.) Because the highest prime factor in 120960 is 7, we don’t need to consider factorials that have prime factors greater than 7.

To compute our reduced row echelon form, we require the function rref in the package pracma (see here for documentation).

require(pracma);
input <- cbind(t(factpp), c(7,3,1,1)); # bind factorial exponents to target exponents 
rref(input);
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    2    0    0    0    2
[2,]    0    1    1    0    1    0    1
[3,]    0    0    0    1    1    0    0
[4,]    0    0    0    0    0    1    1

This matrix is actually giving us a system of simultaneous equations, one equation per row; the values in the first 6 columns are the coefficients of the variables, and the final column on the right is the value on the right-hand side of the equals sign. To interpret this matrix (i.e. the output of rref), this video Interpreting RREF Augmented Matrices by MATHguide is succinct and useful.

With our reduced row echelon form, there are three possibilities:

  • There is exactly one solution
  • There are no solutions
  • There are infinitely many solutions

Because we have 4 equations and 6 variables, we know that we don’t have a single solution. From the above MATHguide video, we can see that the output of rref indicates we have infinitely many solutions – but that’s ok, it’s not going to cause us a problem.

From the output, our simultaneous equations can be written as:

a + 2c = 2
b + c + e = 1
d + e = 0
f = 1

which we can reduce to the following result vector with two variables:

(2-2c, 1-c-e, c, -e, e, 1)

So, for any integer values of c and e, we will get a valid result. The entries of this vector are the powers of 2!, 3!, …, 7! (in that order) in our rational expression. Let’s look at some examples:

 c   e  vector                  resulting factorial expression
--  --  ----------------------  ----------------------------------
 0   0  (2,  1,  0,  0, 0, 1)   2!2!3!7!
 1   0  (0,  0,  1,  0, 0, 1)   4!7!
 0   1  (2,  0,  0, -1, 1, 1)   (2!2!6!7!) / 5!
 1   1  (0, -1,  1, -1, 1, 1)   (4!6!7!) / (3!5!)
-2   2  (6,  1, -2, -2, 2, 1)   (2!2!2!2!2!2!3!6!7!) / (4!4!5!5!)

All of these are valid solutions (they all equal 120960), and we can choose whichever one best fits our needs.
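If you want to convince yourself, here’s a minimal check in R (checkSolution is just a throwaway helper; plain doubles are accurate enough at this size, so no big-integer arithmetic is needed):

checkSolution <- function(cc, ee) {
	pow <- c(2 - 2*cc, 1 - cc - ee, cc, -ee, ee, 1)	# powers of 2!, 3!, ..., 7!
	prod(factorial(2:7) ^ pow)
}
checkSolution(0, 0)	# 120960
checkSolution(1, 1)	# 120960
checkSolution(-2, 2)	# 120960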

(Why do we have more than one solution? Because simple variations of factorial products have the same values, e.g. 2!2!3! = 4! and 6!7! = 10!. If we removed rows from factpp — e.g. we removed the first row, relating to 2! — then we’d get fewer possible results.)


Practical queries using binary operations

One of my ongoing ‘fun’ projects is a Sudoku-solver, written using only SQL. I’m not interested in brute-forcing a solution — there already exists a beautiful (and tiny) piece of SQL to do that: see Using Common Table Expressions to solve Sudoku in Microsoft SQL Server — rather, I’m trying to replicate the way I’d solve a Sudoku with a pen and paper.

Some parts of my code involve representing the combination of possible values of each empty Sudoku cell as a single integer, by using a single bit (‘binary digit’) for each of the values 1 to 9. This makes some of the coding much easier than it otherwise would be. Encoding binary information as integers, and then querying that binary data, isn’t something that’s widely done within SQL, but it’s occasionally the nicest* way of attacking a problem.

* For some value of ‘nice’, please argue amongst yourselves(!)


Very quick refresher: binary

(If you know about binary, you can skip this section.)

A binary representation of a number uses just the two digits zero and one. For a string of binary digits, the nth digit from the right represents the decimal value 2^(n-1) — where in standard mathematical notation, the caret character ‘^’ means ‘to the power of’. Annoyingly, as we’ll see shortly, ‘^’ means something different in SQL. Two to the power of (n-1) in SQL is calculated with the function POWER(2,n-1).

Example: 110 in binary is equivalent to 1 * POWER(2,2) + 1 * POWER(2,1) + 0 * POWER(2, 0) = 6 in decimal. Note that we didn’t need to consider the last part, 0 * POWER(2, 0), as zero times anything is zero. Accordingly, 1000001 is POWER(2,6) + POWER(2,0) = 64 + 1 = 65.

To manipulate binary numbers, there are standard ‘logical operators’ AND, OR, EOR that take two binary numbers as input and return a binary number as output; see Bitwise operation at Wikipedia for more detail. The operator NOT takes one input and simply ‘flips’ the bits such that 0 becomes 1, and 1 becomes 0. We don’t have to explicitly convert our numbers to binary to use these operators, as integers are already stored as binary numbers in memory.
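In SQL Server, these operators are written as ‘&’ (AND), ‘|’ (OR), ‘^’ (EOR) and ‘~’ (NOT). As a quick sketch, using 6 (binary 110) and 3 (binary 011):

SELECT
		[AND]	= 6 & 3		-- 110 AND 011 = 010
		,[OR]	= 6 | 3		-- 110 OR  011 = 111
		,[EOR]	= 6 ^ 3		-- 110 EOR 011 = 101
GO
AND         OR          EOR
----------- ----------- -----------
2           7           5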

If you want to convert between binary (aka ‘base 2’) and decimal (‘base 10’), you can use the functions from my previous post, Encrypting IDs using a custom alphabet:

SELECT dbo.fn_BaseEncrypt(65, '01')
1000001
SELECT dbo.fn_BaseDecrypt('1000001', '01')
65


(Non-Sudoku) Example

We have 7 employees (identified by EmployeeID) that work various days of the week. We represent the data in two different datasets: in the first, the data is normalised, and each row contains the EmployeeID, and a Day value to represent which day that employee works: 1 = Sunday, 2 = Monday, …, 7 = Saturday; one row per day. In the second dataset, there’s only one row per employee, and a single integer indicates the combination of days: the 1st (right-most) bit is ‘1’ if they work Sunday, ‘0’ if not; the 2nd bit is ‘1’ if they work Monday, ‘0’ if not; etc, up to the 7th bit representing Saturday. This integer can take all values from 0 (no days worked) to 127 (all days worked).

Using our datasets, we want to answer some straightforward questions:

  1. Which employees work on Wednesday?
  2. Which employees work on Monday and Wednesday?
  3. Which employees work on Wednesday but not Monday?
  4. Which employees work opposite days to each other, such that between them they work the whole week, but neither works on the same day?
  5. Which employees work 5 or more days in a week?

The way in which these questions are answered using the two different datasets (normalised and binary) will demonstrate how using binary data can sometimes lead to simpler code.


Normalised data

Here’s our normalised data, one record per EmployeeID / Day:

DROP TABLE IF EXISTS #DaysWorked
GO

CREATE TABLE #DaysWorked
(
	EmployeeID INT NOT NULL
	,[Day] TINYINT NOT NULL
	,PRIMARY KEY CLUSTERED (EmployeeID, [Day])
)
GO

INSERT #DaysWorked(EmployeeID, [Day])
	VALUES
	 (10,2),(10,3),(10,4),(10,5),(10,6)		-- Mon to Fri
	,(20,2),(20,3),(20,4),(20,5)			-- Mon to Thu
	,(30,6),(30,7),(30,1)				-- Fri, Sat, Sun
	,(40,3),(40,4),(40,5)				-- Tue, Wed, Thu
	,(50,2),(50,3),(50,4),(50,5),(50,6),(50,7)	-- Mon to Sat
	,(60,3),(60,4),(60,5),(60,6),(60,7)		-- Tue to Sat
	,(70,1)						-- Sun only
GO

Binary data

Here’s the same data, but using a single integer per employee, DayBits, to represent the collection of days worked:

DROP TABLE IF EXISTS #DaysWorkedBinary
GO

CREATE TABLE #DaysWorkedBinary
(
	EmployeeID INT NOT NULL
	,DayBits TINYINT NOT NULL
	,PRIMARY KEY CLUSTERED (EmployeeID)
	,INDEX IX_DaysWorkedBinary_DayBits (DayBits)
)
GO

INSERT #DaysWorkedBinary(EmployeeID, DayBits)
	VALUES		-- SFTWTMS
	 (10, 62)	-- 0111110
	,(20, 30)	-- 0011110
	,(30, 97)	-- 1100001
	,(40, 28)	-- 0011100
	,(50,126)	-- 1111110
	,(60,124)	-- 1111100
	,(70,  1)	-- 0000001
GO

As you can see from the comments alongside each row, each day corresponds to a 0/1 bit, where the values of those bits are the powers of 2: from the right, 1, 2, 4, 8, 16, 32 and 64.

For the sake of completeness — just how did we calculate the values of the DayBits field? Like this:

SELECT
		d.EmployeeID
		,DayBits = SUM(POWER(2, d.[Day]-1))
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	ORDER BY d.EmployeeID
GO

To go back the other way, turning the binary data into normalised, one row per employee/day:

SELECT
		d.EmployeeID
		,[Day] = v.n
	FROM #DaysWorkedBinary d
	JOIN (
		VALUES(1),(2),(3),(4),(5),(6),(7)
	) v(n)
		ON d.DayBits & POWER(2, v.n-1) > 0 
	ORDER BY d.EmployeeID, [Day]
GO

So, we have our two datasets (normalised data: #DaysWorked, and binary data: #DaysWorkedBinary), and we can start to answer our five questions.


Questions 1, 2 and 3

Normalised version

Answering question 1, “Which employees work on Wednesday?” is trivial:

SELECT EmployeeID
	FROM #DaysWorked
	WHERE [Day] = 4
	ORDER BY EmployeeID
GO
EmployeeID
-----------
10
20
40
50
60

For question 2, “Which employees work on Monday and Wednesday?”, we need to work a little harder:

-- 2a
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	JOIN #DaysWorked d2
		ON d2.EmployeeID = d1.EmployeeID
	WHERE d1.[Day] = 4 
	AND d2.[Day] = 2
	ORDER BY d1.EmployeeID
GO
EmployeeID
-----------
10
20
50

We can write this query in other ways:

-- 2b
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	WHERE d1.[Day] = 4 
	AND d1.EmployeeID IN (SELECT d2.EmployeeID FROM #DaysWorked d2 WHERE d2.[Day] = 2)
	ORDER BY d1.EmployeeID
-- 2c
SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING SUM(IIF(d.[Day] IN (2,4),1,0)) = 2
	ORDER BY d.EmployeeID
-- 2d
SELECT EmployeeID
	FROM #DaysWorked
	PIVOT ( COUNT([Day]) FOR [Day] IN ([2],[4])) p
	WHERE [2] = 1 AND [4] = 1
	ORDER BY EmployeeID

All four queries return the same result set. Note that a and b have identical execution plans; the plans for c and d aren’t identical, but they’re very similar to each other.

Question 3, “Which employees work on Wednesday but not Monday?”, is similar in construction to question 2:

-- 3a
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	LEFT JOIN #DaysWorked d2
		ON d2.EmployeeID = d1.EmployeeID
		AND d2.[Day] = 2
	WHERE d1.[Day] = 4 
	AND d2.EmployeeID IS NULL
	ORDER BY d1.EmployeeID
-- 3b
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	WHERE d1.[Day] = 4 
	AND d1.EmployeeID NOT IN (SELECT d2.EmployeeID FROM #DaysWorked d2 WHERE d2.[Day] = 2)
	ORDER BY d1.EmployeeID
-- 3c
SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING SUM(IIF(d.[Day]=4,1,0)) = 1
	AND SUM(IIF(d.[Day]=2,1,0)) = 0
	ORDER BY d.EmployeeID
-- 3d
SELECT EmployeeID
	FROM #DaysWorked
	PIVOT ( COUNT([Day]) FOR [Day] IN ([2],[4])) p
	WHERE [2] = 0 AND [4] = 1
	ORDER BY EmployeeID

Here, the execution plans for a and b are very similar, but not identical.

Binary version

Now we’ll answer the same questions, but using our binary dataset.

Question 1:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 8 = 8
	ORDER BY d.EmployeeID
GO
EmployeeID
-----------
10
20
40
50
60

The same results as before; the only real structural difference is in the WHERE clause. We’re using ‘&’ (AND) to pull out the records that match. The 8 is the 4th bit from the right (1,2,4,8), i.e. Wednesday. Our query returns those records for which the DayBits integer has the 4th bit ‘set’ (or ‘switched on’).
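If you’d rather not hard-code the 8, you can derive the mask from the day number instead; a minimal variation (where @Day is just an illustrative variable):

DECLARE @Day TINYINT = 4	-- Wednesday

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & POWER(2, @Day - 1) = POWER(2, @Day - 1)
	ORDER BY d.EmployeeID
GO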

Question 2:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 10 = 10
	ORDER BY d.EmployeeID

Same premise as before, but we’re checking whether two bits (the 2nd and 4th, corresponding to Monday and Wednesday) are set.

Question 3:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 10 = 8 
	ORDER BY d.EmployeeID

Similar structure, but here we only want those records where the 4th bit is set (equal to 1), and the 2nd is unset (equal to zero).
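This pattern generalises to any ‘must work these days, must not work those days’ query; here’s a sketch using two hypothetical mask variables:

DECLARE @MustWork TINYINT = 8, @MustNotWork TINYINT = 2	-- Wednesday: yes, Monday: no

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & @MustWork = @MustWork
	AND d.DayBits & @MustNotWork = 0
	ORDER BY d.EmployeeID
GO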


Question 4

This is a trickier question: “Which pairs of employees work opposite days to each other, such that between them they work the whole week, but neither works on the same day?”.

Here’s my solution:

Normalised version

;WITH cte_CrossJoin AS
(
	SELECT
			EmployeeID1	= d1.EmployeeID 
			,EmployeeID2	= d2.EmployeeID
			,Emp1Days	= COUNT(DISTINCT d1.[Day])
			,Emp2Days	= COUNT(DISTINCT d2.[Day])
			,[Matched]	= SUM(IIF(d2.[Day] = d1.[Day], 1, 0))
		FROM #DaysWorked d1
		CROSS JOIN #DaysWorked d2
		WHERE d2.EmployeeID > d1.EmployeeID
		GROUP BY d1.EmployeeID , d2.EmployeeID
)
	SELECT EmployeeID1, EmployeeID2
		FROM cte_CrossJoin
		WHERE Emp1Days + Emp2Days = 7
		AND [Matched] = 0
GO
EmployeeID1 EmployeeID2
----------- -----------
20          30
50          70

(I have to admit, I went through several different attempts — this is number 4 or 5. It’s reasonably compact, but due to the CROSS JOIN, it’s more ‘brute-force’ than I’d like. If you have a more refined solution, please let me know!)

Binary version

The binary version is simpler:

SELECT
		Employee1 = d1.EmployeeID
		,Employee2 = d2.EmployeeID
	FROM #DaysWorkedBinary d1
	JOIN #DaysWorkedBinary d2
		ON d2.EmployeeID > d1.EmployeeID
	WHERE d1.DayBits ^ d2.DayBits = 127
GO

In SQL, the ‘^’ (caret) character means ‘EOR’, ‘exclusive OR’ (sometimes seen as ‘XOR’). Requiring the EOR of the two DayBits values to equal 127 (binary 1111111) means that every one of the seven day-bits is set in exactly one of the two values: between them, the pair covers the whole week, with no day worked by both.
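For the two pairs returned, you can verify the EOR by hand:

SELECT
		[20 vs 30]	= 30 ^ 97	-- 0011110 EOR 1100001 = 1111111
		,[50 vs 70]	= 126 ^ 1	-- 1111110 EOR 0000001 = 1111111
GO
20 vs 30    50 vs 70
----------- -----------
127         127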


Question 5

The final question is “Which employees work 5 or more days in a week?”. This is trivial using the normalised data; and given that, so far, all the queries using the binary data are smaller, you’d expect that to be the case here too.

Normalised version

Very straightforward, we use a HAVING clause to restrict the number of records (in our case, employee/days) to greater than or equal to 5:

SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING COUNT(1) >= 5
	ORDER BY d.EmployeeID
GO
EmployeeID
-----------
10
50
60

Binary version

Sadly, the binary version isn’t particularly neat. This is one of the limitations of using binary data: there’s no native function to count how many bits are ‘on’ (set) in a given integer. (If there was, it would again be trivial.) We have to do the counting ourselves; first, a longhand version:

;WITH cte_BitCount AS
(
	SELECT
			EmployeeID
			,BitCount =	  IIF(DayBits&1>0,1,0)  + IIF(DayBits&2>0,1,0)
					+ IIF(DayBits&4>0,1,0)  + IIF(DayBits&8>0,1,0)
					+ IIF(DayBits&16>0,1,0) + IIF(DayBits&32>0,1,0)
					+ IIF(DayBits&64>0,1,0) 
		FROM #DaysWorkedBinary
)
SELECT EmployeeID
	FROM cte_BitCount
	WHERE BitCount >= 5
	ORDER BY EmployeeID

GO

And here’s a more extensible way of doing the same calculation:

;WITH cte_BitCount AS
(
	SELECT
			d.EmployeeID
			,BitCount = SUM(SIGN(d.DayBits & POWER(2,v.n-1)))
		FROM #DaysWorkedBinary d
		CROSS APPLY (
			VALUES(1),(2),(3),(4),(5),(6),(7)
		) v(n)
		GROUP BY d.EmployeeID
)
SELECT EmployeeID
	FROM cte_BitCount
	WHERE BitCount >= 5
	ORDER BY EmployeeID
GO	

but it’s still not as concise as the query over the normalised data.


And finally

I hope I’ve demonstrated that there are times when using binary / bit-wise logic makes your code more elegant and compact. It’s definitely useful in cases where there are fixed numbers of discrete ‘flags’ — units of time like days and months being good examples, as they’re never going to change — but you have to be careful not to implement these patterns without some consideration. For one, they may not be particularly friendly to colleagues who don’t have a maths/CS background. You can always swap between normalised/binary ‘on the fly’ (as per the queries near the top that convert one dataset into the other) if you need to.

There’s some more to be written about aggregating over binary data, as there are no in-built SQL functions for this; hopefully that’ll be a future post.


Encrypting IDs using a custom alphabet

Let’s say you want to track link clicks in a marketing email you’re sending out. The best way to track these is to put a GUID in each link, so every click is uniquely identifiable. In order for this to work, you’ll need to keep a record of all the GUIDs, stored alongside the destination email address (and/or customer ID). But if you’re sending many emails to millions of customers, that’s a lot of GUIDs to store.

If you’re not too worried about people fiddling with the links, you could encrypt an ID (e.g. CustomerID), and put that in the link instead of a GUID. That means you wouldn’t have to keep any record in your database; you could just decrypt the ID string the next time you saw it. So let’s look at a way in which we can encrypt IDs… but, just to be clear, this is quite weak encryption, you shouldn’t use it for anything important. Only consider it if the worst thing that can happen as a result of someone cracking your code is mild inconvenience!

One way to encrypt an integer ID is to change its base, and use a randomised alphabet. Here’s the function I’m going to use:

CREATE OR ALTER FUNCTION dbo.fn_BaseEncrypt
(
	@base10 INT
	,@alphabet VARCHAR(255)
)
RETURNS VARCHAR(255)
AS
BEGIN

	DECLARE @newBaseLen INT = LEN(@alphabet)

	DECLARE @remaining INT, @result VARCHAR(255) = ''

	SET @remaining = @base10

	WHILE (@remaining > 0)
	BEGIN
		SET @result = SUBSTRING(@alphabet, 1 + (@remaining % @newBaseLen), 1) + @result
		SET @remaining = @remaining / @newBaseLen
	END

	RETURN(@result)

END
GO

(NB: None of the code presented here is bullet-proof, production-ready code. For clarity, I’ve left out things like checking the input parameters.)

It’s reasonably simple: there’s a loop that repeatedly takes the modulo (that’s the ‘%’ operator) of the starting number and divides it by the new base length, prepending the corresponding character from our alphabet each time.

Some simple examples:

SELECT
		[Description]
		,Alphabet
		,[Value]
		,Result = dbo.fn_BaseEncrypt([Value], Alphabet)
	FROM 
	(
		VALUES
			('Binary', '01', 252)
			,('Trinary', '012', 253)
			,('Octal', '01234567', 254)
			,('Base 9', '012345678', 1234567)
			,('Hexadecimal', '0123456789ABCDEF', 98765432)
	) x([Description], Alphabet, [Value])
GO
Description Alphabet         Value       Result
----------- ---------------- ----------- ---------
Binary      01               252         11111100
Trinary     012              253         100101
Octal       01234567         254         376
Base 9      012345678        1234567     2281451
Hexadecimal 0123456789ABCDEF 98765432    5E30A78

Don’t take my word for it that these values are correct! For the hexadecimal one, you can type '98765432 in hex' straight into your address bar in Google Chrome. For the others, there are sites like ExtraConversion and RapidTables.

Here’s the accompanying decryption function:

CREATE OR ALTER FUNCTION dbo.fn_BaseDecrypt
(
	@strValue VARCHAR(255)
	,@alphabet VARCHAR(255)
)
RETURNS INT
AS
BEGIN
	DECLARE @newBaseLen INT = LEN(@alphabet)
	
	DECLARE @result INT = 0, @m INT = 1, @i INT, @l INT = LEN(@strValue)
	DECLARE @revStrValue VARCHAR(255) = REVERSE(@strValue)
	SET @i = @l
	WHILE(@i > 0)
	BEGIN
		SET @result = @result + @m * (CHARINDEX(SUBSTRING(@revstrValue, @l-@i+1, 1) , @alphabet) - 1)
		SET @m = @m * @newBaseLen
		SET @i = @i - 1
	END
	RETURN(@result)
END
GO

Now, clearly anyone can ‘decrypt’ our example strings, because the ‘alphabet’ is obvious in each case. But what if we choose a completely different set of characters, like A-Z and 0-9, randomised?

-- replace the VALUES(...):
VALUES
	 ('Random #1', '1L3V4U97EIZHKMPO0F6SJDTARGW8BC5XQN2Y', 123456789)
	,('Random #2', 'C9VGTJ34ZXHMKEWUA28DNP6FYRB', 123456789)
Description Alphabet                             Value       Result
----------- ------------------------------------ ----------- ------
Random #1   1L3V4U97EIZHKMPO0F6SJDTARGW8BC5XQN2Y 123456789   3L6VXI
Random #2   C9VGTJ34ZXHMKEWUA28DNP6FYRB          123456789   ZAZ3F8

For ‘Random #2’, I’ve removed the characters 0,O,I,1,7,L,5,S and Q. You might want to do this if the resulting string is likely to be read back to you over the phone.

Aside: quick way of generating a randomised alphabet, using R:

paste(sample(c(LETTERS, 0:9)),collapse="")

So, if someone was to see a string of characters like ‘3L6VXI’, it would be meaningless to them; they don’t have the original alphabet in order to decode it.

However, keeping with the ‘Random #2’ alphabet, what if we look at encryptions of consecutive values:

Value       Result
----------- ------
10001       EDM
10002       EDK
10003       EDE
10004       EDW
10005       EDU
10006       EDA
10007       ED2
10008       ED8
10009       EDD
10010       EDN
10011       EDP
10012       ED6
10013       EDF
10014       EDY
10015       EDR
10016       EDB
10017       ENC
10018       EN9
10019       ENV
10020       ENG

Given enough examples of an encrypted ID, any competent ‘hacker’ is going to be able to deduce the generating alphabet. Instead of using sequential IDs, you could use some easily reversible function: 11 * {ID} + 13, for example. But we can get more creative than that! Here are some ways you could further obfuscate the string:

  • Reverse it
  • Rotate it by n characters to the left/right
  • Re-order the string entirely (with a fixed re-ordering)
  • Add spacing characters, e.g. a hyphen, at meaningless positions
  • Add random characters at pre-defined positions
  • Add one or more check digits – also consider SQL’s CHECKSUM() function.
  • Use more than one encryption/decryption alphabet – maybe part of the string indicates which alphabet to use?
  • Concatenate / interleave multiple encrypted strings – you might have one string for CustomerID, one that identifies the particular email template, and one for the link location within the email

Of course, you MUST check that your end-to-end encryption/decryption process works on all inputs you’re likely to put through it! Given that your PC can probably work through millions of encrypt/decrypt cycles per second, there’s no reason not to check that your functions will work without error in the future.
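As a sketch of what such a test might look like, here’s a query that round-trips the first million IDs through the two functions, using the ‘Random #2’ alphabet from above, and returns any ID that doesn’t survive the trip (you’d want zero rows back):

DECLARE @alphabet VARCHAR(255) = 'C9VGTJ34ZXHMKEWUA28DNP6FYRB'	-- the 'Random #2' alphabet

;WITH cte_Numbers AS
(
	SELECT n = 1
	UNION ALL
	SELECT n + 1 FROM cte_Numbers WHERE n < 1000000
)
SELECT n
	FROM cte_Numbers
	WHERE dbo.fn_BaseDecrypt(dbo.fn_BaseEncrypt(n, @alphabet), @alphabet) <> n
	OPTION (MAXRECURSION 0)
GO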


Interest and APR: a short explainer

I recently needed to check the monthly payment amount for a simple loan; choosing four online loan calculators at random, I got two slightly different answers. In this PDF document Interest rates & APR (v0.1), I’ve detailed the differences (spoiler: it’s due to APR) and shown how the figures from the online calculators were arrived at. As ever, I hope someone finds it useful!

A plea: If you know of any book that goes into the various types of loan scheme in some depth, in a fairly formal and systematic way, please do let me know; it’s an area I find fascinating, yet there doesn’t seem to be a single definitive online resource with quite enough depth.

Finally, I must sing the praises of Wolfram Alpha; it’s been invaluable for my work for the past couple of years. Between them, Wolfram Alpha, wxMaxima and R have been able to solve any equation I’ve had cause to throw at them!
