Searching a database for a specific value

I never thought I’d ever need to write this code, but seeing as I did, I may as well share it! I hope you never have to resort to using it…

Recently, I’ve been working on a project that involved Microsoft SQL Server databases from a couple of third-party vendors (one of whom is well-known in the industries I work in). Somewhat shockingly, neither of these vendors provides any documentation for their databases — we were given database backups, and had to work out the structure and ‘what column names represent’ for ourselves. One of the databases wasn’t too bad: the table and column names made some sense; the other was highly normalised (over-normalised, in my opinion) and many of the objects were cryptically named.

Both databases contained data that we (as a company) had submitted, so we knew that certain values had to be present in the databases somewhere. The databases were too large to just run SELECT * on all the tables and manually look for our data, so I wrote SQL code to find the values we were looking for. In this way, we managed to piece together the queries we required. (Having documentation would’ve been preferable, but needs must…)

So, here’s the code I wrote. It’s not a stored procedure; you just paste it into a query window, put in your requirements at the top, then run it.

I’ll demonstrate using Microsoft’s AdventureWorks database (AdventureWorks2017 on my PC).

IMPORTANT CAVEATS:

  1. Although this code has run very quickly every time I’ve used it, I was running it on an unimportant server, where more often than not, I was the only user. Be careful not to bring your server down! If in doubt, talk to your DBA.
  2. The code wasn’t written to be particularly friendly, so if you have specific requirements, you’ll have to tweak it.
  3. There’s nothing clever about this code; it just brute-forces the problem.

The code

-- You have to give values for the following 3 variables:
DECLARE @TypesOfInterest NVARCHAR(MAX) = 'int,smallint,bigint,tinyint'
DECLARE @ValueOfInterest NVARCHAR(MAX) = '19237'
DECLARE @StopWhenFound BIT = 0

-- ****************************************************************************
-- PART ONE: Get the tables/columns that match our datatypes of interest:
-- ****************************************************************************
DROP TABLE IF EXISTS #MatchingCols

;WITH cte_AllColumns AS
(
    SELECT
            TableName        = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']'
            ,ColumnName      = '[' + c.COLUMN_NAME + ']'
            ,ColSortOrder    = c.ORDINAL_POSITION
            ,PK              = IIF(pk.COLUMN_NAME IS NOT NULL, 1, 0)
            ,MatchingType    = IIF(x.DATA_TYPE IS NOT NULL, 1, 0)
        FROM INFORMATION_SCHEMA.COLUMNS c
        JOIN INFORMATION_SCHEMA.TABLES t
            ON t.TABLE_CATALOG = c.TABLE_CATALOG
            AND t.TABLE_SCHEMA = c.TABLE_SCHEMA
            AND t.TABLE_NAME = c.TABLE_NAME
            AND t.TABLE_TYPE = 'BASE TABLE' -- Note: i.e. not views, just tables
        LEFT JOIN (
            SELECT DATA_TYPE = s.[value]
                FROM STRING_SPLIT(@TypesOfInterest, ',') s
        ) x
            ON x.DATA_TYPE = c.DATA_TYPE
        LEFT JOIN (
            SELECT
                    ku.TABLE_CATALOG
                    ,ku.TABLE_SCHEMA
                    ,ku.TABLE_NAME
                    ,ku.COLUMN_NAME
                FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
                JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS ku
                    ON ku.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
                WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY' 
        ) pk
            ON pk.TABLE_CATALOG  = c.TABLE_CATALOG
            AND pk.TABLE_SCHEMA  = c.TABLE_SCHEMA
            AND pk.TABLE_NAME    = c.TABLE_NAME
            AND pk.COLUMN_NAME   = c.COLUMN_NAME

), cte_StringAgg AS
(
    SELECT 
            TableName
            ,SELECT_Cols = STRING_AGG(IIF((PK = 1) OR (MatchingType = 1 AND PK = 0), ColumnName, NULL),',')
                                WITHIN GROUP (ORDER BY ColSortOrder)
            ,WHERE_Cols  = STRING_AGG(IIF(MatchingType = 1, ColumnName + ' = ##', NULL), ' OR ')
                                WITHIN GROUP (ORDER BY ColSortOrder)
        FROM cte_AllColumns
        GROUP BY TableName
)
    SELECT
            ID = ROW_NUMBER() OVER (ORDER BY TableName)
            ,TableName
            ,SELECT_Cols
            ,WHERE_Cols
        INTO #MatchingCols
        FROM cte_StringAgg
        WHERE WHERE_Cols IS NOT NULL

-- ****************************************************************************
-- PART TWO: Loop through the tables:
-- ****************************************************************************

DROP TABLE IF EXISTS ##Results
-- Note: Global temp table so I can look at the results
-- as they're inserted, in a different query window.

CREATE TABLE ##Results
(
    TableName NVARCHAR(256) NOT NULL PRIMARY KEY CLUSTERED
    ,[Query] NVARCHAR(MAX) NOT NULL
    ,CreatedOn DATETIME NOT NULL DEFAULT(GETDATE())
)

DECLARE @id INT = 1, @maxid INT, @SELECT_sql NVARCHAR(MAX), @INSERT_sql NVARCHAR(MAX), @TableName NVARCHAR(256), @rc INT
SELECT @maxid = MAX(id) FROM #MatchingCols
WHILE (@id <= @maxid)
BEGIN

    SET @TableName  = NULL
    SET @SELECT_sql = NULL
    SET @INSERT_sql = NULL
    SET @rc         = NULL

    SELECT
            @TableName    = m.TableName
            ,@SELECT_sql  = N'SELECT TOP 1 ' + m.SELECT_Cols + N' FROM ' + m.TableName + N' WHERE (' + REPLACE(m.WHERE_Cols, '##', @ValueOfInterest) + N')'
        FROM #MatchingCols m
        WHERE id = @id
    PRINT @SELECT_sql

    SELECT @INSERT_sql = N'IF EXISTS(' + @SELECT_sql + N') INSERT ##Results(TableName, [Query]) VALUES(''' + @TableName + N''', ''' + @SELECT_sql + N''')'
    EXEC sp_executesql @stmt = @INSERT_sql
    SET @rc = @@ROWCOUNT

    IF @StopWhenFound = 1
    BEGIN
        IF @rc > 0 BREAK
    END

    SET @id = @id + 1
END
GO
-- And finally, the results:
SELECT * FROM ##Results
GO

(NB: The part of the query that indicates which columns are involved in the primary key, I took from this StackOverflow post.)

The three pieces of information you have to supply (as seen at the top of the code) are:

  • @TypesOfInterest : A comma-separated list of valid SQL Server datatypes (as found in the system view sys.types)
  • @ValueOfInterest : The piece of information you’re looking for. If you’re searching for a string, you’ll have to wrap it in single quotes inside the string, e.g.
    DECLARE @ValueOfInterest NVARCHAR(MAX) = '''Dobney''' (three single quotes each side of the word/phrase; see the example after this list)
  • @StopWhenFound : If 1, the code stops when the value is found for the first time. If 0, all tables are searched.
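
For example, a string search for the surname mentioned above might be set up like this (a sketch only; adjust the list of types to suit the columns you expect the value to be stored in):

DECLARE @TypesOfInterest NVARCHAR(MAX) = 'char,varchar,nchar,nvarchar'
DECLARE @ValueOfInterest NVARCHAR(MAX) = '''Dobney'''	-- i.e. the string 'Dobney', quotes included
DECLARE @StopWhenFound BIT = 1	-- stop at the first table that contains the value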

When I run this in the AdventureWorks2017 database, I get a list of candidate queries in ##Results, one row for each table where the value was found.

If I select the ‘Query’ field in the 8th row and copy it out to another query window (removing the TOP 1):

SELECT [TransactionID],[ProductID],[ReferenceOrderID],[ReferenceOrderLineID],[Quantity]
FROM [Production].[TransactionHistoryArchive]
WHERE ([TransactionID] = 19237 OR [ProductID] = 19237 OR [ReferenceOrderID] = 19237
	OR [ReferenceOrderLineID] = 19237 OR [Quantity] = 19237)

and then run it, we find a couple of instances of our search value, 19237.

Suggested improvements

This is supposed to be a piece of code of limited use, so I’m not planning to spend any more time making it more user-friendly. However, if I was going to do so:

  1. I’d record the primary key values in the SELECT statement (the Query column in the table ##Results), so SQL Server could go straight to the record, rather than (potentially) doing another table scan — but it doesn’t really matter, as the query results will probably be cached anyway.
  2. I’d generate a field that shows exactly which columns matched the value we’re looking for.
  3. Maybe the input could be friendlier, automatically adding quotes to string types.

Anyway, hopefully this code will be useful to someone — do let me know if you’ve been driven to use it!


CREATE the definition of a temp table (in SQL)

At time of writing, the main MS SQL database I work with day-to-day has hundreds of tables, and many of those tables have hundreds of columns. Alongside these tables are hundreds of views, sprocs, and assorted functions. Writing code therefore has this extra overhead — working out where data comes from, and how it’s modified — that has to be factored in when it comes to estimating how long pieces of work will take.

Hence, I find that I’m always ‘writing code to write code’ – sometimes dynamic SQL, sometimes pasting column names in and out of Excel, but often just writing ‘helper’ queries that return recordsets that contain, e.g. lists of columns, that I paste back into my SQL query window. Below, I’ll demonstrate the kind of code I use to create table definitions from existing temp tables.

Why do we need this? It’s useful because you can do SELECT {columns} INTO #MyTempTable FROM {tables} without having to specify any datatypes – the types of the columns in the temporary table #MyTempTable are derived behind the scenes. Using the code below, I can turn my temp tables into real ones.

To begin with, we’ll dummy up a temp table, #MyTempTable, full of random data:

DROP TABLE IF EXISTS #MyTempTable
GO

;WITH cte_Random AS
(
	SELECT
			ID = [Number]
			,Var_Int08 = CAST(CAST(NEWID() AS BINARY(16)) AS TINYINT) 
			,Var_Int16 = CAST(CAST(NEWID() AS BINARY(16)) AS SMALLINT) 
			,Var_Int32 = CAST(CAST(NEWID() AS BINARY(16)) AS INT) 
			,Var_Int64 = CAST(CAST(NEWID() AS BINARY(16)) AS BIGINT) 
	FROM (
	SELECT TOP 1000
		[Number] = CAST(ROW_NUMBER() OVER (ORDER BY object_id, column_id) AS INT)
		FROM sys.columns
	) x
)
SELECT
	*
	,Var_Float24		= CAST(1.0 * Var_Int32 / Var_Int16 AS FLOAT(24))
	,Var_Float53		= CAST(1.0 * Var_Int64 / Var_Int32 AS FLOAT(53))
	,Var_Decimal1801	= CAST(1.0 * Var_Int32 / Var_Int16 AS DECIMAL(18,1))
	,Var_Decimal3810	= CAST(1.0 * Var_Int64 / Var_Int32 AS DECIMAL(38,10))
	,Var_Date		= CAST(DATEADD(day, Var_Int16, '2021Mar09') AS DATE)
	,Var_DateTime		= DATEADD(s,  Var_Int32, CAST('2021Mar09' AS DATETIME))
	,Var_DateTime2		= DATEADD(ns, Var_Int32, CAST('2021Mar09' AS DATETIME2))
	,Var_Char10		= CAST(CAST(NEWID() AS BINARY(16)) AS CHAR(10))
	,Var_Varchar20		= CAST(CAST(NEWID() AS BINARY(16)) AS VARCHAR(20))
	,Var_Nvarchar40		= CAST(CAST(NEWID() AS BINARY(16)) AS NVARCHAR(40))
	,Var_SmallMoney 	= CAST(Var_Int32 / 10000.0 AS SMALLMONEY)
	,Var_Money 		= CAST(Var_Int64 / 10000.0 AS MONEY)
INTO #MyTempTable
FROM cte_Random
GO
-- Let's say we require the ID column to be non-NULL:
ALTER TABLE #MyTempTable ALTER COLUMN ID INT NOT NULL
GO

I’ve included several of the different datatypes, but not all of them; there are quite a few recognised by MS SQL (you can read about them here: Data types (Transact-SQL) and An overview of SQL Server data types), but these ones above are the data types I deal with on a typical day.

(NB: It works exactly the same for global temp tables, so we could’ve called our table ##MyTempTable instead.)

Let’s look at some of the data:

SELECT TOP 10 * FROM #MyTempTable

ID   Var_Int08 Var_Int16 Var_Int32   Var_Int64             Var_Float24
---- --------- --------- ----------- --------------------- ------------
1    14        -30998    658979450   -9167272261900257581  -21258.8
2    137       -27919    -341309392  -6857223736286779664  12225
3    133       -30579    -1481699359 -4780055374266335512  48454.8
4    84        26124     2985519     -7160508429147909805  114.283
5    61        14039     842379332   -7969292711582563318  60002.8
6    89        7825      -32357595   -7605166744987937108  -4135.16
7    153       29284     1888870407  -8917976039370151771  64501.8
8    40        -31383    -1906986840 -9210204572955484902  60765
9    149       -7316     -1955688049 -8633019267239556802  267317
10   202       -12795    901983213   -7712186477348472978  -70495

(I’ve only shown the first few columns)

Looks ok, random enough… apart from Var_Int64, which is always negative? The contents of the table don’t matter at all for our current purposes, but we’ll come back to this at the end, because it’s quite interesting!

First, we need to know how MS SQL refers to our temp table. You may know that tables in the tempdb database (where temp tables live) have slightly modified names:

SELECT object_id, name FROM tempdb.sys.objects WHERE name LIKE N'#MyTempTable%'

object_id	name
------------    -------------------------------------------------------------
-1514606060	#MyTempTable____{....lots of underscores...}_____000000000003

The object_id and name will be different on your server. The reason for the underscores / digits after #MyTempTable is that more than one user can create a temp table with the same name, so under the hood, SQL Server gives the table a unique name to keep track of it. However, we don’t need to know this; as long as we stay in the same session, we can keep referring to the table as #MyTempTable.

ANNOYING NOTE: I use Azure SQL Server, which drops temp tables after a few minutes if the session hasn’t seen any activity. I’ve yet to determine whether the amount of time is something our administrators can control (they don’t think so). I’ve added this to my ever-increasing list of ‘Things About Azure I Don’t Like’!

Anyway, we know where our table lives, so we can use the standard view INFORMATION_SCHEMA.COLUMNS to show us the column definitions. We’ll pull out the information we need to be able to create the SQL datatype fragments for each column, and store it all in a temp table:

DROP TABLE IF EXISTS #ColDefs
GO

DECLARE @MyTableName NVARCHAR(128) = '#MyTempTable'

;WITH cte_Columns AS
(
	SELECT
		COLUMN_NAME
		,ORDINAL_POSITION
		,IS_NULLABLE
		,DATA_TYPE
		,CHARACTER_MAXIMUM_LENGTH
		,NUMERIC_PRECISION
		,NUMERIC_SCALE
	FROM tempdb.INFORMATION_SCHEMA.COLUMNS c
	WHERE c.TABLE_NAME = (
		SELECT [name]
			FROM tempdb.sys.objects
			WHERE OBJECT_ID = OBJECT_ID('tempdb.dbo.' + @MyTableName)
	)
)
SELECT 
	ColName = '[' + COLUMN_NAME + ']'
	,ColType =
		CASE WHEN DATA_TYPE = 'real' THEN 'FLOAT' ELSE UPPER(DATA_TYPE) END
		+ CASE
		WHEN DATA_TYPE IN ('char','varchar','nvarchar') THEN '(' + CAST(CHARACTER_MAXIMUM_LENGTH AS VARCHAR) + ')'
		WHEN DATA_TYPE IN ('decimal','numeric') THEN '(' + CAST(NUMERIC_PRECISION AS VARCHAR) + ',' + CAST(NUMERIC_SCALE AS VARCHAR) + ')'
		WHEN DATA_TYPE IN ('float','real') THEN '(' + CAST(NUMERIC_PRECISION AS VARCHAR) + ')'
		ELSE '' END
	,IsNullable = IIF(IS_NULLABLE = 'YES', 1, 0)
	,SortOrder = ORDINAL_POSITION
INTO #ColDefs
FROM cte_Columns
ORDER BY ORDINAL_POSITION
GO

Note that numeric and decimal are synonyms, and real is equivalent to float(24). If you’re going to add in extra data types, you’ll have to modify the code that sets ColType.
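
For example, if you also wanted MAX-length strings and the precision on DATETIME2/TIME columns, the ColType expression could be extended along these lines (a sketch only, not exhaustive; DATETIME_PRECISION would also need adding to the column list in cte_Columns):

,ColType =
	CASE WHEN DATA_TYPE = 'real' THEN 'FLOAT' ELSE UPPER(DATA_TYPE) END
	+ CASE
	WHEN DATA_TYPE IN ('char','varchar','nchar','nvarchar')
		-- INFORMATION_SCHEMA.COLUMNS reports CHARACTER_MAXIMUM_LENGTH = -1 for MAX types
		THEN '(' + IIF(CHARACTER_MAXIMUM_LENGTH = -1, 'MAX', CAST(CHARACTER_MAXIMUM_LENGTH AS VARCHAR)) + ')'
	WHEN DATA_TYPE IN ('decimal','numeric') THEN '(' + CAST(NUMERIC_PRECISION AS VARCHAR) + ',' + CAST(NUMERIC_SCALE AS VARCHAR) + ')'
	WHEN DATA_TYPE IN ('float','real') THEN '(' + CAST(NUMERIC_PRECISION AS VARCHAR) + ')'
	WHEN DATA_TYPE IN ('datetime2','time','datetimeoffset') THEN '(' + CAST(DATETIME_PRECISION AS VARCHAR) + ')'
	ELSE '' END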

So, what do we have in our #ColDefs table?

SELECT * FROM #ColDefs

ColName           ColType         IsNullable  SortOrder
----------------- --------------- ----------- -----------
[ID]              INT             0           1
[Var_Int08]       TINYINT         1           2
[Var_Int16]       SMALLINT        1           3
[Var_Int32]       INT             1           4
[Var_Int64]       BIGINT          1           5
[Var_Float24]     FLOAT(24)       1           6
[Var_Float53]     FLOAT(53)       1           7
[Var_Decimal1801] DECIMAL(18,1)   1           8
[Var_Decimal3810] DECIMAL(38,10)  1           9
[Var_Date]        DATE            1           10
[Var_DateTime]    DATETIME        1           11
[Var_DateTime2]   DATETIME2       1           12
[Var_Char10]      CHAR(10)        1           13
[Var_Varchar20]   VARCHAR(20)     1           14
[Var_Nvarchar40]  NVARCHAR(40)    1           15
[Var_SmallMoney]  SMALLMONEY      1           16
[Var_Money]       MONEY           1           17

Now we have enough information to be able to recreate our CREATE, INSERT and SELECT statements:

DECLARE @CreateSQLCols VARCHAR(MAX)
	,@InsertSQLCols VARCHAR(MAX)
	,@SelectSQLCols VARCHAR(MAX)

;WITH cte_SQL AS
(
SELECT
		sep = IIF(SortOrder > 1, ',', '')
		,ColName
		,ColType
		,NullSQL = IIF(IsNullable = 1, 'NULL', 'NOT NULL')
		,SortOrder
		,MaxColNameLen = MAX(LEN(ColName)) OVER () -- For our nicely-tabbed SELECT statement
	FROM #ColDefs
)
SELECT
	@CreateSQLCols	= STRING_AGG(sep + ColName + ' ' + ColType + ' ' + NullSQL, CHAR(13) + CHAR(10)) WITHIN GROUP (ORDER BY SortOrder)
	,@InsertSQLCols	= STRING_AGG(sep + ColName, CHAR(13) + CHAR(10)) WITHIN GROUP (ORDER BY SortOrder)
	,@SelectSQLCols	= STRING_AGG(sep + ColName + REPLICATE(CHAR(9), 1 + (MaxColNameLen - LEN(ColName)+1)/4) + '= x.' + ColName, CHAR(13) + CHAR(10)) WITHIN GROUP (ORDER BY SortOrder)
FROM cte_SQL

SELECT CreateSQLCols = @CreateSQLCols, InsertSQLCols = @InsertSQLCols, SelectSQLCols = @SelectSQLCols, ColListSQL = REPLACE(@InsertSQLCols, CHAR(13) + CHAR(10), '')

which returns a single row with four columns in the Results pane.

We can now just cut’n’paste into our query window. CreateSQLCols is:

[ID] INT NOT NULL
,[Var_Int08] TINYINT NULL
,[Var_Int16] SMALLINT NULL
,[Var_Int32] INT NULL
,[Var_Int64] BIGINT NULL
,[Var_Float24] FLOAT(24) NULL
,[Var_Float53] FLOAT(53) NULL
,[Var_Decimal1801] DECIMAL(18,1) NULL
,[Var_Decimal3810] DECIMAL(38,10) NULL
,[Var_Date] DATE NULL
,[Var_DateTime] DATETIME NULL
,[Var_DateTime2] DATETIME2 NULL
,[Var_Char10] CHAR(10) NULL
,[Var_Varchar20] VARCHAR(20) NULL
,[Var_Nvarchar40] NVARCHAR(40) NULL
,[Var_SmallMoney] SMALLMONEY NULL
,[Var_Money] MONEY NULL

InsertSQLCols is the same list but just the column names. SelectSQLCols is the same, but with an assignment (e.g. [Var_Int32] = x.[Var_Int32]) so we can make changes to individual columns if we need to — note also we’ve calculated the number of tabs needed, so the columns line up, which makes it easier to read. ColListSQL is the same as InsertSQLCols, but all the columns are on one line (still comma-separated).

If we want to, we can build the whole CREATE statement (it’s officially called DDL, Data Definition Language):

SELECT CreateSQL = 'CREATE TABLE dbo.MyTable' + CHAR(13) + CHAR(10) + '(' + CHAR(13) + CHAR(10) + CHAR(9) + REPLACE(@CreateSQLCols, CHAR(13) + CHAR(10), CHAR(13) + CHAR(10) +  CHAR(9)) + CHAR(13) + CHAR(10) + ')' + CHAR(13) + CHAR(10)

NB: CHAR(13) + CHAR(10) is a carriage return + line feed, it moves us on to a new line. CHAR(9) is a tab.
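
The assembled statement then looks something like this (abbreviated here; the column list is exactly the CreateSQLCols output shown above):

CREATE TABLE dbo.MyTable
(
	[ID] INT NOT NULL
	,[Var_Int08] TINYINT NULL
	,[Var_Int16] SMALLINT NULL
	-- ...and so on for the remaining columns...
	,[Var_Money] MONEY NULL
)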

As you can hopefully see, when you have hundreds (and hundreds and hundreds…) of columns to deal with, it’s much easier to wrangle them in code, rather than type SQL out by hand and risk mistakes. If you’re not used to this way of writing SQL, it can seem daunting, but you quickly get the hang of it — and it’s a huge time-saver.


What’s going on with Var_Int64?

Finally, we come back to the issue with Var_Int64: it doesn’t look as random as the other columns.

Using NEWID() as a basis to generate other random data is pretty common; most of the time, we do something like this:

SELECT ABS(CAST(CAST(NEWID() AS BINARY(16)) AS INT)) % 20
-- or even:
SELECT ABS(CHECKSUM(NEWID())) % 20

which gives us a random integer between 0 and 19. (See CHECKSUM (Transact-SQL).) Now, while NEWID might give us a GUID that is random ‘overall’, not every part of it is wholly random.

Below is part of a recordset consisting of some randomly-generated GUIDs (using NEWID), the GUID converted to binary, then that binary converted to an integer:

guid                                 CAST(guid AS BINARY(16))           CAST([2] AS INT)
------------------------------------ ---------------------------------- --------------------
1F392DA2-9DEE-40CE-ACF5-0F70A1FB76E3 0xA22D391FEE9DCE40ACF50F70A1FB76E3 -5983859553463470365
30225F93-0C00-4A51-A63B-F84A0EE84177 0x935F2230000C514AA63BF84A0EE84177 -6468303442826215049
B7ED94A5-1274-4E96-B736-5E28C69B9BFF 0xA594EDB77412964EB7365E28C69B9BFF -5244901186802574337
753E907A-C500-4D83-879F-A76872407566 0x7A903E7500C5834D879FA76872407566 -8674030290257021594
D4C732E8-59DD-49D7-AF70-090F74A32DC3 0xE832C7D4DD59D749AF70090F74A32DC3 -5805129957694558781
D2855FC4-637B-45BA-BCB0-7BC16EA37648 0xC45F85D27B63BA45BCB07BC16EA37648 -4850240727962913208
5057A49D-0275-467A-A44D-81C3552D51A2 0x9DA4575075027A46A44D81C3552D51A2 -6607482402335010398
7916A696-F878-4245-A192-76CDC032306F 0x96A6167978F84542A19276CDC032306F -6804245460938510225
2D528B2C-574B-459A-8456-75B2A6B3CA03 0x2C8B522D4B579A45845675B2A6B3CA03 -8910805402544518653
130A63EE-9060-4434-8B3C-D05E3BFE1805 0xEE630A13609034448B3CD05E3BFE1805 -8413620900682917883

First, notice how the GUID is converted into binary: the last 16 digits are in the same order, but in the first 16, the bytes are reversed within each block. Now look at the 3rd block of digits in the GUIDs — they all begin with a ‘4’. And the first digit in the 4th block doesn’t range much either. In fact, I generated a million GUIDs, and for all of them, the 13th digit was ‘4’, and the 17th digit only took the values ‘8’, ‘9’, ‘A’ and ‘B’. The wiki entry (Universally unique identifier) explains why: NEWID() produces version-4 (i.e. randomly-generated) GUIDs, and the 13th digit is the version field, so it’s always ‘4’; the 17th digit contains the ‘variant’ bits, which for standard GUIDs are binary 10xx, i.e. ‘8’, ‘9’, ‘A’ or ‘B’. So it’s not surprising that these digits have restricted ranges.

Let’s see what happens when we convert some BIGINT values to a BINARY(16):

BIGINT                BINARY(16)
--------------------  ----------------------------------
-9223372036854775808  0x00000000000000008000000000000000  -- smallest possible value of BIGINT
         -2147483648  0x0000000000000000FFFFFFFF80000000  -- smallest possible value of INT 
              -32768  0x0000000000000000FFFFFFFFFFFF8000  -- smallest possible value of SMALLINT
                  -2  0x0000000000000000FFFFFFFFFFFFFFFE
                  -1  0x0000000000000000FFFFFFFFFFFFFFFF
                   0  0x00000000000000000000000000000000
                   1  0x00000000000000000000000000000001
                   2  0x00000000000000000000000000000002
               32767  0x00000000000000000000000000007FFF  -- largest possible value of SMALLINT
          2147483647  0x0000000000000000000000007FFFFFFF  -- largest possible value of INT
 9223372036854775807  0x00000000000000007FFFFFFFFFFFFFFF  -- largest possible value of BIGINT

A BIGINT is only 8 bytes, so when we convert from BINARY(16), the first 8 bytes (16 hex digits) get ignored; our final number is taken from the 17th to 32nd digits of the GUID only. As we’ve seen from the above, the 17th digit only ranges from ‘8’ to ‘B’ — and because all of those values have their top bit set, which is the sign bit of the BIGINT, the number will always be negative.

The lesson here is: however you create your random data in SQL, check that it really is random, especially if you’re using it for statistical purposes!
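
As an aside (my own suggestion, not something from the script above), if you do need BIGINT values spread across the whole range, one option is CRYPT_GEN_RANDOM, which returns cryptographically random bytes:

-- 8 random bytes cast to BIGINT: positive and negative values, across the full range
SELECT Var_Int64 = CAST(CRYPT_GEN_RANDOM(8) AS BIGINT)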


Scorecards versus Neural Networks

First, some explanatory points:

  • More often than not, scorecards are just scaled logistic regression models, if that makes this post easier to follow.
  • By ‘ML’, I mean the family of modern machine learning algorithms – neural networks (NNs), random forests, support vector machines, etc.
  • What I’ve written here has a big caveat: This happens in my experience. If you’ve experienced differently, please do let me know!

Introduction

I must apologise: the title, ‘Scorecards versus Neural Networks’, is sort of clickbait-y; these two things are related, but not the same — it’s a bit like saying ‘Bicycles versus Prams’, or ‘Pasta versus Bread’. The point of this blog post is that in my experience the people who build these models have different mindsets, which ⚠ spoiler ⚠ leads to scorecards having better results (in certain domains) than ML models.

Of course, neural networks can do things that a scorecard could never do — image recognition, for example — but there are good model-building practices that should apply whatever flavour of model you’re creating.

Scorecards

I’ve worked for a number of consumer lending businesses, and all of them used (one or more) scorecards as their primary lending decision mechanism. This isn’t just because I worked there, and scorecards are my ‘thing’; all the heads of Credit Risk I’ve worked with wanted to use scorecards. Scorecards are an old-fashioned technology, but they’re very easy to understand; they’re not a ‘black box’ like more modern ML techniques. Building a good scorecard is somewhere between an art and a science, but understanding it, what it’s doing and why it arrives at its final score, couldn’t be simpler: it’s just a case of adding up points for positive characteristics, taking away points for negative. An increase in a variable relating to a positive characteristic, leaving all other variables alone, gives you an increase in score, and vice-versa*. Due to this simple nature, scorecards are trivial to code ‘in production’.

However, neural networks and other ML models tend to be highly non-linear, and there’s absolutely no guarantee that changing one input variable will give you the simple increase or decrease in score you were expecting. And they are definitely not trivial to implement.

* This isn’t necessarily true if you build bad scorecards, but I try not to build bad scorecards 😉


Machine learning getting it wrong

Here are a few instances of machine learning models not behaving as expected; they’re not in the credit risk domain, but that doesn’t matter, as there are principles in common.

1. “‘Typographic attack’: pen and paper fool AI into thinking apple is an iPod” (2021)

But even the cleverest AI can be fooled with the simplest of hacks. If you write out the word “iPod” on a sticky label and paste it over the apple, Clip does something odd: it decides, with near certainty, that it is looking at a mid-00s piece of consumer electronics [rather than an apple]. In another test, pasting dollar signs over a picture of a dog caused it to be recognised as a piggy bank.

https://www.theguardian.com/technology/2021/mar/08/typographic-attack-pen-paper-fool-ai-thinking-apple-ipod-clip

2. “Amazon scraps secret AI recruiting tool that showed bias against women” (2018)

But by 2015, the company realized its new system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way. That is because Amazon’s computer models were trained to vet applicants by observing patterns in resumes submitted to the company over a 10-year period. Most came from men, a reflection of male dominance across the tech industry. In effect, Amazon’s system taught itself that male candidates were preferable.

https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

3. “A bookshelf in your job screening video makes you more hirable to AI” (2021)

The addition of art or a bookshelf in the background made an Asian test subject seem much more conscientious and significantly less neurotic compared to the same faux applicant in front of a plain background.

https://www.inputmag.com/culture/a-bookshelf-in-your-job-screening-video-makes-you-more-hirable-to-ai

4. “AI image recognition fooled by single pixel change” (2017)

Computers can be fooled into thinking a picture of a taxi is a dog just by changing one pixel, suggests research.

https://www.bbc.com/news/technology-41845878

5. “Twitter apologises for ‘racist’ image-cropping algorithm” (2020)

Twitter has apologised for a “racist” image cropping algorithm, after users discovered the feature was automatically focusing on white faces over black ones. The company says it had tested the service for bias before it started using it, but now accepts that it didn’t go far enough.

https://www.theguardian.com/technology/2020/sep/21/twitter-apologises-for-racist-image-cropping-algorithm


From these stories, you could be forgiven for concluding that ML models aren’t all that they’re cracked up to be. Actually, I rather think the stories point to one or more of the following issues:

  • A misunderstanding of the problem domain
  • Inadequate or incorrect training data
  • A lack of testing, post-model build
  • Inadequate monitoring of models in production
  • Confusion about the limitations of models

Scorecards getting it wrong?

The above stories are all high-profile, because ‘AI’ is news. Scorecards aren’t news, they’re niche; and if a company lends to the wrong people due to a fault with the scorecards, the general public doesn’t get to hear about it.

Scorecards can suffer from each of the five issues I itemised above, in exactly the same way as ML. But scorecard builders will likely spend a lot more time ensuring their scorecard is valid.

I don’t really know of any instances of scorecards going wrong, other than data being incorrectly coded; it’s unfortunately quite common for missing data (blanks, NULLs) or extreme (‘out-of-bounds’) values to get coded as the ‘default’ modal value. There’s an apocryphal story about a meaningless random variable making it all the way through to production, but I think the story exists to scare junior scorecard builders. 😉

But really, a scorecard shouldn’t go live without being thoroughly (and provably) tested. It’s a serious business, and there’s often a great deal of money at stake.


What are neural networks actually doing?

The channel 3Blue1Brown (run by mathematician Grant Sanderson) has a series of ‘Deep learning’ videos that explain neural networks from the ground up. The first video, But what is a Neural Network?, describes a dataset of handwritten digits, scanned and discretised, that are used as example data to build a neural network. In fact, the classification rate is very impressive: the neural network classifies 98% of digits correctly.

In the second video, Gradient descent, how neural networks learn, we’re shown pictorially (at around the 14 minute mark) what the network is doing – it’s not picking out lines and loops and shapes common to numbers, as you might expect, but:

“[I]t would seem that in the unfathomably large 13,000 dimensional space of possible weights and biases, our network found itself a happy little local minimum that, despite successfully classifying most images, doesn’t exactly pick up on the patterns that we might have hoped for…”

To human eyes, the patterns don’t look like anything at all, certainly nothing resembling numbers. And yet the NN is working at an accuracy rate of 98%! Even more disturbing:

… and to really drive this point home, watch what happens when you input a random image; if the system is smart, you might expect it to either feel uncertain, maybe not really activating any of those 10 output neurons, or activating them all evenly, but instead it confidently gives you some nonsense answer

So, with regards to this particular neural network:

  • You can’t tell how it works — you only know that it does.
  • If you put nonsense data in, you get a real (but totally irrelevant) answer out.

As is pointed out in the videos, the network used for this model is somewhat old-fashioned; there are more modern alternatives.


A couple of anecdotes (not quite ‘case studies’) from my own personal experience:

Anecdote #1: Scorecard vs. neural networks in consumer lending

At one of the businesses I worked for, we trialled a neural network model to see if it could out-perform our recently-built scorecards. A well-known company (in the decision science arena) took our data and used their flagship tool to build a neural net model. Their whole process was web-based, end to end, and the interface (GUI/UX) was beyond impressive. They presented their model, which had a gini value 5 percentage points higher than our scorecard; that is, more predictive, and enough so that the savings made by better predictions were greater than the cost of using their software. [NB: The gini is just a classification score, from 0% (no better than randomly guessing) to 100% (perfect answer every time).]

However: this initial model used hundreds of variables (if I recall correctly, around 300), as opposed to our scorecard’s 20 or so. A very important part of managing risk is understanding the variables that make up a model, how they might vary over time, and their impact on the result. Because of their approach of using all the data available without question, their model had included a variable that was actually removed half-way through the lifetime of the build population, and wasn’t available in production. Once this variable was removed, and the number of variables reduced to under 100, their model gini was only marginally higher than our scorecard’s. Taking into account the operating cost of the model, there was no financial gain to be had from using the more complicated model.

I subsequently worked with another lender (in this case, B2B) who used the same decision science company’s product to provide a credit-worthiness score – it was one of several scores available to underwriters, all the loans were manually assessed. The lead underwriter said that in her opinion, the score was very good, but would occasionally give scores that didn’t make sense alongside their other data. This meant underwriters began distrusting the neural network score, and because there was no way of understanding WHY the score was how it was, it ended up unused in the overall decision-making process, even though they were still paying for it for each application.

For me, the lesson here is not that the model (or software) was in any way faulty – it did exactly what was being asked of it. But there was a lack of attention to detail with regard to (a) the input data, and (b) how the models were going to be used in practice.

Anecdote #2: Scorecard vs. ML models in B2B marketing

I was very recently involved in a marketing project that pitted a scorecard against a random forest model, built by a third-party company (not the same company as in the previous section). The dataset comprised several thousand companies, with some basic data (e.g. location, size, industry classification code), and an outcome indicating whether previous marketing had been successful. For new companies, could we predict whether marketing to them would be worth the cost?

Apart from the different domain, it was a very similar exercise to the above. The main difference was in the lack of ‘per company’ data; for lending models, you typically have access to thousands of variables from a credit file. Instead, we had to get creative, and derive our own variables from the limited amount of information we had.

The third-party had gone through several iterations of models — including neural networks, support vector machines (SVMs) and logistic regression, among others — and settled upon a random forest as giving the best results.

In the end, despite taking more than four times as many person-days to build, when used in production, their random forest model had a slightly lower gini than our scorecard — it wasn’t as good. In my post-project documentation, I put this down to a lack of creativity at the variable creation stage; during our build, we’d managed to generate dozens of new variables from our existing data, a few of which were very predictive.

For full disclosure, I should mention that we subsequently ran both models over a dataset that the models weren’t built on, just to see what would happen: the random forest did pretty well, the scorecard was awful! Unfortunately, there was no allotted time to investigate why we saw these results, but it shows that there was a robustness to the ML model that we weren’t expecting.


End notes

I think my central point is fairly obvious, but to state it explicitly: a simple model built by a domain expert will be (probably much) better than a complicated model built by an expert in building complicated models. To illustrate the difference: for consumer lending scorecards I’ve built, the timeline was divided up approximately like this:

Stage                                               Time taken
--------------------------------------------------  ----------
Data gather (including cleaning)                    20%
Variable assessment (including creating new ones)   25%
Scorecard build                                     15%
Checking, validating and documenting the model      40%

But when third-party companies have built models, their timeline was more like this:

Stage                                               Time taken
--------------------------------------------------  ----------
Data gather (including cleaning)                    5%
Variable assessment (including creating new ones)   5%
Model build                                         80%
Checking, validating and documenting the model      10%

Nearly all the time is put into the model build — e.g. trying different algorithms (neural net vs random forest vs logistic regression …), tuning hyper-parameters, etc., in order to get a good classification rate.

But building a good model is about so much more than the overall classification rate, or the algorithm you choose. There’s a huge chunk of work to be done, pre- and post- model build. If that gets ignored, then the selected model could be a poor one. I’m sure the third-party companies were perfectly capable of doing the extra work; but it has to be communicated, costed, and agreed upon.


A final couple of points:


Turning integers into rational factorial expressions

Caution! Maths up ahead!

I have a long-term side project that involves dealing with rational factorial expressions that reduce to (often fairly large) integers, e.g.:

\frac{20! 18! 11!}{15! 12! 9!} = 2735405697024000

Typically, I’ll start with an integer (a result of a sum of combinatorial expressions), and need to turn it back into a purely factorial expression. My initial approach was to create a database of these integers: loop over all the possible permutations of factorials, and store the results. Here’s a snippet of the data:

(17!.17!.9!.7!)/(13!.7!.1!) = 7372584295195056537600000
(17!.17!.9!.7!)/(13!.7!.2!) = 3686292147597528268800000
(17!.17!.9!.7!)/(13!.7!.3!) = 1228764049199176089600000
(17!.17!.9!.7!)/(13!.7!.4!) = 307191012299794022400000
(17!.17!.9!.7!)/(13!.7!.5!) = 61438202459958804480000
(17!.17!.9!.7!)/(13!.7!.6!) = 10239700409993134080000

As you’d expect, this database got huge very quickly; this one particular formulation (a product of four factorials over a product of three factorials, using values up to 20) results in a table of 13 million rows, 643 MB in size. Although it works in principle, it’s not terribly practical.

Now, from our O-level Maths days, we know that all integers can be factorised into powers of primes, e.g.:

2735405697024000 = 2^{14} \times 3^{5} \times 5^{3} \times 7^{1} \times 11^{1} \times 13^{1} \times 17^{2} \times 19^{1}

The statement that every integer greater than 1 is either a prime, or a product of primes, is known as the Fundamental theorem of arithmetic, due to Gauss. For the above integer, I calculated the exponents in R using the factorize function from the gmp package, which is used for manipulating big integers and rationals of any size.

table(as.integer(factorize(2735405697024000)))
 2  3  5  7 11 13 17 19 
14  5  3  1  1  1  2  1 

If we factorise our six factorials in the first identity similarly, and tabulate them nicely, we get:

            (power of)
N!      2  3  5  7 11 13 17 19 
------------------------------
20! =  18  8  4  2  1  1  1  1 
18! =  16  8  3  2  1  1  1  0
11! =   8  4  2  1  1  0  0  0
------------------------------
15! =  11  6  3  2  1  1  0  0
12! =  10  5  2  1  1  0  0  0
 9! =   7  4  1  1  0  0  0  0
------------------------------
ans.=  14  5  3  1  1  1  2  1

For each power, the sums of the columns in the top block (the numerator of the LHS of the identity), minus those in the bottom block (denominator), give the power in the answer, e.g. in the 2 column, (18+16+8) – (11+10+7) = 14, which is the power of 2 in the result.

So if we want to turn an integer into a rational factorial expression, we just need a way of determining which linear combinations of the powers of our factorized factorials will give us the same powers as our target integer. For this, we need to turn our prime exponents into a matrix, and compute the Reduced row echelon form; as ever, there’s an existing R package that contains a function we can use!

Worked example

In order to demonstrate the process, we’ll keep the numbers small. Let’s say we have an integer, 120960, and we want to know if it can be turned into a rational factorial expression — it may not be possible.

120960 = 2^{7} \times 3^{3} \times 5^{1} \times 7^{1}

Here’s our table of factorial factors (which I’ve called factpp), and above it, the code I used to generate it:

require(gmp); # gmp library
maxprime <- 7;
pm <- primes(maxprime); # 2,3,5,7
npm <- length(pm);
factpp <- matrix(rep(0,(maxprime-1)*npm),nrow=(maxprime-1),ncol=npm); # factorial prime powers
for(i in 1:(maxprime-1))
{
	tv <- table(as.integer(factorize(factorial(i+1))));
	for(j in 1:npm) {
		snpm <- sprintf("%d",pm[j]);
		if (snpm %in% names(tv)) {
			factpp[i,j] <- tv[snpm];
		}
	}
}

factpp
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    1    1    0    0
[3,]    3    1    0    0
[4,]    3    1    1    0
[5,]    4    2    1    0
[6,]    4    2    1    1

(Note how the exponents are non-decreasing – a proof of this can be found here: Prime Factors of Factorial Numbers)

My code is pretty horrible (matching by converting ints to strings, yuck), and it can definitely be hugely improved – but I only need it once; when I’ve generated the prime factors for all the factorials I’m interested in, I don’t need to run it again.

We can check that the exponents make sense:

apply(factpp, 1, function(v) { (2^v[1])*(3^v[2])*(5^v[3])*(7^v[4])});
[1]    2    6   24  120  720 5040

These are clearly the values of 2! up to 7!. (We can ignore 1! = 1, we don’t need it.) Because the highest prime factor in 120960 is 7, we don’t need to consider factorials that have prime factors greater than 7.

To compute our reduced row echelon form, we require the function rref in the package pracma (see here for documentation).

require(pracma);
input <- cbind(t(factpp), c(7,3,1,1)); # bind factorial exponents to target exponents 
rref(input);
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    2    0    0    0    2
[2,]    0    1    1    0    1    0    1
[3,]    0    0    0    1    1    0    0
[4,]    0    0    0    0    0    1    1

This matrix is actually giving us a system of simultaneous equations, one equation per row; the values in the first 6 columns are the coefficients of the variables, and the final column on the right is the value on the right-hand side of the equals sign. To interpret this matrix (i.e. the output of rref), this video Interpreting RREF Augmented Matrices by MATHguide is succinct and useful.

With our reduced row echelon form, there are three possibilities:

  • There is exactly one solution
  • There are no solutions
  • There are infinitely many solutions

Because we have 4 equations and 6 variables, we know that we don’t have a single solution. From the above MATHguide video, we can see that the output of rref indicates we have infinitely many solutions – but that’s ok, it’s not going to cause us a problem.

From the output, our simultaneous equations can be written as:

a + 2c = 2
b + c + e = 1
d + e = 0
f = 1

which we can reduce to the following result vector, in terms of the two free variables c and e:

(2-2c, 1-c-e, c, -e, e, 1)

So, for any integer values of c and e, we will get a valid result. The entries of this vector are the powers of the factorials 2! to 7! in our rational expression. Let’s look at some examples:

 c   e  vector                  resulting factorial expression
--  --  ----------------------  ----------------------------------
 0   0  (2,  1,  0,  0, 0, 1)   2!2!3!7!
 1   0  (0,  0,  1,  0, 0, 1)   4!7!
 0   1  (2,  0,  0, -1, 1, 1)   (2!2!6!7!) / 5!
 1   1  (0, -1,  1, -1, 1, 1)   (4!6!7!) / (3!5!)
-2   2  (6,  1, -2, -2, 2, 1)   (2!2!2!2!2!2!3!6!7!) / (4!4!5!5!)

All of these are valid solutions (they all equal 120960), and we can choose whichever one best fits our needs.
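
As a quick sanity check, here’s the third row of the table (c = 0, e = 1) evaluated by hand:

\frac{2! \, 2! \, 6! \, 7!}{5!} = \frac{4 \times 720 \times 5040}{120} = 4 \times 720 \times 42 = 120960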

(Why do we have more than one solution? Because simple variations of factorial products have the same values, e.g. 2!2!3! = 4! and 6!7! = 10!. If we removed rows from factpp — e.g. we removed the first row, relating to 2! — then we’d get fewer possible results.)


Practical queries using binary operations

One of my ongoing ‘fun’ projects is a Sudoku-solver, written using only SQL. I’m not interested in brute-forcing a solution — there already exists a beautiful (and tiny) piece of SQL to do that: see Using Common Table Expressions to solve Sudoku in Microsoft SQL Server — rather, I’m trying to replicate the way I’d solve a Sudoku with a pen and paper.

Some parts of my code involve representing the combination of possible values of each empty Sudoku cell as a single integer, by using a single bit (‘binary digit’) for each of the values 1 to 9. This makes some of the coding much easier than it otherwise would be. Encoding binary information as integers, and then querying that binary data, isn’t something that’s widely done within SQL, but it’s occasionally the nicest* way of attacking a problem.

* For some value of ‘nice’, please argue amongst yourselves(!)


Very quick refresher: binary

(If you know about binary, you can skip this section.)

A binary representation of a number uses just the two digits zero and one. For a string of binary digits, the nth digit from the right represents the decimal value 2^(n-1) — where in standard mathematical notation, the caret character ‘^’ means ‘to the power of’. Annoyingly, as we’ll see shortly, ‘^’ means something different in SQL. Two to the power of (n-1) in SQL is calculated with the function POWER(2,n-1).

Example: 110 in binary is equivalent to 1 * POWER(2,2) + 1 * POWER(2,1) + 0 * POWER(2, 0) = 6 in decimal. Note that we didn’t need to consider the last part, 0 * POWER(2, 0), as zero times anything is zero. Accordingly, 1000001 is POWER(2,6) + POWER(2,0) = 64 + 1 = 65.

To manipulate binary numbers, there are standard ‘logical operators’ AND, OR, EOR that take two binary numbers as input and return a binary number as output; see Bitwise operation at Wikipedia for more detail. The operator NOT takes one input and simply ‘flips’ the bits such that 0 becomes 1, and 1 becomes 0. We don’t have to explicitly convert our numbers to binary to use these operators, as integers are already stored as binary numbers in memory.
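
In T-SQL these operators are written as ‘&’ (AND), ‘|’ (OR), ‘^’ (EOR) and ‘~’ (NOT). Here’s a tiny illustration (separate from the example that follows) using the numbers 6 (binary 110) and 3 (binary 011):

SELECT
	 [6 AND 3] = 6 & 3	-- 110 AND 011 = 010 = 2
	,[6 OR 3]  = 6 | 3	-- 110 OR  011 = 111 = 7
	,[6 EOR 3] = 6 ^ 3	-- 110 EOR 011 = 101 = 5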

If you want to convert between binary (aka ‘base 2’) and decimal (‘base 10’), you can use the functions from my previous post, Encrypting IDs using a custom alphabet:

SELECT dbo.fn_BaseEncrypt(65, '01')
1000001
SELECT dbo.fn_BaseDecrypt('1000001', '01')
65


(Non-Sudoku) Example

We have 7 employees (identified by EmployeeID) that work various days of the week. We represent the data in two different datasets: in the first, the data is normalised, and each row contains the EmployeeID, and a Day value to represent which day that employee works: 1 = Sunday, 2 = Monday, …, 7 = Saturday; one row per day. In the second dataset, there’s only one row per employee, and a single integer indicates the combination of days: the 1st (right-most) bit is ‘1’ if they work Sunday, ‘0’ if not; the 2nd bit is ‘1’ if they work Monday, ‘0’ if not; etc, up to the 7th bit representing Saturday. This integer can take all values from 0 (no days worked) to 127 (all days worked).

Using our datasets, we want to answer some straightforward questions:

  1. Which employees work on Wednesday?
  2. Which employees work on Monday and Wednesday?
  3. Which employees work on Wednesday but not Monday?
  4. Which employees work opposite days to each other, such that between them they work the whole week, but neither works on the same day?
  5. Which employees work 5 or more days in a week?

The way in which these questions are answered using the two different datasets (normalised and binary) will demonstrate how using binary data can sometimes lead to simpler code.


Normalised data

Here’s our normalised data, one record per EmployeeID / Day:

DROP TABLE IF EXISTS #DaysWorked
GO

CREATE TABLE #DaysWorked
(
	EmployeeID INT NOT NULL
	,[Day] TINYINT NOT NULL
	,PRIMARY KEY CLUSTERED (EmployeeID, [Day])
)
GO

INSERT #DaysWorked(EmployeeID, [Day])
	VALUES
	 (10,2),(10,3),(10,4),(10,5),(10,6)		-- Mon to Fri
	,(20,2),(20,3),(20,4),(20,5)			-- Mon to Thu
	,(30,6),(30,7),(30,1)				-- Fri, Sat, Sun
	,(40,3),(40,4),(40,5)				-- Tue, Wed, Thu
	,(50,2),(50,3),(50,4),(50,5),(50,6),(50,7)	-- Mon to Sat
	,(60,3),(60,4),(60,5),(60,6),(60,7)		-- Tue to Sat
	,(70,1)						-- Sun only
GO

Binary data

Here’s the same data, but using a single integer per employee, DayBits, to represent the collection of days worked:

DROP TABLE IF EXISTS #DaysWorkedBinary
GO

CREATE TABLE #DaysWorkedBinary
(
	EmployeeID INT NOT NULL
	,DayBits TINYINT NOT NULL
	,PRIMARY KEY CLUSTERED (EmployeeID)
	,INDEX IX_DaysWorkedBinary_DayBits (DayBits)
)
GO

INSERT #DaysWorkedBinary(EmployeeID, DayBits)
	VALUES		-- SFTWTMS
	 (10, 62)	-- 0111110
	,(20, 30)	-- 0011110
	,(30, 97)	-- 1100001
	,(40, 28)	-- 0011100
	,(50,126)	-- 1111110
	,(60,124)	-- 1111100
	,(70,  1)	-- 0000001
GO

As you can see from the comments alongside each row, each day corresponds to a 0/1 bit, where the values of those bits are the powers of 2: from the right, 1, 2, 4, 8, 16, 32 and 64.

For the sake of completeness — just how did we calculate the values of the DayBits field? Like this:

SELECT
		d.EmployeeID
		,DayBits = SUM(POWER(2, d.[Day]-1))
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	ORDER BY d.EmployeeID
GO

To go back the other way, turning the binary data into normalised, one row per employee/day:

SELECT
		d.EmployeeID
		,[Day] = v.n
	FROM #DaysWorkedBinary d
	JOIN (
		VALUES(1),(2),(3),(4),(5),(6),(7)
	) v(n)
		ON d.DayBits & POWER(2, v.n-1) > 0 
	ORDER BY d.EmployeeID, [Day]
GO

So, we have our two datasets (normalised data: #DaysWorked, and binary data: #DaysWorkedBinary), and we can start to answer our five questions.


Questions 1, 2 and 3

Normalised version

Answering question 1, “Which employees work on Wednesday?” is trivial:

SELECT EmployeeID
	FROM #DaysWorked
	WHERE [Day] = 4
	ORDER BY EmployeeID
GO
EmployeeID
-----------
10
20
40
50
60

For question 2, “Which employees work on Monday and Wednesday?”, we need to work a little harder:

-- 2a
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	JOIN #DaysWorked d2
		ON d2.EmployeeID = d1.EmployeeID
	WHERE d1.[Day] = 4 
	AND d2.[Day] = 2
	ORDER BY d1.EmployeeID
GO
EmployeeID
-----------
10
20
50

We can write this query in other ways:

-- 2b
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	WHERE d1.[Day] = 4 
	AND d1.EmployeeID IN (SELECT d2.EmployeeID FROM #DaysWorked d2 WHERE d2.[Day] = 2)
	ORDER BY d1.EmployeeID
-- 2c
SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING SUM(IIF(d.[Day] IN (2,4),1,0)) = 2
	ORDER BY d.EmployeeID
-- 2d
SELECT EmployeeID
	FROM #DaysWorked
	PIVOT ( COUNT([Day]) FOR [Day] IN ([2],[4])) p
	WHERE [2] = 1 AND [4] = 1
	ORDER BY EmployeeID

All four queries return the same result set. Note that a and b have identical execution plans; the plans for c and d aren’t identical, but they’re very similar to each other.

Question 3, “Which employees work on Wednesday but not Monday?”, is similar in construction to question 2:

-- 3a
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	LEFT JOIN #DaysWorked d2
		ON d2.EmployeeID = d1.EmployeeID
		AND d2.[Day] = 2
	WHERE d1.[Day] = 4 
	AND d2.EmployeeID IS NULL
	ORDER BY d1.EmployeeID
-- 3b
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	WHERE d1.[Day] = 4 
	AND d1.EmployeeID NOT IN (SELECT d2.EmployeeID FROM #DaysWorked d2 WHERE d2.[Day] = 2)
	ORDER BY d1.EmployeeID
-- 3c
SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING SUM(IIF(d.[Day]=4,1,0)) = 1
	AND SUM(IIF(d.[Day]=2,1,0)) = 0
	ORDER BY d.EmployeeID
-- 3d
SELECT EmployeeID
	FROM #DaysWorked
	PIVOT ( COUNT([Day]) FOR [Day] IN ([2],[4])) p
	WHERE [2] = 0 AND [4] = 1
	ORDER BY EmployeeID

Here, the execution plans for a and b are very similar, but not identical.

Binary version

Now we’ll answer the same questions, but using our binary dataset.

Question 1:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 8 = 8
	ORDER BY d.EmployeeID
GO
EmployeeID
-----------
10
20
40
50
60

The same results as before; the only real structural difference is in the WHERE clause. We’re using ‘&’ (AND) to pull out the records that match. The 8 is the 4th bit from the right (1,2,4,8), i.e. Wednesday. Our query returns those records for which the DayBits integer has the 4th bit ‘set’ (or ‘switched on’).
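
As a small variation (my own tweak, not from the original queries), the 8 can be derived from the day number rather than hard-coded:

DECLARE @Day TINYINT = 4	-- 4 = Wednesday

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & POWER(2, @Day - 1) = POWER(2, @Day - 1)
	ORDER BY d.EmployeeID
GO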

Question 2:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 10 = 10
	ORDER BY d.EmployeeID

Same premise as before, but we’re checking whether two bits (the 2nd and 4th, corresponding to Monday and Wednesday) are set.

Question 3:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 10 = 8 
	ORDER BY d.EmployeeID

Similar structure, but here we only want those records where the 4th bit is set (equal to 1), and the 2nd is unset (equal to zero).


Question 4

This is a trickier question: “Which pairs of employees work opposite days to each other, such that between them they work the whole week, but neither works on the same day?”.

Here’s my solution:

Normalised version

;WITH cte_CrossJoin AS
(
	SELECT
			EmployeeID1	= d1.EmployeeID 
			,EmployeeID2	= d2.EmployeeID
			,Emp1Days	= COUNT(DISTINCT d1.[Day])
			,Emp2Days	= COUNT(DISTINCT d2.[Day])
			,[Matched]	= SUM(IIF(d2.[Day] = d1.[Day], 1, 0))
		FROM #DaysWorked d1
		CROSS JOIN #DaysWorked d2
		WHERE d2.EmployeeID > d1.EmployeeID
		GROUP BY d1.EmployeeID , d2.EmployeeID
)
	SELECT EmployeeID1, EmployeeID2
		FROM cte_CrossJoin
		WHERE Emp1Days + Emp2Days = 7
		AND [Matched] = 0
GO
EmployeeID1 EmployeeID2
----------- -----------
20          30
50          70

(I have to admit, I went through several different attempts — this is number 4 or 5. It’s reasonably compact, but due to the CROSS JOIN, it’s more ‘brute-force’ than I’d like. If you have a more refined solution, please let me know!)

Binary version

The binary version is simpler:

SELECT
		Employee1 = d1.EmployeeID
		,Employee2 = d2.EmployeeID
	FROM #DaysWorkedBinary d1
	JOIN #DaysWorkedBinary d2
		ON d2.EmployeeID > d1.EmployeeID
	WHERE d1.DayBits ^ d2.DayBits = 127
GO

In SQL, the ‘^’ (caret) character means ‘EOR’, ‘exclusive OR’ (sometimes seen as ‘XOR’). The bits representing the days have to be different on each side, but they have to exist on at least one.
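
An equivalent way of writing the condition (a sketch that should return the same pairs) is to ask for no days in common and full coverage of the week, using ‘&’ and ‘|’ explicitly:

SELECT
		Employee1 = d1.EmployeeID
		,Employee2 = d2.EmployeeID
	FROM #DaysWorkedBinary d1
	JOIN #DaysWorkedBinary d2
		ON d2.EmployeeID > d1.EmployeeID
	WHERE d1.DayBits & d2.DayBits = 0	-- no day worked by both
	AND d1.DayBits | d2.DayBits = 127	-- between them, all seven days covered
GO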


Question 5

The final question is “Which employees work 5 or more days in a week?”. This is trivial using the normalised data; and given that, so far, all the queries using the binary data are smaller, you’d expect that to be the case here too.

Normalised version

Very straightforward, we use a HAVING clause to restrict the number of records (in our case, employee/days) to greater than or equal to 5:

SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING COUNT(1) >= 5
	ORDER BY d.EmployeeID
GO
EmployeeID
-----------
10
50
60

Binary version

Sadly, the binary version isn’t particularly neat. This is one of the limitations of using binary data: there’s no native function to count how many bits are ‘on’ (set) in a given integer. (If there was, it would again be trivial.) We have to do the counting ourselves; first, a longhand version:

;WITH cte_BitCount AS
(
	SELECT
			EmployeeID
			,BitCount =	  IIF(DayBits&1>0,1,0)  + IIF(DayBits&2>0,1,0)
					+ IIF(DayBits&4>0,1,0)  + IIF(DayBits&8>0,1,0)
					+ IIF(DayBits&16>0,1,0) + IIF(DayBits&32>0,1,0)
					+ IIF(DayBits&64>0,1,0) 
		FROM #DaysWorkedBinary
)
SELECT EmployeeID
	FROM cte_BitCount
	WHERE BitCount >= 5
	ORDER BY EmployeeID

GO

And here’s a more extensible way of doing the same calculation:

;WITH cte_BitCount AS
(
	SELECT
			d.EmployeeID
			,BitCount = SUM(SIGN(d.DayBits & POWER(2,v.n-1)))
		FROM #DaysWorkedBinary d
		CROSS APPLY (
			VALUES(1),(2),(3),(4),(5),(6),(7)
		) v(n)
		GROUP BY d.EmployeeID
)
SELECT EmployeeID
	FROM cte_BitCount
	WHERE BitCount >= 5
	ORDER BY EmployeeID
GO	

but it’s still not as concise as the query over the normalised data.
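
One aside: newer engines (SQL Server 2022, and current Azure SQL Database) do have a built-in BIT_COUNT function; if you’re on one of those, the binary version becomes trivial again:

-- Requires SQL Server 2022+ (or Azure SQL Database)
SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE BIT_COUNT(d.DayBits) >= 5
	ORDER BY d.EmployeeID
GO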

Some further reading on counting bits:

And finally

I hope I’ve demonstrated that there are times when using binary / bit-wise logic makes your code more elegant and compact. It’s definitely useful in cases where there are fixed numbers of discrete ‘flags’ — units of time like days and months being good examples, as they’re never going to change — but you have to be careful not to implement these patterns without some consideration. For one, they may not be particularly friendly to colleagues who don’t have a maths/CS background. You can always swap between normalised/binary ‘on the fly’ (as per the queries near the top that convert one dataset into the other) if you need to.

There’s some more to be written about aggregating over binary data, as there are no in-built SQL functions for this; hopefully that’ll be a future post.


Encrypting IDs using a custom alphabet

Let’s say you want to track link clicks in a marketing email you’re sending out. The best way to track these is to put a GUID in all the links, so every click is uniquely identifiable. In order for this to work, you’ll need to have a record of all the GUIDs, stored alongside the destination email address (and/or customer ID). But if you’re sending many emails to millions of customers, that’s a lot of GUIDs to store.

If you’re not too worried about people fiddling with the links, you could encrypt an ID (e.g. CustomerID), and put that in the link instead of a GUID; that means you wouldn’t have to keep any record in your database, you could just decrypt the ID string next time you saw it. So let’s look at a way in which we can encrypt IDs… but, just to be clear, this is quite weak encryption, you shouldn’t use it for anything important. Only consider it if the worst thing that can happen as a result of someone cracking your code is mild inconvenience!

One way to encrypt an integer ID is to change its base, and use a randomised alphabet. Here’s the function I’m going to use:

CREATE OR ALTER FUNCTION dbo.fn_BaseEncrypt
(
	@base10 INT
	,@alphabet VARCHAR(255)
)
RETURNS VARCHAR(255)
AS
BEGIN

	DECLARE @newBaseLen INT = LEN(@alphabet)

	DECLARE @remaining INT, @result VARCHAR(255) = ''

	SET @remaining = @base10

	WHILE (@remaining > 0)
	BEGIN
		SET @result = SUBSTRING(@alphabet, 1 + (@remaining % @newBaseLen), 1) + @result
		SET @remaining = @remaining / @newBaseLen
	END

	RETURN(@result)

END
GO

(NB: None of the code presented here is bullet-proof, production-ready code. For clarity, I’ve left out things like checking the input parameters.)

It’s reasonably simple: there’s a loop that repeatedly modulos (that’s the ‘%’ character) and divides the starting number, adding the relevant indexed character from our alphabet each time.
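
To make the mechanics concrete, here’s a worked trace for 254 with the octal alphabet (matching the example in the table below):

-- 254 % 8 = 6  ->  '6'    254 / 8 = 31
--  31 % 8 = 7  ->  '7'     31 / 8 = 3
--   3 % 8 = 3  ->  '3'      3 / 8 = 0   (loop ends)
-- Each character is prepended, so the result is '376':
SELECT Result = dbo.fn_BaseEncrypt(254, '01234567')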

Some simple examples:

SELECT
		[Description]
		,Alphabet
		,[Value]
		,Result = dbo.fn_BaseEncrypt([Value], Alphabet)
	FROM 
	(
		VALUES
			('Binary', '01', 252)
			,('Trinary', '012', 253)
			,('Octal', '01234567', 254)
			,('Base 9', '012345678', 1234567)
			,('Hexadecimal', '0123456789ABCDEF', 98765432)
	) x([Description], Alphabet, [Value])
GO
Description Alphabet         Value       Result
----------- ---------------- ----------- ---------
Binary      01               252         11111100
Trinary     012              253         100101
Octal       01234567         254         376
Base 9      012345678        1234567     2281451
Hexadecimal 0123456789ABCDEF 98765432    5E30A78

Don’t take my word for it that these values are correct! For the hexadecimal one, you can type '98765432 in hex' straight into your address bar in Google Chrome. For the others, there are sites like ExtraConversion and RapidTables.

Here’s the accompanying decryption function:

CREATE OR ALTER FUNCTION dbo.fn_BaseDecrypt
(
	@strValue VARCHAR(255)
	,@alphabet VARCHAR(255)
)
RETURNS INT
AS
BEGIN
	DECLARE @newBaseLen INT = LEN(@alphabet)
	
	DECLARE @result INT = 0, @m INT = 1, @i INT, @l INT = LEN(@strValue)
	DECLARE @revStrValue VARCHAR(255) = REVERSE(@strValue)
	SET @i = @l
	WHILE(@i > 0)
	BEGIN
		-- Work from the least significant character: add (digit value) * (base ^ position)
		SET @result = @result + @m * (CHARINDEX(SUBSTRING(@revStrValue, @l-@i+1, 1), @alphabet) - 1)
		SET @m = @m * @newBaseLen
		SET @i = @i - 1
	END
	RETURN(@result)
END
GO
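
A quick sanity check that the two functions are inverses of each other, using the hexadecimal example from earlier:

SELECT
		Encrypted	= dbo.fn_BaseEncrypt(98765432, '0123456789ABCDEF')	-- 5E30A78
		,Decrypted	= dbo.fn_BaseDecrypt('5E30A78', '0123456789ABCDEF')	-- 98765432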

Now, clearly anyone can ‘decrypt’ our example strings, because the ‘alphabet’ is obvious in each case. But what if we choose a completely different set of characters, like A-Z and 0-9, randomised?

-- replace the VALUES(...):
VALUES
	 ('Random #1', '1L3V4U97EIZHKMPO0F6SJDTARGW8BC5XQN2Y', 123456789)
	,('Random #2', 'C9VGTJ34ZXHMKEWUA28DNP6FYRB', 123456789)
Description Alphabet                             Value       Result
----------- ------------------------------------ ----------- ------
Random #1   1L3V4U97EIZHKMPO0F6SJDTARGW8BC5XQN2Y 123456789   3L6VXI
Random #2   C9VGTJ34ZXHMKEWUA28DNP6FYRB          123456789   ZAZ3F8

For ‘Random #2’, I’ve removed the characters 0,O,I,1,7,L,5,S and Q. You might want to do this if the resulting string is likely to be read back to you over the phone.

Aside: quick way of generating a randomised alphabet, using R:

paste(sample(c(LETTERS, 0:9)),collapse="")
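
If you’d rather stay in T-SQL, something like the following sketch should also do the job (it assumes STRING_AGG, i.e. SQL Server 2017+, and relies on ordering by NEWID() for the shuffle):

;WITH cte_Chars AS
(
	SELECT
			ch	= SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', v.n, 1)
			,shuffle = NEWID()
		FROM (
			SELECT TOP (36) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
				FROM sys.all_objects
		) v
)
SELECT Alphabet = STRING_AGG(ch, '') WITHIN GROUP (ORDER BY shuffle)
	FROM cte_Chars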

So, if someone were to see a string of characters like ‘3L6VXI’, it would be meaningless to them; they don’t have the original alphabet needed to decode it.

However, keeping with the ‘Random #2’ alphabet, what if we look at encryptions of consecutive values:

Value       Result
----------- ------
10001       EDM
10002       EDK
10003       EDE
10004       EDW
10005       EDU
10006       EDA
10007       ED2
10008       ED8
10009       EDD
10010       EDN
10011       EDP
10012       ED6
10013       EDF
10014       EDY
10015       EDR
10016       EDB
10017       ENC
10018       EN9
10019       ENV
10020       ENG

Given enough examples of an encrypted ID, any competent ‘hacker’ is going to be able to deduce the generating alphabet. Instead of using sequential IDs, you could use some easily reversible function: 11 * {ID} + 13, for example. But we can get more creative than that! Here are some ways you could further obfuscate the string:

  • Reverse it
  • Rotate it by n characters to the left/right
  • Re-order the string entirely (with a fixed re-ordering)
  • Add spacing characters, e.g. a hyphen, at meaningless positions
  • Add random characters at pre-defined positions
  • Add one or more check digits – also consider SQL’s CHECKSUM() function.
  • Use more than one encryption/decryption alphabet – maybe part of the string indicates which alphabet to use?
  • Concatenate / interleave multiple encrypted strings – you might have one string for CustomerID, one that identifies the particular email template, and one for the link location within the email

Of course, you MUST check that your end-to-end encryption/decryption process works on all inputs you’re likely to put through it! Given that your PC can probably work through millions of encrypt/decrypt cycles per second, there’s no reason not to check that your functions will work without error in the future.
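
As a sketch of what that check might look like, here’s an exhaustive round-trip test over the first million IDs, using the ‘Random #2’ alphabet from above (scalar functions make it slow, but it only needs running once):

DECLARE @alphabet VARCHAR(255) = 'C9VGTJ34ZXHMKEWUA28DNP6FYRB'

;WITH cte_Numbers AS
(
	-- The first million candidate IDs
	SELECT TOP (1000000) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
		FROM sys.all_objects a
		CROSS JOIN sys.all_objects b
)
SELECT FailedRoundTrips = COUNT(1)	-- we want this to be zero
	FROM cte_Numbers
	WHERE dbo.fn_BaseDecrypt(dbo.fn_BaseEncrypt(n, @alphabet), @alphabet) <> n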


Interest and APR: a short explainer

I needed recently to check the monthly payment amount for a simple loan; choosing four online loan calculators at random, I got two slightly different values. In this PDF document Interest rates & APR (v0.1), I’ve detailed the differences (spoiler: it’s due to APR) and shown how the figures from the online calculators were arrived at. As ever, I hope someone finds it useful!

A plea: If you know of any book that goes into the various types of loan scheme in some depth, in a fairly formal and systematic way, please do let me know; it’s an area I find fascinating, yet there doesn’t seem to be a single definitive online resource with quite enough depth.

Finally, I must sing the praises of Wolfram Alpha, it’s been invaluable for my work for the past couple of years. Between Wolfram Alpha, wxMaxima and R, they’ve been able to solve any equation I’ve had cause to throw at them!


Using PowerShell to restore .bak files to Azure SQL

PowerShell

“This should be easy!”

The mini-project I’m working on this week requires us to restore third-party data from SQL Server database backup files (.bak, spoken as ‘dot back’, files) into our main Azure SQL instance — once per day, every day. Somewhat surprisingly, this isn’t straightforward: you can’t simply point one of the Azure tools at a .bak file and restore it to your cloud SQL Server instance.

If you don’t believe me, here are some examples of people asking the question and being disappointed:

We even used up some of our support credits to ask a proper expert – he gave a wry chuckle and said that this (the ability to restore .bak files) was the most frequent request in the Azure SQL community!


Therefore, we had to ‘roll our own’ process. With a new Windows box built especially for this project, we needed code to do the following sub-tasks:

  1. SFTP the compressed and encrypted files from the third parties to our box
  2. Decompress and unpack the .bak files
  3. Restore the .bak files to SQL Server Express (installed locally)
  4. Export (some of) the data to Azure

What was the best way forward? Well, it’s PowerShell, no? (Let me know if you think otherwise!)

Although I’ve got (too many years of) experience using old-school .BAT and VBScript files in Windows (which still work, but are decidedly ‘old hat’ now), I’d never really used PowerShell before; even so, it seemed to be absolutely the right ‘glue’ to get this project working. To keep things simple, I decided to just use Notepad++ to create/edit the scripts (which have a .ps1 file extension), no IDE or anything.

For each of the first three steps above, I’m just going to present some (hopefully useful) notes. The fourth step, exporting the data from SQL Server Express to Azure, wasn’t my bit; I believe the colleague it was assigned to used a client-side version of ADF (Azure Data Factory), which I don’t really know very well at all.


1. SFTP-ing

(SFTP is basically ‘secure FTP’ – that is, transferring a file over the internet, but securely. Officially, it looks like SFTP stands for ‘SSH File Transfer Protocol’.)

WinSCP

There’s a very popular Windows SFTP client WinSCP. The GUI is a wrapper around a powerful .NET assembly (in old money: a DLL), and we can use this at run-time in our PowerShell script. On my machine, the DLL resides here: C:\Program Files (x86)\WinSCP\WinSCPnet.dll

I pretty much copied the code from here: Help with SFTP Download Script

Now, one of the properties of the WinSCP object that we had to set in code was SshHostKeyFingerprint. I confess: I didn’t know what that was. This link About the SSH host key fingerprint says:

“Every SSH server is configured to use a host key to verify that the client is connecting to the correct host. The SSH server administrator provides the host key fingerprint to the various clients. The clients are expected to manually verify the host key while connecting to the server using any SSH client.”

Effectively, it’s just another piece of information to ensure that you’re connecting to the correct server.

Where do we get this key from? There’s a PowerShell function here: Get-SshFingerprint that does exactly what you want; you give it a domain name, and it returns the SSH host key fingerprint.

When I ran it on our third-party hosts, I got results like the following:

ssh-ed25519 255 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

or

ssh-rsa 2048 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

(where I’ve xxx-ed out the real values). So far, so good. One of our third parties uses old-school username/password connection details, which look something like this:

HostName		= "sftp.thirdpartyA.com"
UserName		= "pete"
Password		= "s00pers4f3p455w0rd"
SshHostKeyFingerprint	= "ssh-rsa 2048 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Given the SshHostKeyFingerprint, the SFTP transfer worked perfectly. Now for a minor complication: another third party gave us some slightly different details: a .ppk file and a ‘passphrase’. These worked fine when doing the SFTP transfer manually in FileZilla (that is, using a GUI). But it’s OK, because our WinSCP object has the requisite properties:

HostName		= "sftp.thirdpartyB.com"
UserName		= "pete"
SshPrivateKeyPath	= "very_private_key.ppk"
PrivateKeyPassphrase	= "v3rys3cr3tp455phr4s3"
SshHostKeyFingerprint	= "ssh-ed25519 255 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 

But when we ran this code, it errored out. We looked in the WinSCP log file:

"Unable to use key file "very_private_key.ppk" (OpenSSH SSH-2 private key (old PEM format))"

It turns out that the .ppk file we’d been given contained the key in an old format, and WinSCP required a newer format. You have to convert the ‘OpenSSH’ key to a ‘PuTTY’ key; helpfully, when you install WinSCP on Windows, you get a tool for doing the conversion, called puttygen.exe. On my PC, it’s at C:\Program Files (x86)\WinSCP\PuTTY\puttygen.exe. It’s a GUI, not command-line, but it’s very straightforward: you open the old .ppk file, then save it as a new version.


2. Decompressing and unpacking the .bak files

The popular archiving program 7-Zip has a companion PowerShell module (7Zip4PowerShell); see Stackoverflow unzip file using 7z in powershell. Then in the PowerShell script, it’s as simple as:

Expand-7Zip -ArchiveFileName $sourceFile -TargetPath $destinationFolder

Once the files were unzipped, we had to decrypt them using a command-line version of RedGate’s SQB2MTF utility. This was straightforward, the only thing to note was that you can run PowerShell scripts from inside other scripts by doing:

& $path_to_SQB2MTF_exe $parameters

where the ampersand is the ‘call operator’ — see What does the & symbol in powershell mean?


3. Restoring .bak files to SQL Server Express

There’s a native PowerShell command Restore-SqlDatabase that “Restores a database from a backup or transaction log records”. Perfect… except it didn’t work. The command tried to restore the database files to the exact same file locations they’d been backed up from, e.g. S:\DATA\ThirdPartyDB.MDF — this was never going to work, as our database drive set-up is wholly different.

It looks like there are workarounds for this, but they involve quite a lot of intimidating-looking code. Luckily, someone mentioned the module dbatools, which was a godsend. Once we’d installed it with:

Install-Module dbatools

the code in the script looks like this:

$Server		= "OURMACHINE\SQLEXPRESS"
$BAKFolder	= "D:\Location_Of_Our_BAK_files\"
$SQLDataFolder	= "C:\Program Files\Microsoft SQL Server\MSSQL15.SQLEXPRESS\MSSQL\DATA"

Restore-DbaDatabase -SqlInstance $Server -Path $BAKFolder -WithReplace -DestinationDataDirectory $SQLDataFolder

Basically one line of code! dbatools looks fantastically powerful, I’ll almost certainly be using it again for other projects.


Scheduling the scripts

The final piece of the puzzle was to get these scripts to run on a schedule (usually once every day, early in the morning), without a user being logged on to the machine. Unfortunately, SQL Server Express doesn’t come with (the otherwise indispensable) SQL Server Agent. Therefore, we have to use Windows’ inferior Task Scheduler. See e.g. How to Automate PowerShell Scripts with Task Scheduler to learn how to set up a new scheduled task. In order to get this to work, we needed the following details in the ‘Edit Action’ box:

Program/script: powershell.exe
Add arguments: -noprofile -executionpolicy bypass -file C:\OurScriptFolder\OurScript.ps1

(taken from Stackoverflow: Powershell script does not run via Scheduled Tasks)

Piece of cake, right? Yeah, that didn’t go too smoothly either — two things that it’s critical to take into account:

  1. Tasks are ‘run as’ a user of your choice. Pick the wrong user, and [all or maybe part of] your task doesn’t run….
  2. … and, unfortunately, Task Scheduler is terrible at feedback. From the GUI, it looks like your task ran successfully, when in fact, nothing happened. You have to do a lot of logging to a file to work out where your script stops working.

After most of a working day, and much tearing of hair, let me share the following with you:

  • The 7Zip4PowerShell module (which contains the Expand-7Zip command) won’t run under the SYSTEM account; we had to create a new user, and install the module (in PowerShell) while logged in as that account.
  • If it’s all part of the same task, this new user will need appropriate permissions to restore databases in SQL Server Express.
  • Just because the task runs a script that lives in a certain folder, that doesn’t mean the context is set to that folder. We had to put a Set-Location “C:\OurScriptFolder\” command at the top of the script that was called by the task.

End notes

In total, it took just over a day to pull all the code together; PowerShell is fairly easy to write and debug, and nearly everything I needed to do had already been done by someone else, so it was mostly a matter of googling. Really, the only original code I wrote was for keeping track of progress, logging results to files, etc.

Setting up the scheduled task was a pain, but with some patience (and a little help from a colleague), I got there in the end!


NULL doesn’t look like anything (to me)

A quick post to mention something that nearly caught me out recently. I was re-working some old code (author no longer at the company) and came across SQL similar in intent to the below: (I’ve made the code more exciting by using data about fancy cars, rather than phone systems!)

;WITH cte_Models AS
(
	SELECT ModelID, Make, ModelName
		FROM (
			VALUES
			 (1,'Lancia','Flavia')
			,(2,'Lancia','Gamma')
			,(3,'Lancia','Prisma')
			,(4,'Lotus','Elise')
			,(5,'Lotus','Evora')
			,(6,'Lotus','Exige')
			,(7,'Lotus','Elan')
			,(8,'Land Rover','Discovery')
			,(9,'Land Rover','Defender')
			,(10,'Lamborghini','Aventador')
			,(11,'Lamborghini','Huracan')
			,(12,'Lamborghini','Urus')
			,(13,'Lamborghini','Asterion')
			,(14,'Lamborghini','Sian')
			,(15,'Lamborghini','Estoque')
			,(16,'Chrysler','Conquest')
			,(17,'Chrysler','Galant')
			,(18,'Chrysler','Imperial')
			,(19,'Chrysler','Windsor')
		) x(ModelID, Make, ModelName)

), cte_Country AS
(
	SELECT
			Make, Country
		FROM (
			VALUES
				 ('Lancia', 'Italy')
				,('Lotus', 'United Kingdom')
				,('Land Rover', 'United Kingdom')
				,('Chrysler', 'United States')
				-- Note: No entry for 'Lamborghini' (which would be 'Italy')
		) x(Make, Country)
)
	SELECT
			m.ModelID
			,m.Make
			,m.ModelName
			,c.Country
		FROM cte_Models m
		LEFT JOIN cte_Country c
			ON c.Make = m.Make
		WHERE c.Country NOT LIKE 'Un%'
		ORDER BY m.ModelID

The code that gave me pause was the LEFT JOIN combined with a NOT LIKE on a field from that joined table in the WHERE clause. What does that give us?

Here’s what I expected to see:

ModelID     Make        ModelName Country
----------- ----------- --------- ----------
1           Lancia      Flavia    Italy
2           Lancia      Gamma     Italy
3           Lancia      Prisma    Italy
10          Lamborghini Aventador NULL
11          Lamborghini Huracan   NULL
12          Lamborghini Urus      NULL
13          Lamborghini Asterion  NULL
14          Lamborghini Sian      NULL
15          Lamborghini Estoque   NULL

But if you run the code, here’s what you actually get:

ModelID     Make        ModelName Country
----------- ----------- --------- -------
1           Lancia      Flavia    Italy
2           Lancia      Gamma     Italy
3           Lancia      Prisma    Italy

Where are the entries for ‘Lamborghini’?

I suspect many people will already know the answer: c.Country is NULL for Lamborghini (because there’s no match on Make), and NULL has its own rules when it comes to SQL comparison operators such as LIKE and NOT LIKE. In our code, the NOT LIKE comparison against NULL evaluates to UNKNOWN rather than TRUE, so the records are excluded.

To make it more explicit:

SELECT
	[LIKE]		= IIF(NULL     LIKE 'Un%', 1, 0)
	,[NOT LIKE]	= IIF(NULL NOT LIKE 'Un%', 1, 0)

LIKE        NOT LIKE
----------- ---------
0           0 

If we were comparing a non-NULL string, exactly one of those two flags would have to be a 1. But NULL isn’t a string: both comparisons evaluate to UNKNOWN, so both flags come back as 0, and you have to be careful! (See this stackoverflow post for more information.)

Here’s the extra line of SQL we’d need if we wanted our Lamborghini records returned:

...
LEFT JOIN cte_Country c
	ON c.Make = m.Make
WHERE c.Country NOT LIKE 'Un%'
OR c.Country IS NULL
ORDER BY m.ModelID

Sad note: I’m not entirely sure the original code took this into account, so it could very well be wrong, and therefore returning an incomplete recordset. But there are zero comments in the code, so I can’t tell whether the output is as-intended or not! 🙃


Re-creating complete records from audit data

Quite often, you’ll find that changes to data in a given table have been recorded in a separate table, with fields like this:

[field name], [timestamp], [old value], [new value]

Or maybe the data changes for ALL tables are recorded in the same place, in which case we’d have a column at the start telling us what object had been updated:

[object id], [field name], [timestamp], [old value], [new value]

Let’s say we have a Customer table, and the changes are recorded in a separate audit table. So if, say, the first name, date of birth, and postcode for a customer with ID 1234 were changed at the same time, we’d expect to see these records in our audit table:

CustomerID  FieldName   Timestamp         OldValue   NewValue
----------- ----------- ----------------- ---------- ----------
1234        FirstName   20200708 09:58:11 Rich       Richard
1234        DateOfBirth 20200708 09:58:11 1981-04-03 1981-04-02
1234        Postcode    20200708 09:58:11 SW1 1AA    SW1 2AA

It’s very common to want to re-format this data so that we get one row (resembling the original Customer record) per change, such that the row contains all the fields as they were at the time of the change. How do we do that? If we PIVOT just the new values, we get:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    1981-04-02  SW1 2AA

Looks ok. (We need to generate a record for the old values, but we’ll worry about that later.)
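(For reference, that PIVOT would look something like this; the audit table name dbo.CustomerAudit is hypothetical, but the columns match the illustration above:)

SELECT CustomerID, [Timestamp], FirstName, DateOfBirth, Postcode
	FROM (
		SELECT CustomerID, [Timestamp], FieldName, NewValue
			FROM dbo.CustomerAudit	-- hypothetical audit table
	) x
	PIVOT
	(
		MIN(NewValue) FOR FieldName IN ([FirstName],[DateOfBirth],[Postcode])
	) p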

But what if the timestamps weren’t all the same – i.e. the three changes didn’t happen at the same time – and we did the PIVOT?

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    NULL        NULL
1234        20200709 09:58:12 NULL       1981-04-02  NULL
1234        20200710 09:58:13 NULL       NULL        SW1 2AA

This isn’t what we want: those NULLs shouldn’t be there. The above only shows the new value for whichever field changed at each timestamp. What we really want is this:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    1981-04-03  SW1 1AA
1234        20200709 09:58:12 Richard    1981-04-02  SW1 1AA
1234        20200710 09:58:13 Richard    1981-04-02  SW1 2AA

which shows ALL the fields as they were at the given date/time. And maybe an initial record, prior to any changes:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20010101 00:00:00 Rich       1981-04-03  SW1 1AA 
1234        20200708 09:58:11 Richard    1981-04-03  SW1 1AA
1234        20200709 09:58:12 Richard    1981-04-02  SW1 1AA
1234        20200710 09:58:13 Richard    1981-04-02  SW1 2AA

Obviously, in the data as shown, we have no idea what the original timestamp was; we need to use something that makes sense in context, like the date the original Customer record was written for the first time. (I’ve only used 1st Jan 2001 here to make the substitution obvious.)

How can we achieve this?

Worked example

Let’s work through a solution, using example data. Our data consists of two people who have an id, a height, a credit score, and a postcode.

IF OBJECT_ID('tempdb.dbo.#Changes','U') IS NOT NULL
BEGIN
	DROP TABLE #Changes
END
GO

SELECT
		PersonID	= CAST(PersonID AS INT)
		,[Date]		= CAST([Date] AS DATE)
		,[Field]
		,[From]
		,[To]
	INTO #Changes
	FROM (
		VALUES
			 (101, '2010Jan01', 'Height', '160', '161')
			,(101, '2010Jan01', 'CreditScore', '541', '542')
			,(101, '2010Feb02', 'Height', '161', '162')
			,(101, '2010Feb03', 'CreditScore', '542', '538')
			,(101, '2011Mar03', 'Height', '162', '163')
			,(101, '2011Mar03', 'CreditScore', '538', '536')
			,(101, '2012Dec30', 'Postcode', 'W1A 1AA', 'SW1A 2AA')
			,(102, '2015Jul15', 'Postcode', 'SW1A 1AA', 'SE1 9TG')
			,(102, '2015Aug21', 'Postcode', 'SE1 9TG', 'SE1 9DT')
			,(102, '2016Apr04', 'CreditScore', '602', '606')
			,(102, '2017May05', 'Height', '150', '148')
	) x(PersonID, [Date], [Field], [From], [To])
GO

SELECT * FROM #Changes
GO
PersonID    Date       Field       From     To
----------- ---------- ----------- -------- --------
101         2010-01-01 Height      160      161
101         2010-01-01 CreditScore 541      542
101         2010-02-02 Height      161      162
101         2010-02-03 CreditScore 542      538
101         2011-03-03 Height      162      163
101         2011-03-03 CreditScore 538      536
101         2012-12-30 Postcode    W1A 1AA  SW1A 2AA
102         2015-07-15 Postcode    SW1A 1AA SE1 9TG
102         2015-08-21 Postcode    SE1 9TG  SE1 9DT
102         2016-04-04 CreditScore 602      606
102         2017-05-05 Height      150      148

Note that the From/To fields are strings; that’s the most practical way to record values for fields that could be of any datatype. (We’ll have to remember to convert the fields back to their proper datatypes when we’re done.)

How many records are we expecting in our final recordset?

SELECT PersonID, [Date]
FROM #Changes
GROUP BY PersonID, [Date]
PersonID    Date
----------- ----------
101         2010-01-01
101         2010-02-02
101         2010-02-03
101         2011-03-03
101         2012-12-30
102         2015-07-15
102         2015-08-21
102         2016-04-04
102         2017-05-05

For PersonID = 101, our final result set will have 5 rows, plus 1 for our initial record (prior to any changes), a total of 6 rows. For PersonID 102, we’ll have 5 (4+1).

First, let’s expand the data out so we have a row per change per field (plus our initial ‘pre-change’ row):

IF OBJECT_ID('tempdb.dbo.#Expanded','U') IS NOT NULL
BEGIN
	DROP TABLE #Expanded
END
GO

;WITH cte_AllChanges AS
(
	SELECT
			x.PersonID
			,x.[Date]
			,y.[Field]
			,c.[To] AS [Value]
		FROM (
			-- All the possible PersonIDs and Dates:
			SELECT PersonID, [Date]
				FROM #Changes
				GROUP BY PersonID, [Date]
		) x
		-- All the possible fields:
		CROSS JOIN (
			SELECT [Field]
				FROM #Changes
				GROUP BY [Field]
		) y
		-- Join on the values we know about:
		LEFT JOIN #Changes c
			ON c.PersonID = x.PersonID
			AND c.[Date] = x.[Date]
			AND c.[Field] = y.[Field]

), cte_InitialRows AS
(
	-- The 'initial' rows for each of our PersonIDs
	SELECT
			PersonID
			,'2001Jan01' AS [Date]
			,[Field]
			,[From] AS [Value]
			,rn = ROW_NUMBER() OVER (PARTITION BY PersonID, [Field] ORDER BY [Date])
		FROM #Changes
)
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
		INTO #Expanded
		FROM cte_AllChanges
	
	UNION ALL

	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
		FROM cte_InitialRows
		WHERE rn = 1
GO

SELECT * FROM #Expanded
ORDER BY PersonID, [Date], Field
GO
PersonID    Date       Field       Value
----------- ---------- ----------- --------
101         2001-01-01 CreditScore 541
101         2001-01-01 Height      160
101         2001-01-01 Postcode    W1A 1AA
101         2010-01-01 CreditScore 542
101         2010-01-01 Height      161
101         2010-01-01 Postcode    NULL
101         2010-02-02 CreditScore NULL
101         2010-02-02 Height      162
101         2010-02-02 Postcode    NULL
101         2010-02-03 CreditScore 538
101         2010-02-03 Height      NULL
101         2010-02-03 Postcode    NULL
101         2011-03-03 CreditScore 536
101         2011-03-03 Height      163
101         2011-03-03 Postcode    NULL
101         2012-12-30 CreditScore NULL
101         2012-12-30 Height      NULL
101         2012-12-30 Postcode    SW1A 2AA
102         2001-01-01 CreditScore 602
102         2001-01-01 Height      150
102         2001-01-01 Postcode    SW1A 1AA
102         2015-07-15 CreditScore NULL
102         2015-07-15 Height      NULL
102         2015-07-15 Postcode    SE1 9TG
102         2015-08-21 CreditScore NULL
102         2015-08-21 Height      NULL
102         2015-08-21 Postcode    SE1 9DT
102         2016-04-04 CreditScore 606
102         2016-04-04 Height      NULL
102         2016-04-04 Postcode    NULL
102         2017-05-05 CreditScore NULL
102         2017-05-05 Height      148
102         2017-05-05 Postcode    NULL

(33 rows affected)

Now for the clever bit. For each of the NULLs in the recordset above, we want to fill in its last known value; that is, whatever preceding value we had for that field that wasn’t NULL. On the face of it, this isn’t a trivial requirement: there’s no nice in-built window function to help us. Luckily, there’s a great blog post by Itzik Ben-Gan, The Last non NULL Puzzle, that gives us the answer. (I have it bookmarked, I use it a lot!)

Adapting Itzik Ben-Gan’s code from that website, we have:

IF OBJECT_ID('tempdb.dbo.#LastKnownNull','U') IS NOT NULL
BEGIN
	DROP TABLE #LastKnownNull
END
GO

;WITH cte_LastNonNull AS
(
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
			,grp = MAX(CASE WHEN [Value] IS NOT NULL THEN [Date] END)
					OVER (
						PARTITION BY PersonID, [Field]
						ORDER BY [Date]
						ROWS UNBOUNDED PRECEDING
					)
		FROM #Expanded
)
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
			,[Last non-NULL Value] = MAX([Value])
					OVER (
						PARTITION BY PersonID, [Field], grp
						ORDER BY [Date]
						ROWS UNBOUNDED PRECEDING
					)
	INTO #LastKnownNull
	FROM cte_LastNonNull
GO

SELECT * FROM #LastKnownNull
GO
PersonID    Date       Field       Value    Last non-NULL Value
----------- ---------- ----------- -------- -------------------
101         2001-01-01 CreditScore 541      541
101         2001-01-01 Height      160      160
101         2001-01-01 Postcode    W1A 1AA  W1A 1AA
101         2010-01-01 CreditScore 542      542
101         2010-01-01 Height      161      161
101         2010-01-01 Postcode    NULL     W1A 1AA
101         2010-02-02 CreditScore NULL     542
101         2010-02-02 Height      162      162
101         2010-02-02 Postcode    NULL     W1A 1AA
101         2010-02-03 CreditScore 538      538
101         2010-02-03 Height      NULL     162
101         2010-02-03 Postcode    NULL     W1A 1AA
101         2011-03-03 CreditScore 536      536
101         2011-03-03 Height      163      163
101         2011-03-03 Postcode    NULL     W1A 1AA
101         2012-12-30 CreditScore NULL     536
101         2012-12-30 Height      NULL     163
101         2012-12-30 Postcode    SW1A 2AA SW1A 2AA
102         2001-01-01 CreditScore 602      602
102         2001-01-01 Height      150      150
102         2001-01-01 Postcode    SW1A 1AA SW1A 1AA
102         2015-07-15 CreditScore NULL     602
102         2015-07-15 Height      NULL     150
102         2015-07-15 Postcode    SE1 9TG  SE1 9TG
102         2015-08-21 CreditScore NULL     602
102         2015-08-21 Height      NULL     150
102         2015-08-21 Postcode    SE1 9DT  SE1 9DT
102         2016-04-04 CreditScore 606      606
102         2016-04-04 Height      NULL     150
102         2016-04-04 Postcode    NULL     SE1 9DT
102         2017-05-05 CreditScore NULL     606
102         2017-05-05 Height      148      148
102         2017-05-05 Postcode    NULL     SE1 9DT

(33 rows affected)

All that’s left to do is PIVOT up these values into one row per PersonID/Date:

SELECT *
	FROM (
		SELECT
				PersonID, [Date], [Field], [Last non-NULL Value] AS [Value]
			FROM #LastKnownNull
	) x
	PIVOT
	(
		MIN([Value]) FOR [Field] IN ([CreditScore],[Height],[Postcode])
	) p
	ORDER BY PersonID, [Date]
GO
PersonID    Date       CreditScore Height   Postcode
----------- ---------- ----------- -------- --------
101         2001-01-01 541         160      W1A 1AA
101         2010-01-01 542         161      W1A 1AA
101         2010-02-02 542         162      W1A 1AA
101         2010-02-03 538         162      W1A 1AA
101         2011-03-03 536         163      W1A 1AA
101         2012-12-30 536         163      SW1A 2AA
102         2001-01-01 602         150      SW1A 1AA
102         2015-07-15 602         150      SE1 9TG
102         2015-08-21 602         150      SE1 9DT
102         2016-04-04 606         150      SE1 9DT
102         2017-05-05 606         148      SE1 9DT
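
(As flagged earlier, the From/To values are strings, so the pivoted columns are too. A final tidy-up might cast them back to sensible datatypes; the target types below are assumptions:)

SELECT
		p.PersonID
		,p.[Date]
		,CreditScore	= TRY_CAST(p.CreditScore AS INT)
		,Height		= TRY_CAST(p.Height AS INT)
		,Postcode	= p.Postcode	-- stays as a string
	FROM (
		SELECT PersonID, [Date], [Field], [Last non-NULL Value] AS [Value]
			FROM #LastKnownNull
	) x
	PIVOT
	(
		MIN([Value]) FOR [Field] IN ([CreditScore],[Height],[Postcode])
	) p
	ORDER BY p.PersonID, p.[Date]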

And we’re done: from a list of changes, we’ve recreated the record as it was at the time of each change. This will allow us to more easily write queries concerned with how our data looked at any given point in time.
