Turning integers into rational factorial expressions

Caution! Maths up ahead!

I have a long-term side project that involves dealing with rational factorial expressions that reduce to (often fairly large) integers, e.g.:

\frac{20! 18! 11!}{15! 12! 9!} = 2735405697024000

Typically, I’ll start with an integer (the result of a sum of combinatorial expressions), and need to turn it back into a purely factorial expression. My initial approach was to create a database of these integers: loop over all the possible permutations of factorials, and store the results. Here’s a snippet of the data:

(17!.17!.9!.7!)/(13!.7!.1!) = 7372584295195056537600000
(17!.17!.9!.7!)/(13!.7!.2!) = 3686292147597528268800000
(17!.17!.9!.7!)/(13!.7!.3!) = 1228764049199176089600000
(17!.17!.9!.7!)/(13!.7!.4!) = 307191012299794022400000
(17!.17!.9!.7!)/(13!.7!.5!) = 61438202459958804480000
(17!.17!.9!.7!)/(13!.7!.6!) = 10239700409993134080000

As you’d expect, this database got huge very quickly; this one particular formulation (a product of four factorials over a product of three factorials, using values up to 20) results in a table of 13 million rows, 643 MB in size. Although it works in principle, it’s not terribly practical.

Now, from our O-level Maths days, we know that every integer greater than 1 can be factorised into powers of primes, e.g.:

2735405697024000 = 2^{14} \times 3^{5} \times 5^{3} \times 7^{1} \times 11^{1} \times 13^{1} \times 17^{2} \times 19^{1}

The fact that every integer greater than 1 can be written as a product of primes in exactly one way (up to the order of the factors) is known as the Fundamental theorem of arithmetic, the first rigorous proof of which is due to Gauss. For the above integer, I calculated the exponents in R using the factorize function from the gmp package, which is used for manipulating big integers and rationals of any size.

table(as.integer(factorize(2735405697024000)))
 2  3  5  7 11 13 17 19 
14  5  3  1  1  1  2  1 

If we factorise our six factorials in the first identity similarly, and tabulate them nicely, we get:

            (power of)
N!      2  3  5  7 11 13 17 19 
------------------------------
20! =  18  8  4  2  1  1  1  1 
18! =  16  8  3  2  1  1  1  0
11! =   8  4  2  1  1  0  0  0
------------------------------
15! =  11  6  3  2  1  1  0  0
12! =  10  5  2  1  1  0  0  0
 9! =   7  4  1  1  0  0  0  0
------------------------------
ans.=  14  5  3  1  1  1  2  1

For each power, the sums of the columns in the top block (the numerator of the LHS of the identity), minus those in the bottom block (denominator), give the power in the answer, e.g. in the 2 column, (18+16+8) – (11+10+7) = 14, which is the power of 2 in the result.
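As a quick sanity check, the identity itself can be confirmed directly in R using gmp’s big-integer arithmetic (I’m assuming gmp’s factorialZ function here, which returns arbitrary-precision factorials):

require(gmp)
num <- factorialZ(20) * factorialZ(18) * factorialZ(11)
den <- factorialZ(15) * factorialZ(12) * factorialZ(9)
num / den   # Big Rational: 2735405697024000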

So if we want to turn an integer into a rational factorial expression, we just need a way of determining which linear combinations of the powers of our factorized factorials will give us the same powers as our target integer. For this, we need to turn our prime exponents into a matrix, and compute the Reduced row echelon form; as ever, there’s an existing R package that contains a function we can use!

Worked example

In order to demonstrate the process, we’ll keep the numbers small. Let’s say we have an integer, 120960, and we want to know if it can be turned into a rational factorial expression — it may not be possible.

120960 = 2^{7} \times 3^{3} \times 5^{1} \times 7^{1}
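We can confirm those exponents with the same factorize call as before:

table(as.integer(factorize(120960)))
 2  3  5  7 
 7  3  1  1 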

Here’s our table of factorial factors (which I’ve called factpp), and above it, the code I used to generate it:

require(gmp);    # gmp library (for factorize)
require(pracma); # pracma provides primes() — and rref(), which we'll need later
maxprime <- 7;
pm <- primes(maxprime); # 2,3,5,7
npm <- length(pm);
factpp <- matrix(rep(0,(maxprime-1)*npm),nrow=(maxprime-1),ncol=npm); # factorial prime powers
for(i in 1:(maxprime-1))
{
	tv <- table(as.integer(factorize(factorial(i+1))));
	for(j in 1:npm) {
		snpm <- sprintf("%d",pm[j]);
		if (snpm %in% names(tv)) {
			factpp[i,j] <- tv[snpm];
		}
	}
}

factpp
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    1    1    0    0
[3,]    3    1    0    0
[4,]    3    1    1    0
[5,]    4    2    1    0
[6,]    4    2    1    1

(Note how the exponents are non-decreasing – a proof of this can be found here: Prime Factors of Factorial Numbers)

My code is pretty horrible (matching by converting ints to strings, yuck), and it can definitely be hugely improved – but I only need it once; when I’ve generated the prime factors for all the factorials I’m interested in, I don’t need to run it again.
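For what it’s worth, here’s a rough sketch of one tidier alternative (untested against the original data — it counts the prime factors directly rather than matching on strings, and assumes gmp’s factorialZ):

require(gmp);    # factorize, factorialZ
require(pracma); # primes
maxprime <- 7;
pm <- primes(maxprime); # 2,3,5,7
factpp2 <- t(sapply(2:maxprime, function(n) {
	f <- as.integer(factorize(factorialZ(n))); # prime factors of n!, with multiplicity
	sapply(pm, function(p) sum(f == p));       # exponent of each prime
}));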

We can check that the exponents make sense:

apply(factpp, 1, function(v) { (2^v[1])*(3^v[2])*(5^v[3])*(7^v[4])});
[1]    2    6   24  120  720 5040

These are clearly the values of 2! up to 7!. (We can ignore 1! = 1, we don’t need it.) Because the highest prime factor in 120960 is 7, we don’t need to consider factorials that have prime factors greater than 7.

To compute our reduced row echelon form, we require the function rref in the package pracma (see here for documentation).

require(pracma);
input <- cbind(t(factpp), c(7,3,1,1)); # bind factorial exponents to target exponents 
rref(input);
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    2    0    0    0    2
[2,]    0    1    1    0    1    0    1
[3,]    0    0    0    1    1    0    0
[4,]    0    0    0    0    0    1    1

This matrix is actually giving us a system of simultaneous equations, one equation per row; the values in the first 6 columns are the coefficients of the variables, and the final column on the right is the value on the right-hand side of the equals sign. To interpret this matrix (i.e. the output of rref), this video Interpreting RREF Augmented Matrices by MATHguide is succinct and useful.

With our reduced row echelon form, there are three possibilities:

  • There is exactly one solution
  • There are no solutions
  • There are infinitely many solutions

Because we have 4 equations and 6 variables, we know that we don’t have a single solution. From the above MATHguide video, we can see that the output of rref indicates we have infinitely many solutions – but that’s ok, it’s not going to cause us a problem.

From the output, our simultaneous equations can be written as:

a + 2c = 2
b + c + e = 1
d + e = 0
f = 1

which we can reduce to the following result vector with two variables:

(2-2c, 1-c-e, c, -e, e, 1)

So, for any integer values of c and e, we will get a valid result. The values of this vector indicate the power of the factorial in our rational expression. Let’s look at some examples:

 c   e  vector                  resulting factorial expression
--  --  ----------------------  ----------------------------------
 0   0  (2,  1,  0,  0, 0, 1)   2!2!3!7!
 1   0  (0,  0,  1,  0, 0, 1)   4!7!
 0   1  (2,  0,  0, -1, 1, 1)   (2!2!6!7!) / 5!
 1   1  (0, -1,  1, -1, 1, 1)   (4!6!7!) / (3!5!)
 -2   2  (6,  1, -2, -2, 2, 1)   (2!2!2!2!2!2!3!6!6!7!) / (4!4!5!5!)

All of these are valid solutions (they all equal 120960), and we can choose whichever one best fits our needs.
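A quick check of a few of these in R:

factorial(4) * factorial(7)                                                 # 120960
factorial(2)^2 * factorial(6) * factorial(7) / factorial(5)                 # 120960
factorial(4) * factorial(6) * factorial(7) / (factorial(3) * factorial(5))  # 120960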

(Why do we have more than one solution? Because simple variations of factorial products have the same values, e.g. 2!2!3! = 4! and 6!7! = 10!. If we removed rows from factpp — e.g. we removed the first row, relating to 2! — then we’d get fewer possible results.)


Practical queries using binary operations

One of my ongoing ‘fun’ projects is a Sudoku-solver, written using only SQL. I’m not interested in brute-forcing a solution — there already exists a beautiful (and tiny) piece of SQL to do that: see Using Common Table Expressions to solve Sudoku in Microsoft SQL Server — rather, I’m trying to replicate the way I’d solve a Sudoku with a pen and paper.

Some parts of my code involve representing the combination of possible values of each empty Sudoku cell as a single integer, by using a single bit (‘binary digit’) for each of the values 1 to 9. This makes some of the coding much easier than it otherwise would be. Encoding binary information as integers, and then querying that binary data, isn’t something that’s widely done within SQL, but it’s occasionally the nicest* way of attacking a problem.

* For some value of ‘nice’, please argue amongst yourselves(!)


Very quick refresher: binary

(If you know about binary, you can skip this section.)

A binary representation of a number uses just the two digits zero and one. For a string of binary digits, the nth digit from the right represents the decimal value 2^(n-1) — where in standard mathematical notation, the caret character ‘^’ means ‘to the power of’. Annoyingly, as we’ll see shortly, ‘^’ means something different in SQL. Two to the power of (n-1) in SQL is calculated with the function POWER(2,n-1).

Example: 110 in binary is equivalent to 1 * POWER(2,2) + 1 * POWER(2,1) + 0 * POWER(2, 0) = 6 in decimal. Note that we didn’t need to consider the last part, 0 * POWER(2, 0), as zero times anything is zero. Accordingly, 1000001 is POWER(2,6) + POWER(2,0) = 64 + 1 = 65.

To manipulate binary numbers, there are standard ‘logical operators’ AND, OR, EOR that take two binary numbers as input and return a binary number as output; see Bitwise operation at Wikipedia for more detail. The operator NOT takes one input and simply ‘flips’ the bits such that 0 becomes 1, and 1 becomes 0. We don’t have to explicitly convert our numbers to binary to use these operators, as integers are already stored as binary numbers in memory.
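As a quick illustration of those operators in T-SQL (the column aliases are just labels):

SELECT
		 6 & 3	AS [AND]	-- 110 AND 011 = 010 = 2
		,6 | 3	AS [OR]		-- 110 OR  011 = 111 = 7
		,6 ^ 3	AS [EOR]	-- 110 EOR 011 = 101 = 5
		,~6	AS [NOT]	-- flips every bit of the INT, giving -7 (two's complement)
GO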

If you want to convert between binary (aka ‘base 2’) and decimal (‘base 10’), you can use the functions from my previous post, Encrypting IDs using a custom alphabet:

SELECT dbo.fn_BaseEncrypt(65, '01')
1000001
SELECT dbo.fn_BaseDecrypt('1000001', '01')
65


(Non-Sudoku) Example

We have 7 employees (identified by EmployeeID) that work various days of the week. We represent the data in two different datasets: in the first, the data is normalised, and each row contains the EmployeeID, and a Day value to represent which day that employee works: 1 = Sunday, 2 = Monday, …, 7 = Saturday; one row per day. In the second dataset, there’s only one row per employee, and a single integer indicates the combination of days: the 1st (right-most) bit is ‘1’ if they work Sunday, ‘0’ if not; the 2nd bit is ‘1’ if they work Monday, ‘0’ if not; etc, up to the 7th bit representing Saturday. This integer can take all values from 0 (no days worked) to 127 (all days worked).

Using our datasets, we want to answer some straightforward questions:

  1. Which employees work on Wednesday?
  2. Which employees work on Monday and Wednesday?
  3. Which employees work on Wednesday but not Monday?
  4. Which employees work opposite days to each other, such that between them they work the whole week, but neither works on the same day?
  5. Which employees work 5 or more days in a week?

The way in which these questions are answered using the two different datasets (normalised and binary) will demonstrate how using binary data can sometimes lead to simpler code.


Normalised data

Here’s our normalised data, one record per EmployeeID / Day:

DROP TABLE IF EXISTS #DaysWorked
GO

CREATE TABLE #DaysWorked
(
	EmployeeID INT NOT NULL
	,[Day] TINYINT NOT NULL
	,PRIMARY KEY CLUSTERED (EmployeeID, [Day])
)
GO

INSERT #DaysWorked(EmployeeID, [Day])
	VALUES
	 (10,2),(10,3),(10,4),(10,5),(10,6)		-- Mon to Fri
	,(20,2),(20,3),(20,4),(20,5)			-- Mon to Thu
	,(30,6),(30,7),(30,1)				-- Fri, Sat, Sun
	,(40,3),(40,4),(40,5)				-- Tue, Wed, Thu
	,(50,2),(50,3),(50,4),(50,5),(50,6),(50,7)	-- Mon to Sat
	,(60,3),(60,4),(60,5),(60,6),(60,7)		-- Tue to Sat
	,(70,1)						-- Sun only
GO

Binary data

Here’s the same data, but using a single integer per employee, DayBits, to represent the collection of days worked:

DROP TABLE IF EXISTS #DaysWorkedBinary
GO

CREATE TABLE #DaysWorkedBinary
(
	EmployeeID INT NOT NULL
	,DayBits TINYINT NOT NULL
	,PRIMARY KEY CLUSTERED (EmployeeID)
	,INDEX IX_DaysWorkedBinary_DayBits (DayBits)
)
GO

INSERT #DaysWorkedBinary(EmployeeID, DayBits)
	VALUES		-- SFTWTMS
	 (10, 62)	-- 0111110
	,(20, 30)	-- 0011110
	,(30, 97)	-- 1100001
	,(40, 28)	-- 0011100
	,(50,126)	-- 1111110
	,(60,124)	-- 1111100
	,(70,  1)	-- 0000001
GO

As you can see from the comments alongside the VALUES, each day corresponds to a 0/1 bit, where the values of those bits are the powers of 2: from the right, 1, 2, 4, 8, 16, 32 and 64.

For the sake of completeness — just how did we calculate the values of the DayBits field? Like this:

SELECT
		d.EmployeeID
		,DayBits = SUM(POWER(2, d.[Day]-1))
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	ORDER BY d.EmployeeID
GO

To go back the other way, turning the binary data into normalised, one row per employee/day:

SELECT
		d.EmployeeID
		,[Day] = v.n
	FROM #DaysWorkedBinary d
	JOIN (
		VALUES(1),(2),(3),(4),(5),(6),(7)
	) v(n)
		ON d.DayBits & POWER(2, v.n-1) > 0 
	ORDER BY d.EmployeeID, [Day]
GO

So, we have our two datasets (normalised data: #DaysWorked, and binary data: #DaysWorkedBinary), and we can start to answer our five questions.


Questions 1, 2 and 3

Normalised version

Answering question 1, “Which employees work on Wednesday?” is trivial:

SELECT EmployeeID
	FROM #DaysWorked
	WHERE [Day] = 4
	ORDER BY EmployeeID
GO
EmployeeID
-----------
10
20
40
50
60

For question 2, “Which employees work on Monday and Wednesday?”, we need to work a little harder:

-- 2a
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	JOIN #DaysWorked d2
		ON d2.EmployeeID = d1.EmployeeID
	WHERE d1.[Day] = 4 
	AND d2.[Day] = 2
	ORDER BY d1.EmployeeID
GO
EmployeeID
-----------
10
20
50

We can write this query in other ways:

-- 2b
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	WHERE d1.[Day] = 4 
	AND d1.EmployeeID IN (SELECT d2.EmployeeID FROM #DaysWorked d2 WHERE d2.[Day] = 2)
	ORDER BY d1.EmployeeID
-- 2c
SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING SUM(IIF(d.[Day] IN (2,4),1,0)) = 2
	ORDER BY d.EmployeeID
-- 2d
SELECT EmployeeID
	FROM #DaysWorked
	PIVOT ( COUNT([Day]) FOR [Day] IN ([2],[4])) p
	WHERE [2] = 1 AND [4] = 1
	ORDER BY EmployeeID

All four queries return the same result set. Note that a and b have identical execution plans; the plans for c and d aren’t identical, but they’re very similar to each other.

Question 3, “Which employees work on Wednesday but not Monday?”, is similar in construction to question 2:

-- 3a
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	LEFT JOIN #DaysWorked d2
		ON d2.EmployeeID = d1.EmployeeID
		AND d2.[Day] = 2
	WHERE d1.[Day] = 4 
	AND d2.EmployeeID IS NULL
	ORDER BY d1.EmployeeID
-- 3b
SELECT d1.EmployeeID
	FROM #DaysWorked d1
	WHERE d1.[Day] = 4 
	AND d1.EmployeeID NOT IN (SELECT d2.EmployeeID FROM #DaysWorked d2 WHERE d2.[Day] = 2)
	ORDER BY d1.EmployeeID
-- 3c
SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING SUM(IIF(d.[Day]=4,1,0)) = 1
	AND SUM(IIF(d.[Day]=2,1,0)) = 0
	ORDER BY d.EmployeeID
-- 3d
SELECT EmployeeID
	FROM #DaysWorked
	PIVOT ( COUNT([Day]) FOR [Day] IN ([2],[4])) p
	WHERE [2] = 0 AND [4] = 1
	ORDER BY EmployeeID

Here, the execution plans for a and b are very similar, but not identical.

Binary version

Now we’ll answer the same questions, but using our binary dataset.

Question 1:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 8 = 8
	ORDER BY d.EmployeeID
GO
EmployeeID
-----------
10
20
40
50
60

The same results as before; the only real structural difference is in the WHERE clause. We’re using ‘&’ (AND) to pull out the records that match. The 8 is the 4th bit from the right (1,2,4,8), i.e. Wednesday. Our query returns those records for which the DayBits integer has the 4th bit ‘set’ (or ‘switched on’).

Question 2:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 10 = 10
	ORDER BY d.EmployeeID

Same premise as before, but we’re checking whether two bits (the 2nd and 4th, corresponding to Monday and Wednesday) are set.

Question 3:

SELECT d.EmployeeID
	FROM #DaysWorkedBinary d
	WHERE d.DayBits & 10 = 8 
	ORDER BY d.EmployeeID

Similar structure, but here we only want those records where the 4th bit is set (equal to 1), and the 2nd is unset (equal to zero).


Question 4

This is a trickier question: “Which pairs of employees work opposite days to each other, such that between them they work the whole week, but neither works on the same day?”.

Here’s my solution:

Normalised version

;WITH cte_CrossJoin AS
(
	SELECT
			EmployeeID1	= d1.EmployeeID 
			,EmployeeID2	= d2.EmployeeID
			,Emp1Days	= COUNT(DISTINCT d1.[Day])
			,Emp2Days	= COUNT(DISTINCT d2.[Day])
			,[Matched]	= SUM(IIF(d2.[Day] = d1.[Day], 1, 0))
		FROM #DaysWorked d1
		CROSS JOIN #DaysWorked d2
		WHERE d2.EmployeeID > d1.EmployeeID
		GROUP BY d1.EmployeeID , d2.EmployeeID
)
	SELECT EmployeeID1, EmployeeID2
		FROM cte_CrossJoin
		WHERE Emp1Days + Emp2Days = 7
		AND [Matched] = 0
GO
EmployeeID1 EmployeeID2
----------- -----------
20          30
50          70

(I have to admit, I went through several different attempts — this is number 4 or 5. It’s reasonably compact, but due to the CROSS JOIN, it’s more ‘brute-force’ than I’d like. If you have a more refined solution, please let me know!)

Binary version

The binary version is simpler:

SELECT
		Employee1 = d1.EmployeeID
		,Employee2 = d2.EmployeeID
	FROM #DaysWorkedBinary d1
	JOIN #DaysWorkedBinary d2
		ON d2.EmployeeID > d1.EmployeeID
	WHERE d1.DayBits ^ d2.DayBits = 127
GO

In SQL, the ‘^’ (caret) character means ‘EOR’, ‘exclusive OR’ (sometimes written as ‘XOR’). For the EOR of two DayBits values to equal 127, each day’s bit has to be set in exactly one of the two values — so between them, the pair cover every day of the week, but never work the same day as each other.
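For example, taking employees 20 and 30 from the table above:

SELECT 30 ^ 97	-- 0011110 EOR 1100001 = 1111111 = 127
GO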


Question 5

The final question is “Which employees work 5 or more days in a week?”. This is trivial using the normalised data; and given that, so far, all the queries using the binary data are smaller, you’d expect that to be the case here too.

Normalised version

Very straightforward, we use a HAVING clause to restrict the number of records (in our case, employee/days) to greater than or equal to 5:

SELECT d.EmployeeID
	FROM #DaysWorked d
	GROUP BY d.EmployeeID
	HAVING COUNT(1) >= 5
	ORDER BY d.EmployeeID
GO
EmployeeID
-----------
10
50
60

Binary version

Sadly, the binary version isn’t particularly neat. This is one of the limitations of using binary data: there’s no native function to count how many bits are ‘on’ (set) in a given integer. (If there was, it would again be trivial.) We have to do the counting ourselves; first, a longhand version:

;WITH cte_BitCount AS
(
	SELECT
			EmployeeID
			,BitCount =	  IIF(DayBits&1>0,1,0)  + IIF(DayBits&2>0,1,0)
					+ IIF(DayBits&4>0,1,0)  + IIF(DayBits&8>0,1,0)
					+ IIF(DayBits&16>0,1,0) + IIF(DayBits&32>0,1,0)
					+ IIF(DayBits&64>0,1,0) 
		FROM #DaysWorkedBinary
)
SELECT EmployeeID
	FROM cte_BitCount
	WHERE BitCount >= 5
	ORDER BY EmployeeID

GO

And here’s a more extensible way of doing the same calculation:

;WITH cte_BitCount AS
(
	SELECT
			d.EmployeeID
			,BitCount = SUM(SIGN(d.DayBits & POWER(2,v.n-1)))
		FROM #DaysWorkedBinary d
		CROSS APPLY (
			VALUES(1),(2),(3),(4),(5),(6),(7)
		) v(n)
		GROUP BY d.EmployeeID
)
SELECT EmployeeID
	FROM cte_BitCount
	WHERE BitCount >= 5
	ORDER BY EmployeeID
GO	

but it’s still not as concise as the query over the normalised data.

There’s plenty of further reading available online on counting bits (also known as the ‘population count’).

And finally

I hope I’ve demonstrated that there are times when using binary / bit-wise logic makes your code more elegant and compact. It’s definitely useful in cases where there are fixed numbers of discrete ‘flags’ — units of time like days and months being good examples, as they’re never going to change — but you have to be careful not to implement these patterns without some consideration. For one, they may not be particularly friendly to colleagues who don’t have a maths/CS background. You can always swap between normalised/binary ‘on the fly’ (as per the queries near the top that convert one dataset into the other) if you need to.

There’s some more to be written about aggregating over binary data, as there are no in-built SQL functions for this; hopefully that’ll be a future post.


Encrypting IDs using a custom alphabet

Let’s say you want to track link clicks in a marketing email you’re sending out. The best way to track these is to put a GUID in all the links, so every click is uniquely identifiable. In order for this to work, you’ll need to have a record of all the GUIDs, stored alongside the destination email address (and/or customer ID). But if you’re sending many emails to millions of customers, that’s a lot of GUIDs to store.

If you’re not too worried about people fiddling with the links, you could encrypt an ID (e.g. CustomerID), and put that in the link instead of a GUID; that means you wouldn’t have to keep any record in your database, you could just decrypt the ID string next time you saw it. So let’s look at a way in which we can encrypt IDs… but, just to be clear, this is quite weak encryption, you shouldn’t use it for anything important. Only consider it if the worst thing that can happen as a result of someone cracking your code is mild inconvenience!

One way to encrypt an integer ID is to change its base, and use a randomised alphabet. Here’s the function I’m going to use:

CREATE OR ALTER FUNCTION dbo.fn_BaseEncrypt
(
	@base10 INT
	,@alphabet VARCHAR(255)
)
RETURNS VARCHAR(255)
AS
BEGIN

	DECLARE @newBaseLen INT = LEN(@alphabet)

	DECLARE @remaining INT, @result VARCHAR(255) = ''

	SET @remaining = @base10

	WHILE (@remaining > 0)
	BEGIN
		SET @result = SUBSTRING(@alphabet, 1 + (@remaining % @newBaseLen), 1) + @result
		SET @remaining = @remaining / @newBaseLen
	END

	RETURN(@result)

END
GO

(NB: None of the code presented here is bullet-proof, production-ready code. For clarity, I’ve left out things like checking the input parameters.)

It’s reasonably simple: there’s a loop that repeatedly modulos (that’s the ‘%’ character) and divides the starting number, prepending the relevant indexed character from our alphabet each time.

Some simple examples:

SELECT
		[Description]
		,Alphabet
		,[Value]
		,Result = dbo.fn_BaseEncrypt([Value], Alphabet)
	FROM 
	(
		VALUES
			('Binary', '01', 252)
			,('Trinary', '012', 253)
			,('Octal', '01234567', 254)
			,('Base 9', '012345678', 1234567)
			,('Hexadecimal', '0123456789ABCDEF', 98765432)
	) x([Description], Alphabet, [Value])
GO
Description Alphabet         Value       Result
----------- ---------------- ----------- ---------
Binary      01               252         11111100
Trinary     012              253         100101
Octal       01234567         254         376
Base 9      012345678        1234567     2281451
Hexadecimal 0123456789ABCDEF 98765432    5E30A78

Don’t take my word for it that these values are correct! For the hexadecimal one, you can type '98765432 in hex' straight into your address bar in Google Chrome. For the others, there are sites like ExtraConversion and RapidTables.

Here’s the accompanying decryption function:

CREATE OR ALTER FUNCTION dbo.fn_BaseDecrypt
(
	@strValue VARCHAR(255)
	,@alphabet VARCHAR(255)
)
RETURNS INT
AS
BEGIN
	DECLARE @newBaseLen INT = LEN(@alphabet)
	
	DECLARE @result INT = 0, @m INT = 1, @i INT, @l INT = LEN(@strValue)
	DECLARE @revStrValue VARCHAR(255) = REVERSE(@strValue)
	SET @i = @l
	WHILE(@i > 0)
	BEGIN
		SET @result = @result + @m * (CHARINDEX(SUBSTRING(@revstrValue, @l-@i+1, 1) , @alphabet) - 1)
		SET @m = @m * @newBaseLen
		SET @i = @i - 1
	END
	RETURN(@result)
END
GO

Now, clearly anyone can ‘decrypt’ our example strings, because the ‘alphabet’ is obvious in each case. But what if we choose a completely different set of characters, like A-Z and 0-9, randomised?

-- replace the VALUES(...):
VALUES
	 ('Random #1', '1L3V4U97EIZHKMPO0F6SJDTARGW8BC5XQN2Y', 123456789)
	,('Random #2', 'C9VGTJ34ZXHMKEWUA28DNP6FYRB', 123456789)
Description Alphabet                             Value       Result
----------- ------------------------------------ ----------- ------
Random #1   1L3V4U97EIZHKMPO0F6SJDTARGW8BC5XQN2Y 123456789   3L6VXI
Random #2   C9VGTJ34ZXHMKEWUA28DNP6FYRB          123456789   ZAZ3F8

For ‘Random #2’, I’ve removed the characters 0,O,I,1,7,L,5,S and Q. You might want to do this if the resulting string is likely to be read back to you over the phone.

Aside: quick way of generating a randomised alphabet, using R:

paste(sample(c(LETTERS, 0:9)),collapse="")

So, if someone was to see a string of characters like ‘3L6VXI’, it would be meaningless to them; they don’t have the original alphabet in order to decode it.

However, keeping with the ‘Random #2’ alphabet, what if we look at encryptions of consecutive values:

Value       Result
----------- ------
10001       EDM
10002       EDK
10003       EDE
10004       EDW
10005       EDU
10006       EDA
10007       ED2
10008       ED8
10009       EDD
10010       EDN
10011       EDP
10012       ED6
10013       EDF
10014       EDY
10015       EDR
10016       EDB
10017       ENC
10018       EN9
10019       ENV
10020       ENG

Given enough examples of an encrypted ID, any competent ‘hacker’ is going to be able to deduce the generating alphabet. Instead of using sequential IDs, you could use some easily reversible function: 11 * {ID} + 13, for example. But we can get more creative than that! Here are some ways you could further obfuscate the string:

  • Reverse it
  • Rotate it by n characters to the left/right
  • Re-order the string entirely (with a fixed re-ordering)
  • Add spacing characters, e.g. a hyphen, at meaningless positions
  • Add random characters at pre-defined positions
  • Add one or more check digits – also consider SQL’s CHECKSUM() function.
  • Use more than one encryption/decryption alphabet – maybe part of the string indicates which alphabet to use?
  • Concatenate / interleave multiple encrypted strings – you might have one string for CustomerID, one that identifies the particular email template, and one for the link location within the email
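As an example of combining the first two ideas — purely a sketch, using the ‘Random #2’ alphabet from above — reverse the encrypted string, then rotate it two characters to the left:

DECLARE @enc VARCHAR(255) = dbo.fn_BaseEncrypt(123456789, 'C9VGTJ34ZXHMKEWUA28DNP6FYRB')	-- 'ZAZ3F8'
DECLARE @rev VARCHAR(255) = REVERSE(@enc)						-- '8F3ZAZ'
SELECT Obfuscated = STUFF(@rev, 1, 2, '') + LEFT(@rev, 2)				-- '3ZAZ8F'
GO

To decrypt, you’d undo these steps in reverse order before calling dbo.fn_BaseDecrypt.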

Of course, you MUST check that your end-to-end encryption/decryption process works on all inputs you’re likely to put through it! Given that your PC can probably work through millions of encrypt/decrypt cycles per second, there’s no reason not to check that your functions will work without error in the future.


Interest and APR: a short explainer

I recently needed to check the monthly payment amount for a simple loan; choosing four online loan calculators at random, I got two slightly different values. In this PDF document Interest rates & APR (v0.1), I’ve detailed the differences (spoiler: it’s due to APR) and shown how the figures from the online calculators were arrived at. As ever, I hope someone finds it useful!

A plea: If you know of any book that goes into the various types of loan scheme in some depth, in a fairly formal and systematic way, please do let me know; it’s an area I find fascinating, yet there doesn’t seem to be a single definitive online resource with quite enough depth.

Finally, I must sing the praises of Wolfram Alpha; it’s been invaluable for my work for the past couple of years. Between them, Wolfram Alpha, wxMaxima and R have been able to solve any equation I’ve had cause to throw at them!


Using PowerShell to restore .bak files to Azure SQL


“This should be easy!”

The mini-project I’m working on this week requires us to restore third-party data from SQL Server database backup files (.bak, spoken as ‘dot back’, files) into our main Azure SQL instance — once per day, every day. Somewhat surprisingly, this isn’t straightforward: you can’t simply point one of the Azure tools at a .bak file and restore it to your cloud SQL Server instance.

If you don’t believe me, there are plenty of examples online of people asking the question and being disappointed.

We even used up some of our support credits to ask a proper expert – he gave a wry chuckle and said that this (the ability to restore .bak files) was the most frequent request in the Azure SQL community!


Therefore, we had to ‘roll our own’ process. With a new Windows box built especially for this project, we needed code to do the following sub-tasks:

  1. SFTP the compressed and encrypted files from the third parties to our box
  2. Decompress and unpack the .bak files
  3. Restore the .bak files to SQL Server Express (installed locally)
  4. Export (some of) the data to Azure

What was the best way forward? Well, it’s PowerShell, no? (Let me know if you think otherwise!)

Although I’ve got (too many years of) experience using old-school .BAT and VBScript files in Windows (which still work, but are decidedly ‘old hat’ now), I’ve never really used PowerShell before, but it seemed to be absolutely the right ‘glue’ to get this project working. To keep things simple, I decided to just use Notepad++ to create/edit the scripts (which have a .ps1 file extension), no IDE or anything.

For each of the first three steps above, I’m just going to present some, hopefully useful, notes. The fourth step, exporting the data from SQL Server Express to Azure — I didn’t do that bit; I believe the colleague who was assigned used a client-side version of ADF. (I don’t really know ADF very well at all.)


1. SFTP-ing

(SFTP is basically ‘secure FTP’ – that is, transferring a file over the internet, but securely. Officially, it looks like SFTP stands for ‘SSH File Transfer Protocol’.)

WinSCP

There’s a very popular Windows SFTP client WinSCP. The GUI is a wrapper around a powerful .NET assembly (in old money: a DLL), and we can use this at run-time in our PowerShell script. On my machine, the DLL resides here: C:\Program Files (x86)\WinSCP\WinSCPnet.dll

I pretty much copied the code from here: Help with SFTP Download Script

Now, one of the properties of the WinSCP object that we had to set in code was SshHostKeyFingerprint. I confess: I didn’t know what that was. This link About the SSH host key fingerprint says:

“Every SSH server is configured to use a host key to verify that the client is connecting to the correct host. The SSH server administrator provides the host key fingerprint to the various clients. The clients are expected to manually verify the host key while connecting to the server using any SSH client.”

Effectively, it’s just another piece of information to ensure that you’re connecting to the correct server.

Where do we get this key from? There’s a PowerShell function here: Get-SshFingerprint that does exactly what you want; you give it a domain name, and it returns the SSH host key fingerprint.

When I ran it on our third-party hosts, I got results like the following:

ssh-ed25519 255 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

or

ssh-rsa 2048 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

(where I’ve xxx-ed out the real values). So far, so good. One of our third parties uses old-school username/password connection details that look something like this:

HostName		= "sftp.thirdpartyA.com"
UserName		= "pete"
Password		= "s00pers4f3p455w0rd"
SshHostKeyFingerprint	= "ssh-rsa 2048 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Given the SshHostKeyFingerprint, the SFTP transfer worked perfectly. Now for a minor complication: another third party gave us some slightly different details: a .ppk file and a ‘passphrase’. These worked fine when doing the SFTP transfer manually in FileZilla (that is, using a GUI). But it’s ok, because our WinSCP object has the requisite properties, SshPrivateKeyPath and PrivateKeyPassphrase:

HostName		= "sftp.thirdpartyB.com"
UserName		= "pete"
SshPrivateKeyPath	= "very_private_key.ppk"
PrivateKeyPassphrase	= "v3rys3cr3tp455phr4s3"
SshHostKeyFingerprint	= "ssh-ed25519 255 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 

But when we ran this code, it errored out. We looked in the WinSCP log file:

"Unable to use key file "very_private_key.ppk" (OpenSSH SSH-2 private key (old PEM format))"

It turns out that the .ppk file we’d been given contained the key in an old format, and WinSCP required a newer format. You have to convert the ‘OpenSSH’ key to a ‘PuTTY’ key; helpfully, when you install WinSCP on Windows, you get a tool for doing the conversion, called puttygen.exe. On my PC, it’s at C:\Program Files (x86)\WinSCP\PuTTY\puttygen.exe. It’s a GUI, not command-line, but it’s very straightforward: you open the old .ppk file, then save it as a new version.


2. Decompressing and unpacking the .bak files

The popular zipping program 7zip has its own PowerShell module; see Stackoverflow unzip file using 7z in powershell. Then in the PowerShell script, it’s as simple as:

Expand-7Zip -ArchiveFileName $sourceFile -TargetPath $destinationFolder

Once the files were unzipped, we had to decrypt them using a command-line version of RedGate’s SQB2MTF utility. This was straightforward, the only thing to note was that you can run PowerShell scripts from inside other scripts by doing:

& $path_to_SQB2MTF_exe $parameters

where the ampersand is the ‘call operator’ — see What does the & symbol in powershell mean?


3. Restoring .bak files to SQL Server Express

There’s a native PowerShell command Restore-SqlDatabase that “Restores a database from a backup or transaction log records”. Perfect… except it didn’t work. The command tried to restore the database files to the exact same file locations they’d been backed up from, e.g. S:\DATA\ThirdPartyDB.MDF — this wouldn’t ever work, our database drive set-up is wholly different.

It looks like there are workarounds to this, but it involves quite a lot of intimidating-looking code. Luckily, someone mentioned the module dbatools, which was a godsend. Once we’d installed it with:

Install-Module dbatools

, the code in the script looks like this:

$Server		= "OURMACHINE\SQLEXPRESS"
$BAKFolder	= "D:\Location_Of_Our_BAK_files\"
$SQLDataFolder	= "C:\Program Files\Microsoft SQL Server\MSSQL15.SQLEXPRESS\MSSQL\DATA"

Restore-DbaDatabase -SqlInstance $Server -Path $BAKFolder -WithReplace -DestinationDataDirectory $SQLDataFolder

Basically one line of code! dbatools looks fantastically powerful; I’ll almost certainly be using it again for other projects.


Scheduling the scripts

The final piece of the puzzle was to get these scripts to run on a schedule (usually once every day, early in the morning), without a user being logged on to the machine. Unfortunately, SQL Server Express doesn’t come with (the otherwise indispensable) SQL Server Agent. Therefore, we have to use Windows’ inferior Task Scheduler. See e.g. How to Automate PowerShell Scripts with Task Scheduler to learn how to set up a new scheduled task. In order to get this to work, we needed the following details in the ‘Edit Action’ box:

Program/script: powershell.exe
Add arguments: -noprofile -executionpolicy bypass -file C:\OurScriptFolder\OurScript.ps1

(taken from Stackoverflow: Powershell script does not run via Scheduled Tasks)

Piece of cake, right? Yeah, that didn’t go too smoothly either — two things that it’s critical to take into account:

  1. Tasks are ‘run as’ a user of your choice. Pick the wrong user, and [all or maybe part of] your task doesn’t run….
  2. … and, unfortunately, Task Scheduler is terrible at feedback. From the GUI, it looks like your task ran successfully, when in fact, nothing happened. You have to do a lot of logging to a file to work out where your script stops working.

After most of a working day, and much tearing of hair, let me share the following with you:

  • The 7Zip4PowerShell module (which contains the Expand-7Zip command) won’t run under the SYSTEM account; we had to create a new user, and install the module (in PowerShell) while logged in as that account.
  • If it’s all part of the same task, this new user will need appropriate permissions to restore databases in SQL Server Express.
  • Just because the task runs a script that lives in a certain folder, that doesn’t mean the context is set to that folder. We had to put a Set-Location “C:\OurScriptFolder\” command at the top of the script that was called by the task.

End notes

In total, it took just over a day to pull all the code together; PowerShell is fairly easy to write and debug, and nearly everything I needed to do, someone else had already done it, so it was only a matter of googling. Really, the only original code I wrote was to do with keeping track of progress, logging results to files, etc.

Setting up the scheduled task was a pain, but with some patience (and a little help from a colleague), I got there in the end!


NULL doesn’t look like anything (to me)

A quick post to mention something that nearly caught me out recently. I was re-working some old code (author no longer at the company) and came across SQL similar in intent to the below: (I’ve made the code more exciting by using data about fancy cars, rather than phone systems!)

;WITH cte_Models AS
(
	SELECT ModelID, Make, ModelName
		FROM (
			VALUES
			 (1,'Lancia','Flavia')
			,(2,'Lancia','Gamma')
			,(3,'Lancia','Prisma')
			,(4,'Lotus','Elise')
			,(5,'Lotus','Evora')
			,(6,'Lotus','Exige')
			,(7,'Lotus','Elan')
			,(8,'Land Rover','Discovery')
			,(9,'Land Rover','Defender')
			,(10,'Lamborghini','Aventador')
			,(11,'Lamborghini','Huracan')
			,(12,'Lamborghini','Urus')
			,(13,'Lamborghini','Asterion')
			,(14,'Lamborghini','Sian')
			,(15,'Lamborghini','Estoque')
			,(16,'Chrysler','Conquest')
			,(17,'Chrysler','Galant')
			,(18,'Chrysler','Imperial')
			,(19,'Chrysler','Windsor')
		) x(ModelID, Make, ModelName)

), cte_Country AS
(
	SELECT
			Make, Country
		FROM (
			VALUES
				 ('Lancia', 'Italy')
				,('Lotus', 'United Kingdom')
				,('Land Rover', 'United Kingdom')
				,('Chrysler', 'United States')
				-- Note: No entry for 'Lamborghini' (which would be 'Italy')
		) x(Make, Country)
)
	SELECT
			m.ModelID
			,m.Make
			,m.ModelName
			,c.Country
		FROM cte_Models m
		LEFT JOIN cte_Country c
			ON c.Make = m.Make
		WHERE c.Country NOT LIKE 'Un%'
		ORDER BY m.ModelID

The code that gave me pause was the LEFT JOIN, with a NOT LIKE on a field from the joined table in the WHERE clause. What does that give us?

Here’s what I expected to see:

ModelID     Make        ModelName Country
----------- ----------- --------- ----------
1           Lancia      Flavia    Italy
2           Lancia      Gamma     Italy
3           Lancia      Prisma    Italy
10          Lamborghini Aventador NULL
11          Lamborghini Huracan   NULL
12          Lamborghini Urus      NULL
13          Lamborghini Asterion  NULL
14          Lamborghini Sian      NULL
15          Lamborghini Estoque   NULL

But if you run the code, here’s what you actually get:

ModelID     Make        ModelName Country
----------- ----------- --------- -------
1           Lancia      Flavia    Italy
2           Lancia      Gamma     Italy
3           Lancia      Prisma    Italy

Where are the entries for ‘Lamborghini’?

I suspect many people will already know the answer: c.Country is NULL for Lamborghini (because there’s no match on Make), and NULL has its own rules when it comes to SQL comparisons such as LIKE and NOT LIKE. In our code, the NOT LIKE comparison against NULL evaluates to UNKNOWN (not true), and so the records are excluded.

To make it more explicit:

SELECT
	[LIKE]		= IIF(NULL     LIKE 'Un%', 1, 0)
	,[NOT LIKE]	= IIF(NULL NOT LIKE 'Un%', 1, 0)

LIKE        NOT LIKE
----------- ---------
0           0 

If we were comparing a non-NULL string, then it would have to result in one of those bits being a one. But NULL isn’t a string, and you have to be careful! (See this stackoverflow post for more information.)

Here’s the extra line of SQL we’d need if we wanted our Lamborghini records returned:

...
LEFT JOIN cte_Country c
	ON c.Make = m.Make
WHERE c.Country NOT LIKE 'Un%'
OR c.Country IS NULL
ORDER BY m.ModelID

Sad note: I’m not entirely sure the original code took this into account, so it could very well be wrong, and therefore returning an incomplete recordset. But there are zero comments in the code, so I can’t tell whether the output is as-intended or not! 🙃


Re-creating complete records from audit data

Quite often, you’ll find that changes to data in a given table have been recorded in a separate table, with fields like this:

[field name], [timestamp], [old value], [new value]

Or maybe the data changes for ALL tables are recorded in the same place, in which case we’d have a column at the start telling us what object had been updated:

[object id], [field name], [timestamp], [old value], [new value]

Let’s say we have a Customer table, and the changes are recorded in a separate audit table. So if, say, the first name, date of birth, and postcode for a customer with ID 1234 were changed at the same time, we’d expect to see these records in our audit table:

CustomerID  FieldName   Timestamp         OldValue   NewValue
----------- ----------- ----------------- ---------- ----------
1234        FirstName   20200708 09:58:11 Rich       Richard
1234        DateOfBirth 20200708 09:58:11 1981-04-03 1981-04-02
1234        Postcode    20200708 09:58:11 SW1 1AA    SW1 2AA

It’s very common to want to re-format this data so that we get one row (resembling the original Customer record) per change, such that the row contains all the fields as they were at the time of the change. How do we do that? If we PIVOT just the new values, we get:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    1981-04-02  SW1 2AA

Looks ok. (We need to generate a record for the old values, but we’ll worry about that later.)

But what if the timestamps weren’t all the same – i.e. the three changes didn’t happen at the same time – and we did the PIVOT?

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    NULL        NULL
1234        20200709 09:58:12 NULL       1981-04-02  NULL
1234        20200710 09:58:13 NULL       NULL        SW1 2AA

This isn’t what we want, those NULLs shouldn’t be there. The above is just showing the new values for the field that’s changed. What we really want is this:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20200708 09:58:11 Richard    1981-04-03  SW1 1AA
1234        20200709 09:58:12 Richard    1981-04-02  SW1 1AA
1234        20200710 09:58:13 Richard    1981-04-02  SW1 2AA

which shows ALL the fields as they were at the given date/time. And maybe an initial record, prior to any changes:

CustomerID  Timestamp         FirstName  DateOfBirth Postcode
----------- ----------------- ---------- ----------- ----------
1234        20010101 00:00:00 Rich       1981-04-03  SW1 1AA 
1234        20200708 09:58:11 Richard    1981-04-03  SW1 1AA
1234        20200709 09:58:12 Richard    1981-04-02  SW1 1AA
1234        20200710 09:58:13 Richard    1981-04-02  SW1 2AA

Obviously, in the data as shown, we have no idea what the original timestamp was; we need to use something that makes sense in context, like the date the original Customer record was written for the first time. (I’ve only used 1st Jan 2001 here to make the substitution obvious.)

How can we achieve this?

Worked example

Let’s work through a solution, using example data. Our data consists of two people who have an id, a height, a credit score, and a postcode.

IF OBJECT_ID('tempdb.dbo.#Changes','U') IS NOT NULL
BEGIN
	DROP TABLE #Changes
END
GO

SELECT
		PersonID	= CAST(PersonID AS INT)
		,[Date]		= CAST([Date] AS DATE)
		,[Field]
		,[From]
		,[To]
	INTO #Changes
	FROM (
		VALUES
			 (101, '2010Jan01', 'Height', '160', '161')
			,(101, '2010Jan01', 'CreditScore', '541', '542')
			,(101, '2010Feb02', 'Height', '161', '162')
			,(101, '2010Feb03', 'CreditScore', '542', '538')
			,(101, '2011Mar03', 'Height', '162', '163')
			,(101, '2011Mar03', 'CreditScore', '538', '536')
			,(101, '2012Dec30', 'Postcode', 'W1A 1AA', 'SW1A 2AA')
			,(102, '2015Jul15', 'Postcode', 'SW1A 1AA', 'SE1 9TG')
			,(102, '2015Aug21', 'Postcode', 'SE1 9TG', 'SE1 9DT')
			,(102, '2016Apr04', 'CreditScore', '602', '606')
			,(102, '2017May05', 'Height', '150', '148')
	) x(PersonID, [Date], [Field], [From], [To])
GO

SELECT * FROM #Changes
GO
PersonID    Date       Field       From     To
----------- ---------- ----------- -------- --------
101         2010-01-01 Height      160      161
101         2010-01-01 CreditScore 541      542
101         2010-02-02 Height      161      162
101         2010-02-03 CreditScore 542      538
101         2011-03-03 Height      162      163
101         2011-03-03 CreditScore 538      536
101         2012-12-30 Postcode    W1A 1AA  SW1A 2AA
102         2015-07-15 Postcode    SW1A 1AA SE1 9TG
102         2015-08-21 Postcode    SE1 9TG  SE1 9DT
102         2016-04-04 CreditScore 602      606
102         2017-05-05 Height      150      148

Note that the From/To fields are strings; that’s the only way to record this data for fields that could be of any type. (We’ll have to remember to convert the fields back to their proper datatypes when we’re done.)

How many records are we expecting in our final recordset?

SELECT PersonID, [Date]
FROM #Changes
GROUP BY PersonID, [Date]
PersonID    Date
----------- ----------
101         2010-01-01
101         2010-02-02
101         2010-02-03
101         2011-03-03
101         2012-12-30
102         2015-07-15
102         2015-08-21
102         2016-04-04
102         2017-05-05

For PersonID = 101, our final result set will have 5 rows, plus 1 for our initial record (prior to any changes), a total of 6 rows. For PersonID = 102, we’ll have 5 (4+1).

First, let’s expand the data out so we have a row per change per field (plus our initial ‘pre-change’ row):

IF OBJECT_ID('tempdb.dbo.#Expanded','U') IS NOT NULL
BEGIN
	DROP TABLE #Expanded
END
GO

;WITH cte_AllChanges AS
(
	SELECT
			x.PersonID
			,x.[Date]
			,y.[Field]
			,c.[To] AS [Value]
		FROM (
			-- All the possible PersonIDs and Dates:
			SELECT PersonID, [Date]
				FROM #Changes
				GROUP BY PersonID, [Date]
		) x
		-- All the possible fields:
		CROSS JOIN (
			SELECT [Field]
				FROM #Changes
				GROUP BY [Field]
		) y
		-- Join on the values we know about:
		LEFT JOIN #Changes c
			ON c.PersonID = x.PersonID
			AND c.[Date] = x.[Date]
			AND c.[Field] = y.[Field]

), cte_InitialRows AS
(
	-- The 'initial' rows for each of our PersonIDs
	SELECT
			PersonID
			,'2001Jan01' AS [Date]
			,[Field]
			,[From] AS [Value]
			,rn = ROW_NUMBER() OVER (PARTITION BY PersonID, [Field] ORDER BY [Date])
		FROM #Changes
)
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
		INTO #Expanded
		FROM cte_AllChanges
	
	UNION ALL

	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
		FROM cte_InitialRows
		WHERE rn = 1
GO

SELECT * FROM #Expanded
ORDER BY PersonID, [Date], Field
GO
PersonID    Date       Field       Value
----------- ---------- ----------- --------
101         2001-01-01 CreditScore 541
101         2001-01-01 Height      160
101         2001-01-01 Postcode    W1A 1AA
101         2010-01-01 CreditScore 542
101         2010-01-01 Height      161
101         2010-01-01 Postcode    NULL
101         2010-02-02 CreditScore NULL
101         2010-02-02 Height      162
101         2010-02-02 Postcode    NULL
101         2010-02-03 CreditScore 538
101         2010-02-03 Height      NULL
101         2010-02-03 Postcode    NULL
101         2011-03-03 CreditScore 536
101         2011-03-03 Height      163
101         2011-03-03 Postcode    NULL
101         2012-12-30 CreditScore NULL
101         2012-12-30 Height      NULL
101         2012-12-30 Postcode    SW1A 2AA
102         2001-01-01 CreditScore 602
102         2001-01-01 Height      150
102         2001-01-01 Postcode    SW1A 1AA
102         2015-07-15 CreditScore NULL
102         2015-07-15 Height      NULL
102         2015-07-15 Postcode    SE1 9TG
102         2015-08-21 CreditScore NULL
102         2015-08-21 Height      NULL
102         2015-08-21 Postcode    SE1 9DT
102         2016-04-04 CreditScore 606
102         2016-04-04 Height      NULL
102         2016-04-04 Postcode    NULL
102         2017-05-05 CreditScore NULL
102         2017-05-05 Height      148
102         2017-05-05 Postcode    NULL

(33 rows affected)

Now for the clever bit. For each of the NULLs in the recordset above, we want to fill in its last known value; that is, whatever preceding value we had for that field that wasn’t a NULL. On the face of it, it’s not a trivial requirement; there’s no nice in-built window function to help us. Luckily, there’s a great blog post by Itzik Ben-Gan: The Last non NULL Puzzle that gives us the answer. (I have it book-marked, I use it a lot!)

Adapting Itzik Ben-Gan’s code from that website, we have:

IF OBJECT_ID('tempdb.dbo.#LastKnownNull','U') IS NOT NULL
BEGIN
	DROP TABLE #LastKnownNull
END
GO

;WITH cte_LastNonNull AS
(
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
			,grp = MAX(CASE WHEN [Value] IS NOT NULL THEN [Date] END)
					OVER (
						PARTITION BY PersonID, [Field]
						ORDER BY [Date]
						ROWS UNBOUNDED PRECEDING
					)
		FROM #Expanded
)
	SELECT
			PersonID
			,[Date]
			,[Field]
			,[Value]
			,[Last non-NULL Value] = MAX([Value])
					OVER (
						PARTITION BY PersonID, [Field], grp
						ORDER BY [Date]
						ROWS UNBOUNDED PRECEDING
					)
	INTO #LastKnownNull
	FROM cte_LastNonNull
GO

SELECT * FROM #LastKnownNull
GO
PersonID    Date       Field       Value    Last non-NULL Value
----------- ---------- ----------- -------- -------------------
101         2001-01-01 CreditScore 541      541
101         2001-01-01 Height      160      160
101         2001-01-01 Postcode    W1A 1AA  W1A 1AA
101         2010-01-01 CreditScore 542      542
101         2010-01-01 Height      161      161
101         2010-01-01 Postcode    NULL     W1A 1AA
101         2010-02-02 CreditScore NULL     542
101         2010-02-02 Height      162      162
101         2010-02-02 Postcode    NULL     W1A 1AA
101         2010-02-03 CreditScore 538      538
101         2010-02-03 Height      NULL     162
101         2010-02-03 Postcode    NULL     W1A 1AA
101         2011-03-03 CreditScore 536      536
101         2011-03-03 Height      163      163
101         2011-03-03 Postcode    NULL     W1A 1AA
101         2012-12-30 CreditScore NULL     536
101         2012-12-30 Height      NULL     163
101         2012-12-30 Postcode    SW1A 2AA SW1A 2AA
102         2001-01-01 CreditScore 602      602
102         2001-01-01 Height      150      150
102         2001-01-01 Postcode    SW1A 1AA SW1A 1AA
102         2015-07-15 CreditScore NULL     602
102         2015-07-15 Height      NULL     150
102         2015-07-15 Postcode    SE1 9TG  SE1 9TG
102         2015-08-21 CreditScore NULL     602
102         2015-08-21 Height      NULL     150
102         2015-08-21 Postcode    SE1 9DT  SE1 9DT
102         2016-04-04 CreditScore 606      606
102         2016-04-04 Height      NULL     150
102         2016-04-04 Postcode    NULL     SE1 9DT
102         2017-05-05 CreditScore NULL     606
102         2017-05-05 Height      148      148
102         2017-05-05 Postcode    NULL     SE1 9DT

(33 rows affected)

All that’s left to do is PIVOT up these values into one row per PersonID/Date:

SELECT *
	FROM (
		SELECT
				PersonID, [Date], [Field], [Last non-NULL Value] AS [Value]
			FROM #LastKnownNull
	) x
	PIVOT
	(
		MIN([Value]) FOR [Field] IN ([CreditScore],[Height],[Postcode])
	) p
	ORDER BY PersonID, [Date]
GO
PersonID    Date       CreditScore Height   Postcode
----------- ---------- ----------- -------- --------
101         2001-01-01 541         160      W1A 1AA
101         2010-01-01 542         161      W1A 1AA
101         2010-02-02 542         162      W1A 1AA
101         2010-02-03 538         162      W1A 1AA
101         2011-03-03 536         163      W1A 1AA
101         2012-12-30 536         163      SW1A 2AA
102         2001-01-01 602         150      SW1A 1AA
102         2015-07-15 602         150      SE1 9TG
102         2015-08-21 602         150      SE1 9DT
102         2016-04-04 606         150      SE1 9DT
102         2017-05-05 606         148      SE1 9DT

And we’re done: from a list of changes, we’ve recreated the record as it was at the time of each change. This will allow us to more easily write queries concerned with how our data looked at any given point in time.
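One loose end from earlier: the pivoted values are still strings (because [From] and [To] were strings), so the very last step would be to convert each column back to its proper datatype — something like this sketch:

SELECT
		PersonID
		,[Date]
		,CreditScore	= CAST(CreditScore AS INT)
		,Height		= CAST(Height AS INT)
		,Postcode	-- already a string
	FROM (
		SELECT
				PersonID, [Date], [Field], [Last non-NULL Value] AS [Value]
			FROM #LastKnownNull
	) x
	PIVOT
	(
		MIN([Value]) FOR [Field] IN ([CreditScore],[Height],[Postcode])
	) p
	ORDER BY PersonID, [Date]
GO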


Regression: The analyst’s penknife

At work, I build regression models all the time; usually logistic regression, where the target variable is binary (0/1 or no/yes), e.g.:

  • Sales and marketing models (predicting which potential leads will convert)
  • Credit risk accept/reject scorecards (if we lend this person money, will they pay us back?)
  • Collections models (which of our customers is most likely to default in the next month?)
  • Fraud models (is this customer who they say they are?)

, but occasionally linear regression models too, where the target is a number (e.g. number of sales). The modelling process is basically the same between linear and logistic regression, it’s just that the model output requires a different interpretation.

These models are all probabilistic – that is, there is a component of randomness in the model that means there can never be a mathematical formula that produces the true output for any given inputs. But regression can help us do more than that. I want to mention a couple of times where I’ve used linear regression to uncover deterministic equations; that is, an exact and precise mathematical description that links the inputs to the output.

1. Decrypting a secret number (sort of)

A company presented questionnaires to people, where each question had multiple-choice answers. For various reasons, the answers were required to be secret within the database – an average developer was able to see the data, but they weren’t supposed to be able to link answers to a person. (This wasn’t ‘top secret’ data; the secrecy was more for marketing purposes than anything.)

The database tables looked something like this (I’ve only shown the important columns and keys):

CREATE TABLE dbo.Person
(
	PersonID INT NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE dbo.Questionnaire
(
	QuestionnaireID INT NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE dbo.Questionnaire_Completed
(
	QuestionnaireID INT NOT NULL
	,PersonID INT NOT NULL
	,PRIMARY KEY CLUSTERED (QuestionnaireID, PersonID)
)

CREATE TABLE dbo.Question
(
	QuestionID INT NOT NULL PRIMARY KEY CLUSTERED
	,QuestionnaireID INT NOT NULL
)

CREATE TABLE dbo.Answer
(
	AnswerID INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
	,QuestionID INT NOT NULL
	,AnswerKey INT NOT NULL		-- mysterious!
)

For each answer, we knew what question and therefore questionnaire it related to, so we could do:

SELECT
		q.QuestionID
		,a.AnswerID
		,COUNT(1) AS N
	FROM dbo.Answer a
	JOIN dbo.Question q
		ON q.QuestionID = a.QuestionID
	WHERE q.QuestionnaireID = 123
	GROUP BY q.QuestionID, a.AnswerID

to get the results for a particular questionnaire. This whole process was built and administered by a single person. One day, that person was unavailable, and there was an urgent problem relating to a particular questionnaire; for legitimate reasons, a higher-up needed to know the answers a person had given.

Now, if the table Questionnaire_Completed had a field showing when the entry was inserted (a default Create date, see this post, for instance), then matching the data would’ve been trivial: the order in which the questionnaires were done would’ve almost certainly matched between the Questionnaire_Completed and Answer tables. However, we didn’t have that date; and it was probably not recorded for this very reason, to prevent correlating the pieces of information that would instantly reveal who answered what.

What we did have was the mysterious field AnswerKey in the Answer table. It wasn’t immediately obvious what it represented; it was just an integer that appeared random:

1763546779
-1375398177
406260249
-241234845

, so what was it? Clearly, it was very likely related to the PersonID (otherwise why obfuscate it?), but scrambled in a way to make it purposely difficult to join back. What to do? Well, linear regression tells you how numbers relate to other numbers, statistically, so that seemed a good place to start.

I created a rectangular dataset of what we had: QuestionnaireID, QuestionID, AnswerID, and AnswerKey (several thousand rows), and imported the data into R. Using linear regression (the glm function), I pretty quickly worked out that AnswerKey was of the form:

AnswerKey = k + a * QuestionnaireID + b * QuestionID + {SOMETHING ELSE}

How did I know this? Easy: the constant k, and parameters a and b were suspiciously close to being whole numbers, something that’s unlikely to happen by chance.
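As a rough sketch of the kind of call involved — the data frame and column names here are purely illustrative, not the real ones:

fit <- glm(AnswerKey ~ QuestionnaireID + QuestionID, data = answers)  # 'answers': one row per Answer record
round(coef(fit), 3)  # suspiciously-near-integer coefficients suggest an exact underlying formula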

But what was the {SOMETHING ELSE}? It didn’t take very long to figure out that it was a simple linear function of PersonID and a fixed integer, one for each questionnaire. Given that we knew which PersonIDs had completed which questionnaire, we had a reduced set of possibilities — only certain values made sense.

The final formula for AnswerKey was:

k + a * QuestionnaireID + b * QuestionID + c * {Questionnaire Secret Number} + d * PersonID

, hence we could work out the PersonID for every record in the Answer table because we knew all the other pieces of information.
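To give a flavour of the approach (the coefficients below are made up, and I’ve left out the per-questionnaire secret number for simplicity), here’s the kind of check you can do in R: simulate a key that’s a whole-number combination of the IDs, fit a linear model, and notice that the estimated coefficients land suspiciously close to integers.

# Hypothetical illustration: simulate an AnswerKey-like column from made-up
# whole-number coefficients, then see whether glm recovers them.
set.seed(42)
n <- 5000
dat <- data.frame(
  QuestionnaireID = sample(100:150, n, replace = TRUE),
  QuestionID      = sample(1000:2000, n, replace = TRUE),
  PersonID        = sample(1:20000, n, replace = TRUE)
)
# made-up 'secret' formula: k + a*QuestionnaireID + b*QuestionID + d*PersonID
dat$AnswerKey <- 12345 + 1009 * dat$QuestionnaireID +
                    37 * dat$QuestionID + 523 * dat$PersonID

fit <- glm(AnswerKey ~ QuestionnaireID + QuestionID + PersonID, data = dat)
round(coef(fit))  # the estimates come back as (near-)exact whole numbers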

Overall, it probably took less than an hour to ‘crack’ the formula for the AnswerKey. As I said at the beginning, this wasn’t information that needed industrial-strength security, so the method used to hide it wasn’t inappropriate. However, this kind of obfuscation isn’t something anyone should consider for genuinely sensitive information, e.g. bank details. For that sort of data, SQL Server provides some far better tools: see here. (Or even better: don’t store the data yourself, get a third-party company to do it!)

2. Payments on loans with existing charges

NB: I’m working on a post and document which explains this piece of work in much greater detail – watch this space!

Let’s say a customer takes out a loan for £500 over 3 months, and the interest rate is 1% per day (a very high interest rate, for illustrative purposes only!). If the months are January–March 2019, then there are 31 days in month 1, 28 days in month 2, and 31 days in month 3. How much should the customer pay off each month if (a) the loan is to be completely paid off after the 3rd payment, and (b) the payments are equal in amount?

This isn’t a very straightforward calculation, as the lengths of the months are unequal, and the outstanding principal balance (OPB) that accrues interest decreases each month. The answer doesn’t really matter here – but it’s £275.49 for the first 2 payments, and £275.47 for the final payment (2 pence less, due to rounding). In total, they will pay back £826.45, of which £326.45 is interest(!)
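For the curious, here’s one way of setting up this basic (no-charges) case; it’s an illustration that’s consistent with the figures above, not necessarily the exact product rules. Assume interest accrues daily on the outstanding principal and is settled at each payment; with principal P, daily rate r, month lengths d_1, d_2, d_3 and equal payment X, we need:

\Big( \big( P(1 + r d_1) - X \big)(1 + r d_2) - X \Big)(1 + r d_3) - X = 0

This is linear in X; plugging in P = 500, r = 0.01 and (d_1, d_2, d_3) = (31, 28, 31) gives X \approx 275.49, matching the payment above.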

Now, I inherited some code that calculated these payment amounts, but it was very messy, and there was even a separate function for each number of months. Starting from first principles, I worked out a single equation that took the number of months as a parameter. This worked fine for calculating initial payment schedules, i.e. those for new customers.

However, the business also needed an equation for when payments were being ‘rescheduled’, and there were charges present – e.g. our customer had paid off some of the loan, but had incurred a late fee (£10 say), and was now asking to spread the repayments over 4 months instead of 3. This is where it gets trickier: charges don’t (legally, can’t) accrue interest, so our original equation no longer works. And it was far from clear how to adapt what we had to include this separate balance that didn’t accrue interest.

It’s actually possible to numerically calculate the payments for any schedule: all you need is code that steps through the process of accrual and payment. You put your guess in at the top of the code, and see if the balance is zero at the end. If not, adjust the guess slightly and run it through again – keep going until you get a balance of zero. This is standard stuff; see e.g. Newton’s method for an optimised way of getting to the answer faster.
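Here’s a minimal sketch of that idea in R (not the production code), assuming simple daily interest on the outstanding principal, settled at each payment:

# Step through accrual and payment for a guessed instalment, and solve for
# the guess that leaves a zero balance. The interest model is an assumption
# for illustration only.
balance_after <- function(payment, principal, daily_rate, month_days) {
  opb <- principal
  for (d in month_days) {
    opb <- opb + opb * daily_rate * d   # accrue this month's interest
    opb <- opb - payment                # then take the payment
  }
  opb
}

# find the payment that zeroes the balance (Brent's method, via uniroot)
pay <- uniroot(balance_after, interval = c(0, 1000),
               principal = 500, daily_rate = 0.01,
               month_days = c(31, 28, 31))$root
round(pay, 2)   # ~275.49, matching the example above

(I’ve used uniroot here for brevity; a hand-rolled Newton or bisection loop does the same job.)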

So for any input we care to put in, we can actually get the correct answer. Doesn’t this automatically solve our problem? Why do we need the precise equations? Three reasons: (1) an equation will give us exact answers, and we’ll understand the precision of the answers we’re getting; (2) an equation will (well, should) take fewer steps to work through, hence less server processing; (3) the satisfaction of working it out!

At this point, I could generate the output (monthly payment) for any input (loan amount, interest rate, number of periods + their lengths, outstanding charges balance), but I just didn’t know how they were related exactly.

I randomly generated values for tens of thousands of inputs, calculated the output using the iterative procedure described above, and made a rectangular dataset. In R, as before, I used glm to start fitting models that helped me identify the relationship between the input and output values; in fact, I came up with code to auto-generate thousands of model formulas, and let the computer do even more of the work for me. For instance, if we have inputs A, B, C, D and an output E, then we can let the computer come up with new formulas at random:

E ~ A + D + (1/B)
E ~ A + B + (C/D) + ((A-B)/(C+D))
E ~ (C-B+A)/(D+A) + (B*C*D)/(A-3*D)
E ~ (A-3*B-C^2)/(sqrt(D+3)-sqrt(A*B))

etc. (The algorithm was a bit smarter than this, I’m just giving a flavour here.) It took a few evenings of trial and error playing with the code, but eventually this approach worked, and I zeroed in on the exact equations I needed. Unfortunately, it turns out there’s no single equation, but rather a separate one for each number of periods – however, they’re very ‘constructable’, and the resulting function in code is very straightforward. I didn’t yet understand how to derive these equations from first principles; that was the next step – but I knew what I was aiming for, which made that task much easier.
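As a flavour of the formula-generation step (the operator pool, variable names and structure here are illustrative only, not the actual code used):

# Build random model formulas over inputs A..D; wrapping each term in I()
# keeps the arithmetic literal inside an R model formula.
random_formula <- function(vars, n_terms = 3) {
  ops <- c("+", "-", "*", "/")
  term <- function() {
    v <- sample(vars, 2)
    paste0("I(", v[1], " ", sample(ops, 1), " ", v[2], ")")
  }
  as.formula(paste("E ~", paste(replicate(n_terms, term()), collapse = " + ")))
}

set.seed(1)
f <- random_formula(c("A", "B", "C", "D"))
f
# fit <- glm(f, data = dat)  # 'dat' being the table of generated inputs/outputs,
#                            # keeping fits whose coefficients look 'clean'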


Aggregating all the categorical data in a table

One of the measures used to compare datasets for similarity is the Population Stability Index (PSI). (See here or here for details on how to calculate it.) It’s used for comparing, e.g., one financial quarter with another, or a set of training data to a set of test data. You generate the PSI value for each variable that exists in both datasets; very simply, a low value implies the two distributions are similar, and a high value implies they differ. If the PSI values for every variable are below a certain threshold, then you can infer the two datasets follow the same overall distribution, and that models built over one dataset will work satisfactorily on the other.

We often need to calculate the PSI value for all categorical variables in a dataset; in SQL, this translates to finding the COUNTs for all the possible values in a column, for all columns, where the column is usually of datatype (N)VARCHAR.

Our example dataset, stored as a table MyCategories in SQL Server, has 10,000 rows and 6 columns: one integer primary key column (ID), and five VARCHAR columns:

SELECT TOP 100 * FROM dbo.MyCategories
ID          Cat1       Cat2       Cat3       Cat4       Cat5
----------- ---------- ---------- ---------- ---------- ----------
1           Alan       Emma       South      Start      Up
2           Bobby      Donna      North      Middle     Down
3           Bobby      Belinda    North      NULL       NULL
4           David      Donna      NULL       Start      NULL
5           Charlie    Anna       West       Start      Up

To perform our PSI calculation across all variables at once, we require a dataset consisting of all the different category values, and their counts. Naively, we could just do this:

SELECT Category = 1, Cat1 AS [Value], COUNT(1) AS N FROM dbo.MyCategories GROUP BY Cat1
UNION ALL SELECT Category = 2, Cat2, COUNT(1) FROM dbo.MyCategories GROUP BY Cat2
UNION ALL SELECT Category = 3, Cat3, COUNT(1) FROM dbo.MyCategories GROUP BY Cat3
UNION ALL SELECT Category = 4, Cat4, COUNT(1) FROM dbo.MyCategories GROUP BY Cat4
UNION ALL SELECT Category = 5, Cat5, COUNT(1) FROM dbo.MyCategories GROUP BY Cat5

which gives us the required result set:

Category Value      N
-------- ---------- -----
1        NULL       832
1        Alan       1650
1        Bobby      1644
1        Charlie    1604
1        David      1612
1        Eric       1779
1        Frank      879
2        NULL       1003
2        Anna       2042
...

This is fine for a few columns. But what about when you have hundreds?

We could generate our list of UNION ALLs, one for each of our columns, by various mechanisms — I often use Excel for things like this. But we’ll do it differently here, by using a loop. Firstly, let’s get our columns of interest and store them in a temp table:

IF OBJECT_ID('tempdb.dbo.#Columns','U') IS NOT NULL
BEGIN
	DROP TABLE #Columns
END
GO

CREATE TABLE #Columns
(
	colid INT NOT NULL PRIMARY KEY CLUSTERED
	,colname VARCHAR(100) NOT NULL
)
GO

;WITH cte_Columns AS
(
	SELECT
			ORDINAL_POSITION
			,COLUMN_NAME
		FROM INFORMATION_SCHEMA.COLUMNS
		WHERE TABLE_SCHEMA = 'dbo'
		AND TABLE_NAME = 'MyCategories'
		-- Your selection criteria here;
		-- in my case, it was all columns of type VARCHAR
		AND DATA_TYPE = 'varchar'
)
INSERT #Columns
(
	colid
	,colname
)
	SELECT
			colid	 = ROW_NUMBER() OVER (ORDER BY ORDINAL_POSITION)
			,colname = COLUMN_NAME
	FROM cte_Columns
GO

SELECT * FROM #Columns
GO

colid colname
----- -------
1     Cat1
2     Cat2
3     Cat3
4     Cat4
5     Cat5

Next, we’ll use a WHILE loop to do the aggregation for each column:

IF OBJECT_ID('tempdb.dbo.##Results','U') IS NOT NULL
BEGIN
	DROP TABLE ##Results
END
GO

CREATE TABLE ##Results
(
	[var] VARCHAR(100) NOT NULL
	,i INT NOT NULL
	,[value] VARCHAR(100) NULL
	,N INT NOT NULL
	,Total INT NOT NULL
)
GO

DECLARE @id INT, @maxid INT, @colname VARCHAR(100), @sql NVARCHAR(MAX)
SELECT @id = 1, @maxid = MAX(colid) FROM #Columns

WHILE (@id <= @maxid)
BEGIN

	SET @colname = NULL
	SET @sql = NULL

	SELECT @colname = colname FROM #Columns WHERE colid = @id

	SET @sql = N'
		INSERT ##Results
		(
			[var]
			,i
			,[value]
			,N
			,Total
		)
		SELECT
			''' + @colname + ''' as [var]
			,ROW_NUMBER() OVER (ORDER BY ' + @colname + ') AS i
			,' + @colname + ' AS [value]
			,COUNT(1) AS N
			,SUM(COUNT(1)) OVER () AS Total
			FROM dbo.MyCategories
			GROUP BY ' + @colname + '
			ORDER BY ' + @colname 

	EXEC sp_executesql @sql

	SET @id = @id + 1

END
GO

SELECT * FROM ##Results ORDER BY [var], i
GO
var  i    value      N    Total
---- ---- ---------- ---- -----
Cat1 1    NULL       832  10000
Cat1 2    Alan       1650 10000
Cat1 3    Bobby      1644 10000
Cat1 4    Charlie    1604 10000
Cat1 5    David      1612 10000
Cat1 6    Eric       1779 10000
Cat1 7    Frank      879  10000
Cat2 1    NULL       1003 10000
Cat2 2    Anna       2042 10000
...

The results are the same, ignoring a couple of handy extra columns that we’ll need later in our calculation.

Some notes:

  • Everything between ‘SET @sql = N’ and the EXEC is inside a string, and sp_executesql executes (runs) the SQL contained within the string variable @sql. This technique of programmatically building SQL within a string and then executing it is referred to as ‘dynamic SQL’. (If you’ve not encountered it before, putting an N directly before the first quote means the string is ‘wide’, of type NVARCHAR.)
  • Why do I bother setting the variables @colname and @sql to NULL, before immediately populating them in the next two statements? Bitter experience! If the code that follows doesn’t work somehow, your local working variables might end up with values you’re not expecting, hence it’s just good practice to set them to NULL at the start of the loop. (Believe me, it can save a lot of head-scratching and debugging.)
  • This might look like a lot of code, but it only takes a few minutes to write; I tend to follow the same pattern for all my SQL loops.

You’re probably thinking “Is there a shorter way of doing this?”. The answer is: yes, but with caveats. Here’s one approach – UNPIVOT our columns of interest, then aggregate:

;WITH cte_UnpivotCats AS
(
	SELECT
			ID
			,[var]
			,[value]
		FROM dbo.MyCategories
		UNPIVOT
		(
			[value] FOR [var] IN ([Cat1],[Cat2],[Cat3],[Cat4],[Cat5])
		) x
)
	SELECT
			[var]
			,[value]
			,COUNT(1) AS N
		FROM cte_UnpivotCats
		GROUP BY [var]
			,[value]
GO

var  value      N
---- ---------- ----
Cat1 Alan       1650
Cat1 Bobby      1644
Cat1 Charlie    1604
Cat1 David      1612
Cat1 Eric       1779
Cat1 Frank      879
Cat2 Anna       2042
Cat2 Belinda    1995
...

Spotted the problem? UNPIVOT doesn’t return rows where the [value] is NULL. If your data absolutely doesn’t have any NULLs, then yes, this works. But if you’ve got hundreds of columns, you’ll still have to build up the column list inside the UNPIVOT programmatically, in order to execute it as dynamic SQL. (Oh, and you’ll have to ensure the columns are all of the exact same type and length – UNPIVOT is very fussy about that!)

Here’s a modification that uses CROSS JOIN instead of UNPIVOT (adapted from a post on stackoverflow), which preserves the NULLs; but as above, you still have to work out how you’re going to code this if you have hundreds of columns.

;WITH cte_UnpivotCats AS
(
	SELECT
			ID
			,[var] = x.colname
			,[value] = CASE x.colname
						WHEN 'Cat1' THEN m.Cat1
						WHEN 'Cat2' THEN m.Cat2
						WHEN 'Cat3' THEN m.Cat3
						WHEN 'Cat4' THEN m.Cat4
						WHEN 'Cat5' THEN m.Cat5
					END
		FROM dbo.MyCategories m
		CROSS JOIN (
			VALUES
				('Cat1'),('Cat2'),('Cat3'),('Cat4'),('Cat5')
		) x(colname)

)
	SELECT
			[var]
			,[value]
			,COUNT(1) AS N
		FROM cte_UnpivotCats
		GROUP BY [var]
			,[value]
GO
var  value      N
---- ---------- ----
Cat1 NULL       832
Cat1 Alan       1650
Cat1 Bobby      1644
Cat1 Charlie    1604
Cat1 David      1612
Cat1 Eric       1779
Cat1 Frank      879
Cat2 NULL       1003
Cat2 Anna       2042
...

And we have our NULLs back. Again, we’ll have to use dynamic SQL if we want to build our query programmatically due to having many columns. But we now have our table of aggregate values and counts; if we have this data for both datasets, then calculating the PSI value is easy.
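For completeness, here’s a minimal sketch of the PSI calculation itself in R, using the usual formulation (the proportions would come from the N / Total figures in the results above, one set per dataset; the names below are illustrative):

# p and q are the category proportions of the same variable in two datasets
psi <- function(p, q, eps = 1e-6) {
  p <- pmax(p, eps)        # guard against zero proportions,
  q <- pmax(q, eps)        # which would make the log blow up
  sum((p - q) * log(p / q))
}

# illustrative proportions for one categorical variable in two datasets
baseline   <- c(Alan = 0.165, Bobby = 0.164, Charlie = 0.160, Other = 0.511)
comparison <- c(Alan = 0.158, Bobby = 0.171, Charlie = 0.163, Other = 0.508)
psi(baseline, comparison)   # small value => the distributions look similar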

As for efficiency, in terms of ‘logical reads’ alone, the CROSS JOIN method is the worst and the UNPIVOT the best; but of course, the UNPIVOT doesn’t preserve NULLs. The loop method is in between the two; if there are appropriate indexes, they will of course help. Having said that, (a) these queries aren’t generally something you run very often, and (b) all of the queries completed in under 10 milliseconds on my under-powered desktop PC.

The main reason I use the loop method, rather than CROSS JOIN, is that it enables me to ‘get in there’, and make small adjustments to the code as I find it necessary – e.g., there might be valid reasons for excluding particular values from an aggregate if they don’t exist in one or other of the datasets. It’s more difficult to do that when using UNPIVOT or CROSS JOIN.


Don’t use functions on the left-hand side of a condition in SQL

I’ve seen this pattern in production code several times within the past month, so thought I’d explain here why it’s a bad idea. Simply put, don’t write queries that look like this:

SELECT Field1, Field2, ...
FROM MyTable
WHERE SomeFunction(someColumn) = someValue

, i.e. with a function applied to a column on the left-hand side of the condition in the WHERE clause. In the recent examples I’ve seen, the WHERE clause usually contains something date-related like this:

WHERE YEAR(someDate) = 2020

or

WHERE YEAR(someDate) = 2020
AND MONTH(someDate) = 5

These queries almost always perform worse than if the condition had been re-written so that the column appears bare on the left-hand side, with any functions moved to the right-hand side (or removed altogether). The reasoning is this:

  • When putting together an execution plan for the query, the SQL Engine wants to use appropriate indexes to get the best performance;
  • These indexes have been previously created using columns in the table;
  • BUT functions of the columns used in the index might not be ordered in the same way;
  • Because of this, the optimiser won’t use the index in the way it’s supposed to, and often the whole table is scanned – which can be very costly in terms of performance.

[There’s so much that could be written here about statistics, parsing, etc., but I’m trying to keep it simple]

This sort of condition (where there’s a function on the left-hand side) has a name: it’s called non-sargable. Conversely, “a condition (or predicate) in a query is said to be sargable if the DBMS engine can take advantage of an index to speed up the execution of the query.” (See Wikipedia)

Here’s a demonstration. I’ve got a table where each row just contains an ID and a date:

CREATE TABLE dbo.DateTest
(
	ID INT NOT NULL PRIMARY KEY CLUSTERED
	,[OurDate] DATETIME NOT NULL
)
GO

I’ve filled it with one million rows: ID = 1 to 1000000, and OurDate is a random date (down to a resolution of 1 second) between the years 2013 and 2020.

Of course, we want an index on our datetime column:

CREATE INDEX IX_DateTest_OurDate ON dbo.DateTest(OurDate)

Now let’s run a couple of queries at the same time (‘in the same batch’). The first is ‘non-sargable’, because it applies the YEAR function to the column on the left-hand side of the comparison. The second is written to be ‘sargable’ – but it’s the exact same logic, and should return the same answer.

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE YEAR(OurDate) >= 2019

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE OurDate >= '2019Jan01'

They do both return the same answer: 180348. But their execution plans are different (look at the bits highlighted in yellow):

[Figure: execution plans for the two queries, showing the second query is better]

The first query scanned (i.e. read through) the whole index, and the query cost was 83% of the batch, compared to the second query which used the index to ‘seek’ the correct data, and was only 17% of the batch.

[Wait, I hear you say — the first query didn’t scan the whole table, it used the index, so that’s ok? Sadly not; it only used the index because our example is so simple that it can calculate the COUNT(1) result from the index alone. In most real-world examples, it’ll scan the table (or the clustered index, if the table has one).]

Look at the two sets of query statistics:

Table 'DateTest'. Scan count 1, logical reads 2236, ...

SQL Server Execution Times:
   CPU time = 78 ms,  elapsed time = 126 ms.

Table 'DateTest'. Scan count 1, logical reads 407, ...

SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 148 ms.

The first (non-sargable) query did more than 5 times as many logical reads as the second (sargable) query, and took nearly 5 times as much CPU time. (We’ll ignore the elapsed time, it’s not relevant here.)

The difference is even greater if we restrict our query to a single month:

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE YEAR(OurDate) = 2018
	AND MONTH(OurDate) = 5

SELECT 
		N = COUNT(1)
	FROM dbo.DateTest
	WHERE OurDate >= '2018May01'
	AND OurDate < '2018Jun01'

We get the same execution plans as before, except this time the query costs were even further apart: 98% and 2%! (Compared to 83% and 17% previously.) What about the statistics?

Table 'DateTest'. Scan count 1, logical reads 2236, physical reads 0, ...

SQL Server Execution Times:
   CPU time = 109 ms,  elapsed time = 113 ms.

Table 'DateTest'. Scan count 1, logical reads 34, physical reads 0, ... 

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 46 ms.

This time, the first (non-sargable) query did about 65 times as many logical reads as the second (sargable) query, and took far more CPU time, comparatively. Clearly, it’s worthwhile to ensure all your queries are ‘sargable’.

A final note: in the version of SQL Server I’m using (2017), the optimiser can’t even cope with conditions that look easily fixable; e.g. a WHERE clause like this:

WHERE ID >= 100

uses an index seek, but this:

WHERE ID+1 >= 101

results in a scan of the whole (clustered) index! Hopefully, future versions of SQL Server will be able to spot and re-arrange obvious cases like this.
