Inexact date grouping using islands

A few years ago at SQLBits, I was fortunate enough to attend a fascinating lecture given by SQL Server MVP Itzik Ben-Gan on Gaps and islands:

Gaps and islands problems involve finding a range of missing values (gaps) or a range of consecutive values (islands) in a sequence of numbers or dates.

Recently, I had to produce a dataset which showed how our interest rates varied over time, by product; for example, products A to E started at 10% through to 50%, but have been adjusted periodically to where they are today, a few points different. Practically, most of the changes have been made at or around the same time — but not exactly. For technical reasons, the specified rates aren’t stored in the database, so there’s no InterestRate table, or a parent table called InterestRateSet that links them together. However, the final result of an application is stored, so we know that a product has been sold, and what the corresponding interest rate was on that day.

The challenge was to work out how many sets of interest rates we’ve had since day 1; but because not every product is purchased every day, if we group by product/day, then it looks like our rates change more often than they do. This is where the ‘gaps and islands’ concept comes in, and luckily I remembered the lecture from a few years before. I found and tweaked some of Itzik’s code from a 2012 article Solving Gaps and Islands with Enhanced Window Functions (SQL Server Pro website) to accept a UDT (in this case, a list of dates). [See previous post, Passing structured data to stored procedures.]

Here it is:

-- Drop/Create our UDT:
IF EXISTS (SELECT * FROM sys.types WHERE is_table_type = 1 AND name = 'DateListType')
	DROP TYPE DateListType
GO
CREATE TYPE DateListType AS TABLE ([Date] DATE)
GO

-- Drop/Create our function:
IF OBJECT_ID('dbo.fn_GetIslandsFromDateList') IS NOT NULL
	DROP FUNCTION dbo.fn_GetIslandsFromDateList
GO
CREATE FUNCTION dbo.fn_GetIslandsFromDateList 
(
	@dateListType DateListType READONLY
	,@GapSize INT
)
RETURNS TABLE 
AS
RETURN 
(
WITH cte_Distinct AS
(
    SELECT
	  [Date]
    FROM @dateListType
    GROUP BY [Date]

), cte_Part1 AS
(
    SELECT
      [Date]
      ,CASE
        WHEN DATEDIFF(day, LAG([Date]) OVER(ORDER BY [Date]), [Date]) <= @GapSize
        THEN 0 ELSE 1 END AS IsStart
      ,CASE
        WHEN DATEDIFF(day, [Date], LEAD([Date]) OVER(ORDER BY [Date])) <= @GapSize
        THEN 0 ELSE 1 END AS IsEnd
    FROM cte_Distinct
)
, cte_Part2 AS
(
    SELECT
      [Date] AS RangeStart
      ,CASE
        WHEN IsEnd = 1
        THEN [Date] ELSE LEAD([Date], 1) OVER(ORDER BY [Date]) END AS RangeEnd
      ,IsStart
    FROM cte_Part1
    WHERE IsStart = 1 OR IsEnd = 1
) 
    SELECT
      ROW_NUMBER() OVER (ORDER BY RangeStart) AS ID
      ,RangeStart
      ,RangeEnd
    FROM cte_Part2
    WHERE IsStart = 1
)
GO

Some things to note:

  • dbo.fn_GetIslandsFromDateList is an inline function, which you can almost think of as ‘a view that takes parameters’ (a parameterised view).
  • You can use CTEs (Common Table Expressions) in inline functions. I love using CTEs; they can make the code very readable. The optimizer typically treats them just like standard sub-queries, so there’s usually no performance hit.
  • The @GapSize parameter controls how far apart our islands can be — see the examples below.
  • If you follow the code through, and break it down into its component parts, you can see how it works — like all the best code, it’s very neat and compact.
  • To re-emphasise, this isn’t my algorithm, it’s Itzik Ben-Gan’s; I’ve done little more than re-format it for my own use.

Let’s feed some dates into our function:

DECLARE @myList DateListType
INSERT @myList([Date])
    VALUES('2017Feb01'),('2017Feb02'),('2017Feb03')
      ,('2017Feb06'),('2017Mar01'),('2017Mar02')

SELECT * FROM dbo.fn_GetIslandsFromDateList(@myList, 2)
GO

ID   RangeStart RangeEnd
---- ---------- ----------
1    2017-02-01 2017-02-03
2    2017-02-06 2017-02-06
3    2017-03-01 2017-03-02

With a @GapSize of 2, we get 3 ranges (islands). With a @GapSize of 3:

SELECT * FROM dbo.fn_GetIslandsFromDateList(@myList, 3)
GO
ID   RangeStart RangeEnd
---- ---------- ----------
1    2017-02-01 2017-02-06
2    2017-03-01 2017-03-02

, we get 2 ranges, because the difference in days between 2017-02-06 and 2017-02-03 is less than or equal to 3.

So this code did the trick, and allowed us to work out exactly how many different sets of rates we’d actually had live. Yes, we could’ve worked it out by hand; but now we’ve got some reproducible code that can drive various reports, showing us exactly how our changes have affected the business.
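
To give a rough idea of how the function was then put to work (the table and column names below are hypothetical, just for illustration), each stored application can be mapped onto the island its date falls in, which gives us a ‘rate set’ number to group by:

DECLARE @applicationDates DateListType
INSERT @applicationDates([Date])
	SELECT ApplicationDate FROM dbo.ApplicationResult GROUP BY ApplicationDate	-- hypothetical table

SELECT
		i.ID AS RateSetID
		,i.RangeStart
		,i.RangeEnd
		,ar.Product
		,ar.InterestRate
	FROM dbo.ApplicationResult ar
	JOIN dbo.fn_GetIslandsFromDateList(@applicationDates, 2) i
		ON ar.ApplicationDate BETWEEN i.RangeStart AND i.RangeEnd
	GROUP BY i.ID, i.RangeStart, i.RangeEnd, ar.Product, ar.InterestRate
	ORDER BY i.ID, ar.Product
GO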

A final thought: Quite often, solving a problem comes down to just knowing the right phrase to google!


Passing structured data to stored procedures

As I wrote about here, I like to pass structured data around using XML (where applicable). While I think it’s the most flexible way, it’s understandable that people might not want to learn XML-parsing syntax, if they don’t have to. Another way of passing data into stored procedures (sprocs) is to use user-defined types (UDTs).

Here’s a simple example:

-- If the UDT already exists, delete it:
IF EXISTS (
	SELECT *
		FROM sys.types
		WHERE is_table_type = 1
		AND [name] = 'ListOfDates'
		AND [schema_id] = SCHEMA_ID('dbo')
	)
BEGIN
	DROP TYPE dbo.ListOfDates
END
GO

-- Create our UDT:
CREATE TYPE dbo.ListOfDates AS TABLE
(
	DateTypeID TINYINT NOT NULL
	,[Date] DATE NOT NULL
)
GO

-- Let's test it out:

SET DATEFORMAT YMD

DECLARE @MyListOfDates AS dbo.ListOfDates

INSERT @MyListOfDates(DateTypeID, [Date])
	VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03')
	,(2, '2016Oct10'),(2, '2017Jan01')

SELECT * FROM @MyListOfDates
GO

DateTypeID Date
---------- ----------
1          2016-01-01
1          2016-01-02
1          2016-01-03
2          2016-10-10
2          2017-01-01

Hopefully, that looks straightforward; the way we’ve used it here is not dissimilar to a table variable (e.g. DECLARE @MyTable AS TABLE...), but we can’t pass a table variable to a sproc. We’ll create a test sproc and pass our new user-defined type to it:

-- If the sproc already exists, delete it:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
	DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create our sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
	@ListOfDates AS dbo.ListOfDates READONLY
)
AS
BEGIN
	
	SELECT
			DateTypeID
			,MIN([Date]) AS MinDate
			,MAX([Date]) AS MaxDate
		FROM @ListOfDates
		GROUP BY DateTypeID
		ORDER BY DateTypeID

END
GO

-- Test it out:
DECLARE @MyListOfDates AS dbo.ListOfDates
INSERT @MyListOfDates(DateTypeID, [Date])
	VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03')
	,(2, '2016Oct10'),(2, '2017Jan01')
	,(3, '2017Feb01'),(3, '2017Feb03'),(3, '2017Feb28')

EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates
GO

DateTypeID MinDate    MaxDate
---------- ---------- ----------
1          2016-01-01 2016-01-03
2          2016-10-10 2017-01-01
3          2017-02-01 2017-02-28

See the READONLY in the CREATE PROCEDURE? Here’s what happens when you omit it:

The table-valued parameter "@ListOfDates" must be declared with the READONLY option.

This is one of the limitations of user-defined types: they can’t be used to pass data back. So we couldn’t make any changes to @ListOfDates in the sproc, and see those changes reflected outside of the sproc.
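
To illustrate (a quick sketch, using a throwaway sproc name), any statement that tries to modify the parameter inside the sproc is rejected:

CREATE PROCEDURE dbo.usp_TryToModifyUDT
(
	@ListOfDates AS dbo.ListOfDates READONLY
)
AS
BEGIN
	-- Any DML against the READONLY parameter is rejected;
	-- e.g. this DELETE fails with an error telling us the
	-- table-valued parameter cannot be modified:
	DELETE @ListOfDates
END
GO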

There’s another limitation: UDTs can’t be used cross-database. Here’s what happens:

The type name 'MyOtherDatabase.dbo.ListOfDates' contains more than the
maximum number of prefixes. The maximum is 1.

Even if you create the exact same type in another database, with the same name, it won’t work.

(Note that UDTs don’t have to be tables: go here [MSDN] for more info.)
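
For example, a minimal non-table UDT (an alias type); the name here is purely illustrative:

CREATE TYPE dbo.PostcodeString FROM VARCHAR(10) NOT NULL
GO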


Just for kicks, here’s how I’d replicate the above functionality, but using XML:

-- If the sproc already exists, delete it:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
	DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create the sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
	@ListOfDates XML
)
AS
BEGIN
	
	WITH cte_ExtractFromXML AS
	(
		SELECT 
				DateTypeID	= d.i.value('(@type)[1]', 'INT')
				,[Date]		= d.i.value('.[1]', 'DATE')
			FROM @ListOfDates.nodes('//date') d(i)
	)
	SELECT
			DateTypeID
			,MIN([Date]) AS MinDate
			,MAX([Date]) AS MaxDate
		FROM cte_ExtractFromXML
		GROUP BY DateTypeID
		ORDER BY DateTypeID

END
GO

-- Test it with some XML data:
DECLARE @MyListOfDates XML
SET @MyListOfDates = '
<listOfDates>
	<date type="1">2016Jan01</date>
	<date type="1">2016Jan02</date>
	<date type="1">2016Jan03</date>
	<date type="2">2016Oct10</date>
	<date type="2">2017Jan01</date>
	<date type="3">2017Feb01</date>
	<date type="3">2017Feb03</date>
	<date type="3">2017Feb28</date>
</listOfDates>'

EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates
GO

DateTypeID  MinDate    MaxDate
----------- ---------- ----------
1           2016-01-01 2016-01-03
2           2016-10-10 2017-01-01
3           2017-02-01 2017-02-28

If we wanted to, we could declare our @ListOfDates parameter as OUTPUT, and make changes to it in the sproc, e.g.:

-- Drop the sproc if it exists:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
	DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create our (new and improved) sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
	@ListOfDates XML OUTPUT
)
AS
BEGIN
	
	DECLARE @Aggregated XML

	;WITH cte_ExtractFromXML AS
	(
		SELECT 
				DateTypeID	= d.i.value('(@type)[1]', 'INT')
				,[Date]		= d.i.value('.[1]', 'DATE')
			FROM @ListOfDates.nodes('//date') d(i)
	)
	SELECT @Aggregated = (
		SELECT
				DateTypeID AS '@id'
				,MIN([Date]) AS '@MinDate'
				,MAX([Date]) AS '@MaxDate'
			FROM cte_ExtractFromXML
			GROUP BY DateTypeID
			ORDER BY DateTypeID
			FOR XML PATH('DataType'), ROOT('Aggregated')
	)

	SELECT @ListOfDates = (
		SELECT @ListOfDates, @Aggregated FOR XML PATH('Output'), TYPE
	)

END
GO

-- Run it:
DECLARE @MyListOfDates XML
SET @MyListOfDates = '
<listOfDates>
	<date type="1">2016Jan01</date>
	<date type="1">2016Jan02</date>
	<date type="1">2016Jan03</date>
	<date type="2">2016Oct10</date>
	<date type="2">2017Jan01</date>
	<date type="3">2017Feb01</date>
	<date type="3">2017Feb03</date>
	<date type="3">2017Feb28</date>
</listOfDates>'

EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates OUTPUT

SELECT @MyListOfDates AS MyListOfDates
GO

<Output>
  <listOfDates>
    <date type="1">2016Jan01</date>
    <date type="1">2016Jan02</date>
    <date type="1">2016Jan03</date>
    <date type="2">2016Oct10</date>
    <date type="2">2017Jan01</date>
    <date type="3">2017Feb01</date>
    <date type="3">2017Feb03</date>
    <date type="3">2017Feb28</date>
  </listOfDates>
  <Aggregated>
    <DataType id="1" MinDate="2016-01-01" MaxDate="2016-01-03" />
    <DataType id="2" MinDate="2016-10-10" MaxDate="2017-01-01" />
    <DataType id="3" MinDate="2017-02-01" MaxDate="2017-02-28" />
  </Aggregated>
</Output>

As you can see, we’ve returned our original list, along with the aggregated data, both under a new parent node.


XSDs FTW

I’m a big fan of passing data to and from stored procedures (sprocs) as XML, especially XML that represents a complete object, or a list or hierarchy of objects. For a start, XML is perfectly human-readable (if you’re doing it right), and nearly every system and language knows how to work with it, SQL Server / TSQL included. What makes it even better is being able to validate the XML before you even begin to parse it, using an XSD (XML Schema Definition).

Here’s a complete example you can copy and run:

USE tempdb
GO

-- If the XSD already exists, drop it:

IF EXISTS (
  SELECT xsc.name 
    FROM sys.xml_schema_collections xsc
    WHERE xsc.name='TestSchema'
)
BEGIN
  DROP XML SCHEMA COLLECTION TestSchema
END
GO

-- Create the schema:

CREATE XML SCHEMA COLLECTION TestSchema AS '
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:simpleType name="ST_EmailAddress">
      <xsd:restriction base="xsd:string">
      <xsd:pattern value="[^@]*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9}"/>
      </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="ST_Usage">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="home"/>
      <xsd:enumeration value="work"/>
      <xsd:enumeration value="other"/>
    </xsd:restriction>  
  </xsd:simpleType>

  <xsd:complexType name="CT_EmailAndUsage">
    <xsd:simpleContent>
      <xsd:extension base="ST_EmailAddress">
        <xsd:attribute name="usage" use="required" type="ST_Usage" />
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:element name="emailList">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="email" type="CT_EmailAndUsage" minOccurs="1" maxOccurs="3" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>'
GO

-- Make some dummy data that conforms to the schema above:

DECLARE @testXML AS XML(TestSchema)

SET @testXML = '
<emailList>
  <email usage="home">pete@home.com</email>
  <email usage="work">pete@work.com</email>
  <email usage="other">pete@other.com</email>
</emailList>
'

-- Query it:

SELECT
    id = ROW_NUMBER() OVER (ORDER BY e.i)
    ,EmailAddress = e.i.value('(.)[1]','VARCHAR(255)')
    ,Usage = e.i.value('(@usage)[1]', 'VARCHAR(20)')
  FROM @testXML.nodes('//email') AS e(i)
GO

The result set is:


id   EmailAddress    Usage
---  --------------  ------
1    pete@home.com   home
2    pete@work.com   work
3    pete@other.com  other

(3 row(s) affected)

Now, try messing around with the contents of the @testXML variable, e.g.:

  1. Set usage to a string that’s not ‘home’, ‘work’ or ‘other’
  2. Add a fourth email address
  3. Take the ‘@’ symbol out of an email address
  4. Put in some extra nodes that don’t belong

, then re-run the code. They all fail, because the XML has to conform to the XSD we’ve defined as TestSchema. So SQL Server automatically rejects any input that fails data validation (e.g. the format of an email address) or breaks business logic (‘no more than three emails’); if the XML was being passed to a sproc, the call would fail, and no code inside would ever run.

Obviously, you may not want to automatically reject ‘broken’ XML; you’ll probably want to record the fact that it was broken. That’s fine – your code (sproc) can accept schema-less XML and attempt the cast itself; if the cast fails, you can respond however you like.
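
Here’s a minimal sketch of that approach; the sproc name and the error handling are illustrative only:

CREATE PROCEDURE dbo.usp_SaveEmailList
(
	@emailListXML XML	-- note: schema-less
)
AS
BEGIN
	DECLARE @validated XML(TestSchema)

	BEGIN TRY
		SET @validated = @emailListXML	-- the assignment performs the XSD validation
	END TRY
	BEGIN CATCH
		-- The XML didn't conform; record the failure however suits you, e.g.:
		SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage
		RETURN
	END CATCH

	-- ...carry on, safe in the knowledge that @validated conforms to TestSchema...
END
GO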

There’s certainly an overhead in learning the language of XSDs, but because it’s a generic technology, there are plenty of online resources, e.g. w3schools. When it comes to transferring complex objects around as data, I don’t know of a better way than using XML and XSD.

Note

Because Microsoft haven’t got round to coding it yet, you can’t query the text value of a node that’s been defined as a type in XSD. That is, I’d ordinarily like to be able to query the email address itself like this:

EmailAddress = e.i.value('(./text())[1]','VARCHAR(255)')

, because directly accessing the node text is faster (presumably because it doesn’t have to do any more parsing). But sadly, it’ll just fail with an error. However, this is unlikely to cause practical problems; it’s just a mild annoyance that’s vastly outweighed by the benefits that come from validated XML.


Floats may not look distinct

The temporary table #Data contains the following:


SELECT * FROM #Data
GO

value
-------
123.456
123.456
123.456

(3 row(s) affected)

Three copies of the same number, right? However:


SELECT DISTINCT value FROM #Data
GO

value
-------
123.456
123.456
123.456

(3 row(s) affected)

We have the exact same result set. How can this be?

It’s because what’s being displayed isn’t necessarily what’s stored internally. This should make it clearer:


SELECT remainder = (value - 123.456) FROM #Data
GO

remainder
----------------------
9.9475983006414E-14
1.4210854715202E-14
0

(3 row(s) affected)

The numbers aren’t all exactly 123.456; the data is in floating-point format, and two of the values were ever-so-slightly larger. The lesson is: be very careful when using DISTINCT, GROUP BY, or aggregate functions on columns declared as type float.
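
If you do need to treat values like these as equal, one workaround (a sketch, assuming six decimal places is precise enough for your data) is to cast or round to a fixed-precision type before comparing:

SELECT DISTINCT CAST(value AS DECIMAL(18,6)) AS value
	FROM #Data
GO
-- now returns a single row: 123.456000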

Some other observations:

  • The above will probably seem familiar to anyone who’s done much text wrangling in SQL. Strings can look identical to the eye, but different to SQL Server’s processing engine; you end up having to examine every character, finding and eliminating extraneous tabs (ASCII code 9), carriage returns (ASCII code 13), line-feeds (ASCII code 10), or even weirder characters.
  • If your requirement warrants it, I can thoroughly recommend the GNU Multiple Precision Arithmetic Library, which stores numbers to arbitrary precision. It’s available as libraries for C/C++, and as the R package gmp:

# In R:

> choose(200,50);  # This is 200! / (150! 50!)
[1] 4.538584e+47
> library(gmp);
Attaching package: ‘gmp’
> chooseZ(200,50);
Big Integer ('bigz') :
[1] 453858377923246061067441390280868162761998660528

# Dividing numbers:
> as.bigz(123456789012345678901234567890) / as.bigz(9876543210)
Big Rational ('bigq') :
[1] 61728394506172838938859798528 / 4938271605
# ^^ the result is stored as a rational, in canonical form.


The meaning of NULL, and why magic data is bad

Part 1: NULL

As a junior web developer, I remember other developers warning me about database NULLs: “you don’t really want to deal with them”, “code them out if you can”, “program around them”, “turn them into something else”. Dear reader, they were quite wrong.

For a piece of data, a NULL value can mean any of:

  1. We don’t know this.
  2. We don’t know this yet (but there’s an expectation we’ll know this at a later date).
  3. We don’t know this, because the question of “What is the value of X for object Y?” is not applicable here.
  4. We don’t know this, and the chances of us ever knowing it are practically zero.
  5. It is logically impossible for this data to exist.

Context usually clues us into which meaning of NULL we’re looking at; if a customer’s email address field is NULL, then in terms of the above:

  1. It hasn’t been asked for / provided / collected.
  2. It hasn’t been asked for / provided / collected yet, but we might be getting this data in the future.
  3. The customer doesn’t have an email address.
  4. The rest of the customer’s data is incorrect or missing, so we have no means of contacting them to find their email address.
  5. The ‘customer’ isn’t something capable of owning an email address(*).

Regardless, if a customer doesn’t have an email address in the system, any code that consumes customer data will have to cope in a sensible manner. If the code is a data entry form, then an empty text field will be displayed; but if the code does marketing mail-outs, then it’ll just have to skip that record.

(*) It could be that the table is ‘multi-use’, and the field makes no sense for some types of data.

Going back to meaning (5), a couple of better examples might be:

  • ‘O’ level GCSE results (at age 16):   My age cohort did GCEs, therefore it is logically impossible for anyone to ascertain my GCSE results.
  • Date of last gynaecological exam:   Clearly, this would never be applicable for anyone born genetically male.

(In multivariate analysis, these would be referred to as structural zeros, rather than the usual sampling zeros. “It was impossible for it to occur” vs. “We did not see it occur”.)

Despite NULL being the very embodiment of “no information”, sometimes “no information” is the information in itself! Trivially, a SQL query to find all customers without email addresses will specifically look for the NULL value in that field. Data with NULLs in can be indexed, the same as any other data. You can even create a filtered index that goes straight to the NULL data:

CREATE INDEX IX_Customer_EmailIsNULL
  ON dbo.Customer(Email)
  WHERE Email IS NULL

So a NULL value is generally not something to be avoided, modified, or coded around. It is eminently useful, and a vital part of your data structures.


Part 2: Bad magic

Now, I started with part 1 because of a common pattern I see used in data capture, usually due to novice / junior / misguided developers. An example: I have a database table of addresses (called Address), with the usual fields. My company operates strictly within the UK, so in an effort to keep our data as clean as possible, we have a CHECK constraint on the Postcode field; not a foreign key to a table of all postcodes ever (who wants to maintain that??), but a simple check against the UK postcode format. The check will prevent entries like “unknown”, or mistakes like “SW1A IAA” (‘I’ instead of ‘1’). Also, the postcode is ‘NOT NULL’-able — because every address has a postcode, right?

It might look like this:

CREATE TABLE xyz.[Address]
(
	AddressID INT NOT NULL PRIMARY KEY
	,Line1 VARCHAR(255) NOT NULL
	,Line2 VARCHAR(255) NULL
	,Postcode VARCHAR(10) NOT NULL
		CHECK (Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
		OR Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]')
)

(Clearly the CHECK constraint isn’t exhaustive: as it is, it’ll reject SW1A 1AA, the postcode of Buckingham Palace. It’ll do for illustrating the point.)

If customer data is supplied without a postcode, then any INSERT will fail. What tends to happen is that, over time, you’ll see the Postcode field start to contain values like ZZ1 1ZZ; a value that passes our simple CHECK constraint rules, but is probably not a valid UK postcode.

So how did ZZ1 1ZZ get into the database?

Scenario 1a:

The developer coding the application form tried to INSERT a record with no postcode, thus the operation failed with an error. So in the input form, they put some code to change a blank postcode to ZZ1 1ZZ when INSERT-ing.

Scenario 1b:

The customer input form hooks up to an address validator; if the address cannot be validated, then the customer is asked to fill in all the fields themself, and can easily put in an invalid postcode which doesn’t make it past the simple check constraint on the Address table. The developer dutifully catches the error, changes the postcode to ZZ1 1ZZ and re-INSERTs.

Scenario 2:

A customer complained about being marketed to, and needs to be removed from the database as soon as possible. To do it properly would mean changing code in several systems; the quick hack is to change their postcode to ZZ1 1ZZ, then make sure the mail-out query ignores records with that postcode value. This is then adopted as semi-official practice: “To remove a customer from marketing, just set their postcode to ZZ1 1ZZ.”

There are multiple problems with having a postcode of ZZ1 1ZZ meaning ‘unknown’, ‘error’ or ‘do not contact’:

  1. It’s a ‘magic’ string; for it to have system-wide meaning, every single system must understand it, and what it represents. What if someone INSERT-ed ZZ2 2ZZ? It wouldn’t be similarly understood, it would be treated as a real postcode.
  2. Every new developer and analyst has to be told about the magic string. What if there’s a magic string for every piece of data? Ok, it could be solved by using VIEWs, but then that’s more code that has to be known about, and scrupulously maintained.
  3. What if, by some mistake, post is sent out to that postcode? (This will happen, I guarantee it.) One of your other systems is likely recording the fact that mail has been sent correctly, but the chances of it arriving are slim.
  4. The real postcode ZZ1 1ZZ might not exist now, but it may in the future: there are many examples of postcodes falling into and out of use. How will you know if your postcode is genuine, or a magic string? Take note: postcodes that start ZZ99 are real live NHS “pseudo-postcodes”…

As you’ve probably realised, my answer would be to make the postcode field NULL-able(*), and to INSERT a NULL in the case of missing or broken data, completely avoiding any magic strings. It needs no special handling, and contextually, it very probably has a limited range of well-understood meanings; e.g. if you see a field MiddleName that is NULL for some records, you would presume it to mean the Customer has no middle name.

Note this is why in the email example in Part 1, we shouldn’t use a blank string instead of a NULL – because a blank string is still a ‘magic’ string, just one that would happen to be widely understood. There will be cases when a blank string legitimately means something quite different to a NULL.

(*) I’ve heard people claim that fields with CHECK constraints can’t be NULL-able. In modern flavours of SQL Server, this is demonstrably false. If the field is NULL, the constraint just isn’t checked.
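
A quick demonstration, using a throwaway temp table with the same simple postcode check:

CREATE TABLE #PostcodeCheck
(
	Postcode VARCHAR(10) NULL
		CHECK (Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
		OR Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]')
)
INSERT #PostcodeCheck(Postcode) VALUES('N1 1AA')	-- passes the constraint
INSERT #PostcodeCheck(Postcode) VALUES(NULL)		-- also fine: the constraint isn't checked for NULLs
--INSERT #PostcodeCheck(Postcode) VALUES('unknown')	-- this one would fail the constraint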


Part 3: Keeping track

By clearing up one locus of ambiguity, I’m afraid I’m going to introduce a new one. Presumably, we’re going to want to record why our postcode field is NULL. We can either:

(A) Create a new lookup table, ReasonForNull (say); add a new field, ReasonForNullID, to our Address table, add a suitable foreign key, and a CHECK constraint that says “if the Postcode is NULL, then ReasonForNullID must not be NULL – and vice versa”, e.g.:

ALTER TABLE xyz.[Address]
ADD CONSTRAINT CK_Address_PostcodeOrReason 
CHECK( (Postcode IS NOT NULL AND ReasonForNullID IS NULL)
	OR (Postcode IS NULL AND ReasonForNullID IS NOT NULL)
)

or

(B) Create our new lookup table (as above), but also create another new table, Address_ReasonForNull, like so:

CREATE TABLE xyz.Address_ReasonForNull
(
	AddressID INT NOT NULL
		CONSTRAINT PK_Address_ReasonForNull
		PRIMARY KEY CLUSTERED
	,ReasonForNullID TINYINT NOT NULL
	,CreatedOn DATETIME NOT NULL
		CONSTRAINT DF_Address_ReasonForNull_CreatedOn
		DEFAULT(GETDATE())
	,CONSTRAINT FK_Address_ReasonForNull_AddressID
		FOREIGN KEY (AddressID)
		REFERENCES xyz.Address(AddressID)
	,CONSTRAINT FK_Address_ReasonForNull_ReasonForNullID
		FOREIGN KEY (ReasonForNullID)
		REFERENCES xyz.ReasonForNull(ReasonForNullID)
)

and only INSERT into it when we have an invalid postcode.

Neither (A) nor (B) is a perfect solution. (A) will waste a byte per Address record (if ReasonForNullID is declared as a TINYINT) if the postcode is ok, but has the advantage of strictly maintaining integrity, thanks to the CHECK constraint. (B) wastes no space, but there is no simple way (that I know of) of enforcing that a child record must exist, given data in the parent record.
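
Under (B), the rule can at least be monitored, if not enforced; a simple query (assuming the Postcode column has been made NULL-able, as argued above) finds Address records that are missing their explanatory child record:

SELECT a.AddressID
	FROM xyz.[Address] a
	LEFT JOIN xyz.Address_ReasonForNull arn
		ON arn.AddressID = a.AddressID
	WHERE a.Postcode IS NULL
	AND arn.AddressID IS NULL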

If we want to record, say, the postcode that was entered but not validated, then it’s no bother under scenario (B) to add a new field to our Address_ReasonForNull table:

ALTER TABLE xyz.Address_ReasonForNull
	ADD OriginalData VARCHAR(20) NULL

However, if we were doing (A), then we’d have to add this column to the main Address table (and change the CHECK constraint); potentially, we could waste a lot of space.

Personally, I’d favour (B), and would push for all data changes to be made via stored procedures (aka sprocs). That way, I can ensure that the data in my two tables is kept perfectly in sync.
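
As a sketch of that sproc-only approach (again assuming the Postcode column is now NULL-able; the sproc name is illustrative), both INSERTs happen inside a single transaction:

CREATE PROCEDURE xyz.usp_InsertAddress
(
	@AddressID INT
	,@Line1 VARCHAR(255)
	,@Line2 VARCHAR(255) = NULL
	,@Postcode VARCHAR(10) = NULL
	,@ReasonForNullID TINYINT = NULL
	,@OriginalData VARCHAR(20) = NULL
)
AS
BEGIN
	SET XACT_ABORT ON	-- any error rolls the whole transaction back

	BEGIN TRANSACTION

	INSERT xyz.[Address](AddressID, Line1, Line2, Postcode)
		VALUES(@AddressID, @Line1, @Line2, @Postcode)

	IF @Postcode IS NULL
	BEGIN
		INSERT xyz.Address_ReasonForNull(AddressID, ReasonForNullID, OriginalData)
			VALUES(@AddressID, @ReasonForNullID, @OriginalData)
	END

	COMMIT TRANSACTION
END
GO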

Any thoughts or comments? Feel free to let us know!


Enforcing type-specific data

“Yeah, yeah, but your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.” — Dr Ian Malcolm, Jurassic Park

(Forgive the over-used quote above, but it feels apt — I think you’ll see why.)

In our hypothetical situation, we have two types of Employee: a Manager and an Agent. So to begin with, we have Employee and EmployeeType tables:

CREATE TABLE dbo.EmployeeType
(
  EmployeeTypeID TINYINT NOT NULL
    CONSTRAINT PK_EmployeeType PRIMARY KEY CLUSTERED
  ,[Description] VARCHAR(50) NOT NULL
)
GO
INSERT dbo.EmployeeType(EmployeeTypeID, [Description])
    VALUES(1, 'Agent'), (2, 'Manager')
GO

We’ll say that Agents get a ‘terminal code’ (to access their terminals); Managers don’t need a terminal code, but they do get to use the staff car park, so we’ll need to store their car registration — but only for Managers, Agents aren’t allowed to park at work.

CREATE TABLE dbo.Employee
(
  EmployeeID INT NOT NULL
    CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED
  ,EmployeeTypeID TINYINT NOT NULL 
  ,[Name] VARCHAR(50) NOT NULL
  ,StartDate DATE NOT NULL
  -- 1. Agent-specific
  ,TerminalCode VARCHAR(5) NULL
  -- 2. Manager-specific
  ,CarReg VARCHAR(10) NULL
  ,CONSTRAINT FK_Employee_EmployeeTypeID
    FOREIGN KEY (EmployeeTypeID) REFERENCES dbo.EmployeeType(EmployeeTypeID)
)
GO

Let’s put some data in:

INSERT dbo.Employee
(
  EmployeeID
  ,EmployeeTypeID
  ,[Name]
  ,StartDate
  ,TerminalCode
  ,CarReg
)
  VALUES(1, 1, 'Alan Smith', '2001Jan01', 'AS111', NULL)
    ,(2, 1, 'Brenda Thomson', '2002Feb02', 'BT222', NULL)
    ,(3, 1, 'Colin Underwood', '2003Mar03', 'CU333', NULL)
    ,(4, 2, 'Diana Varney', '2004Apr04', NULL, 'DV65 QAZ')
    ,(5, 2, 'Edward Wilson', '2005May05', NULL, 'EW65 WSX')
GO

So far, so straightforward. Now let’s muck things up:

UPDATE dbo.Employee
  SET CarReg = 'AB91 EDC'
  WHERE EmployeeID = 1
GO

which of course works, but now our mere Agent gets to park for free. (How disastrous!) We can fix this — we’ll put the NULL back in the CarReg field, and add a CHECK constraint to stop it happening again:

UPDATE dbo.Employee
  SET CarReg = NULL
  WHERE EmployeeID = 1
GO

ALTER TABLE dbo.Employee
  ADD CONSTRAINT CK_Employee_CheckFields
    CHECK
      (
      CASE
        WHEN EmployeeTypeID = 1
        THEN CASE WHEN TerminalCode IS NOT NULL AND CarReg IS NULL THEN 1 ELSE 0 END
        WHEN EmployeeTypeID = 2
        THEN CASE WHEN TerminalCode IS NULL AND CarReg IS NOT NULL THEN 1 ELSE 0 END
      END = 1
      )
GO

If we try and set the CarReg for an Agent now, it’ll fail:

The UPDATE statement conflicted with the CHECK constraint "CK_Employee_CheckFields"

Well, that works! Job done, no? Not really… every time we need to add or remove Agent- or Manager-specific fields, we’ll have to change that CHECK constraint; what if there were 100 fields? The constraint could get pretty unwieldy. Not to mention the wasted column space: Agent fields will always be NULL for Managers, and vice versa.

Drop the CHECK constraint, and the TerminalCode and CarReg fields from Employee; then we’ll create two new tables to hold the type-specific data:

CREATE TABLE dbo.AgentData
(
  EmployeeID INT NOT NULL CONSTRAINT PK_AgentData PRIMARY KEY CLUSTERED
  ,TerminalCode VARCHAR(5) NULL
)
GO
INSERT dbo.AgentData(EmployeeID, TerminalCode)
  VALUES(1,  'AS111'), (2, 'BT222'), (3, 'CU333')
GO
CREATE TABLE dbo.ManagerData
(
  EmployeeID INT NOT NULL CONSTRAINT PK_ManagerData PRIMARY KEY CLUSTERED
  ,CarReg VARCHAR(10) NULL
)
GO
INSERT dbo.ManagerData(EmployeeID, CarReg)
  VALUES(4, 'DV65 QAZ'), (5, 'EW65 WSX')
GO

(Foreign keys on EmployeeID removed for clarity.)

That’s better! Except how do we enforce that AgentData only holds EmployeeIDs that relate to Agents, and that ManagerData only holds EmployeeIDs that relate to Managers? We could use a trigger, absolutely. But for the purposes of this post, we have an irrational fear of triggers and don’t want to use them! Can we enforce our rules with just keys?

(And here’s where the quote at the beginning becomes relevant…) Yes, I think we can, but I don’t know that we should. The following solution seems to work, but I don’t know if I’d ever use it in practice; it feels over-the-top. (I should also add, I’ve not tested it in a production-like environment; it might be unsuitable for some reason I’ve not yet discerned.) Anyway, here it is.

Starting from scratch:

CREATE TABLE dbo.Employee
(
  EmployeeID INT NOT NULL
        CONSTRAINT UQ_Employee_EmployeeID UNIQUE
  ,EmployeeTypeID TINYINT NOT NULL 
  ,[Name] VARCHAR(50) NOT NULL
  ,StartDate DATE NOT NULL
  ,CONSTRAINT PK_Employee
        PRIMARY KEY CLUSTERED (EmployeeID, EmployeeTypeID)
  ,CONSTRAINT FK_Employee_EmployeeTypeID
        FOREIGN KEY (EmployeeTypeID) REFERENCES dbo.EmployeeType(EmployeeTypeID)
)

We’ve given the Employee table a composite PRIMARY KEY, comprised of the EmployeeID, and the EmployeeTypeID. The EmployeeID also has a UNIQUE index on it, for two reasons: (1) obviously, to stop the same EmployeeID having multiple types, and (2) so that other tables can reference the EmployeeID in a FOREIGN KEY, without having to include the EmployeeTypeID as well. Without the UNIQUE constraint, we’d get the following error:

There are no primary or candidate keys in the referenced table 'dbo.Employee'
that match the referencing column list in the foreign key

, but with it, we’re ok. We’ll INSERT the same employee details as before:

INSERT dbo.Employee
(
  EmployeeID
  ,EmployeeTypeID
  ,[Name]
  ,StartDate
)
  VALUES(1, 1, 'Alan Smith', '2001Jan01')
    ,(2, 1, 'Brenda Thomson', '2002Feb02')
    ,(3, 1, 'Colin Underwood', '2003Mar03')
    ,(4, 2, 'Diana Varney', '2004Apr04')
    ,(5, 2, 'Edward Wilson', '2005May05')
GO

Here’s a slightly icky bit: we’re going to fix the EmployeeTypeID of 1 in the AgentData table:

CREATE TABLE dbo.AgentData
(
  EmployeeID INT NOT NULL CONSTRAINT PK_AgentData PRIMARY KEY CLUSTERED
  ,EmployeeTypeID AS CAST(1 AS TINYINT) PERSISTED
  ,TerminalCode VARCHAR(5) NULL
  ,CONSTRAINT FK_AgentData_EmployeeID_EmployeeTypeID
    FOREIGN KEY (EmployeeID, EmployeeTypeID) REFERENCES dbo.Employee(EmployeeID, EmployeeTypeID)
)
GO

, and fix the EmployeeTypeID of 2 in the ManagerData table:

CREATE TABLE dbo.ManagerData
(
  EmployeeID INT NOT NULL CONSTRAINT PK_ManagerData PRIMARY KEY CLUSTERED
  ,EmployeeTypeID AS CAST(2 AS TINYINT) PERSISTED
  ,CarReg VARCHAR(10) NULL
  ,CONSTRAINT FK_ManagerData_EmployeeID_EmployeeTypeID
    FOREIGN KEY (EmployeeID, EmployeeTypeID) REFERENCES dbo.Employee(EmployeeID, EmployeeTypeID)
)
GO

Because both tables have a FOREIGN KEY onto the Employee table that includes the EmployeeTypeID, there’s no way to get the types mixed up: an Agent has to be an Agent in both tables, and likewise for Managers.

Let’s INSERT the data and check we can query it without any problems:

INSERT dbo.AgentData(EmployeeID, TerminalCode)
  VALUES(1,  'AS111'), (2, 'BT222'), (3, 'CU333')
GO
INSERT dbo.ManagerData(EmployeeID, CarReg)
  VALUES(4, 'DV65 QAZ'), (5, 'EW65 WSX')
GO
SELECT
    e.*
    ,ad.TerminalCode
    ,md.CarReg
  FROM dbo.Employee e
  LEFT JOIN dbo.AgentData ad
    ON ad.EmployeeID = e.EmployeeID
  LEFT JOIN dbo.ManagerData md
    ON md.EmployeeID = e.EmployeeID
  ORDER BY e.EmployeeID
GO
EmployeeID  EmployeeTypeID Name              StartDate  TerminalCode CarReg
----------- -------------- ----------------- ---------- ------------ ----------
1           1              Alan Smith        2001-01-01 AS111        NULL
2           1              Brenda Thomson    2002-02-02 BT222        NULL
3           1              Colin Underwood   2003-03-03 CU333        NULL
4           2              Diana Varney      2004-04-04 NULL         DV65 QAZ
5           2              Edward Wilson     2005-05-05 NULL         EW65 WSX

(5 row(s) affected)
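
And a quick check that the typing really is enforced: trying to put a Manager into the AgentData table fails, because the composite key (5, 1) doesn’t exist in the Employee table:

INSERT dbo.AgentData(EmployeeID, TerminalCode)
	VALUES(5, 'EW555')	-- EmployeeID 5 (Edward Wilson) is a Manager, not an Agent
GO
-- Error: the INSERT conflicts with the FOREIGN KEY constraint
-- "FK_AgentData_EmployeeID_EmployeeTypeID".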

We can also add FOREIGN KEY references from other tables, pointing at the EmployeeIDs on the AgentData and ManagerData tables, and get the expected behaviour, e.g.:

CREATE TABLE dbo.ManagerEmail
(
  EmployeeID INT NOT NULL
    CONSTRAINT PK_ManagerEmail PRIMARY KEY CLUSTERED
  ,EmailAddress VARCHAR(255) NOT NULL
  ,CONSTRAINT FK_ManagerEmail_EmployeeID
    FOREIGN KEY (EmployeeID) REFERENCES dbo.ManagerData(EmployeeID)
)
GO
INSERT dbo.ManagerEmail(EmployeeID, EmailAddress)
	VALUES(4, 'diana.varney@bigcorp.com')
GO
-- works fine, Diana is a Manager

INSERT dbo.ManagerEmail(EmployeeID, EmailAddress)
	VALUES(3, 'colin.underwood@bigcorp.com')
GO
-- Error:
-- "The INSERT statement conflicted with the
-- FOREIGN KEY constraint "FK_ManagerEmail_EmployeeID".

What do you think? Does this work for you, or would a trigger be more straightforward? As always, please let me know in the comments, I’d love to hear from you.


The Request / Response pattern

Here’s something I come across too often: how do we deal with sending data to and receiving data from external systems? I don’t mean the format of the data, or the transmission mechanism; but simply, how do we keep a record of what occurred? (And I use the word ‘external’ here in a vague sense; it could mean external to your company, external to the current app domain, or just something that you want to keep separate.)

Let’s pretend our system needs to call out to Amazon to query the best price for a book: we send the Amazon service an ISBN, and it returns a number that represents the price.

The Novice Developer (TND) has created a service, underneath which is a table that looks like this:

CREATE TABLE dbo.BookPrice
(
	ISBN CHAR(13) NOT NULL PRIMARY KEY
	,LatestPrice SMALLMONEY NULL
)

(TND probably uses Entity Framework or similar to talk to the database, but I’ll pretend that the usual SQL statements are being used.)

To get a price, the service sends the ISBN to Amazon, gets a response, and INSERTs into the table:

INSERT dbo.BookPrice(ISBN, LatestPrice)
	VALUES('9780575082076', 8.99)

, or does an UPDATE if the record already exists. So far, so good. Next time TND runs the service, there’s a problem; the book is out of stock, so Amazon doesn’t return a price. The code does this:

UPDATE dbo.BookPrice
	SET LatestPrice = NULL
	WHERE ISBN = '9780575082076'

An analyst looks at the data, and asks, “Well, why’s that price NULL?”. TND explains, and the analyst points out that there’s no way of telling what a NULL means. It could be any of:

  • The service never attempted to call out to Amazon
  • The service tried to make the call, but there was a network problem
  • The service made the call correctly, but got an error back from Amazon

A manager overhears the conversation, and asks the TND to create a scheduled task that retries calls that didn’t return a price. To facilitate this, the developer changes the table:

ALTER TABLE dbo.BookPrice
	ADD LatestStatus VARCHAR(255) NULL
	, LastOKPrice SMALLMONEY NULL

Now, the UPDATE becomes:

UPDATE dbo.BookPrice 
	SET LatestPrice = NULL
		,LatestStatus = 'Out of stock'
		,LastOKPrice = 8.99
	WHERE ISBN = '9780575082076'

To get the latest price, TND tells the analyst to just run the following query:

SELECT ActualLatestPrice = ISNULL(LatestPrice, LastOKPrice)
	FROM dbo.BookPrice
	WHERE ISBN = '9780575082076'

Sometime later, TND’s service requests the latest price, and the network is down. The service notes this accordingly:

UPDATE dbo.BookPrice
	SET LatestStatus = 'Network problem'
	WHERE ISBN = '9780575082076'

The analyst then points out that the company has SLAs and needs to know if external service calls are failing. TND thinks, and adds a new column:

ALTER TABLE dbo.BookPrice
	ADD NetworkProblemCount INT NULL

and UPDATEs the table again:

UPDATE dbo.BookPrice
	SET NetworkProblemCount = 1
	WHERE ISBN = '9780575082076'

“But we need to know when it happened”, says the analyst. So TND thinks again, and adds another new column:

ALTER TABLE dbo.BookPrice
	ADD LastNetworkProblemDate DATETIME NULL

and UPDATEs the table again:

UPDATE dbo.BookPrice
	SET LastNetworkProblemDate = GETDATE()
	WHERE ISBN = '9780575082076'

Back comes the analyst, “But when was the last time we actually successfully retrieved a price?”, so TND adds another column:

ALTER TABLE dbo.BookPrice
	ADD LastSuccessfulPriceRetrievalDate DATETIME NULL

… and hopefully by this point, you can agree with me that continually adding columns is not the way we should be doing this!

The most straightforward way I know of to record interactions with other systems, is via Request and Response tables. Take a look at the following DDL:

CREATE TABLE dbo.BookPrice_Request
(
	RequestID INT NOT NULL IDENTITY(1,1)
		CONSTRAINT PK_BookPrice_Request PRIMARY KEY CLUSTERED
	,ISBN CHAR(13) NOT NULL
	,CreatedOn DATETIME NOT NULL
		CONSTRAINT DF_BookPrice_Request_CreatedOn DEFAULT(GETDATE())
)

CREATE TABLE dbo.BookPrice_ResponseType
(
	ResponseTypeID TINYINT NOT NULL
		CONSTRAINT PK_BookPrice_ResponseType PRIMARY KEY CLUSTERED
	,[Description] VARCHAR(255) NOT NULL
	,CreatedOn DATETIME NOT NULL
		CONSTRAINT DF_BookPrice_ResponseType_CreatedOn DEFAULT(GETDATE())
)
INSERT dbo.BookPrice_ResponseType(ResponseTypeID, [Description])
	VALUES(10, 'Price obtained successfully')
		, (20, 'Out of stock')
		, (30, 'Network error')

CREATE TABLE dbo.BookPrice_Response
(
	ResponseID INT NOT NULL IDENTITY(1,1)
		CONSTRAINT PK_BookPrice_Response PRIMARY KEY CLUSTERED
	,RequestID INT NOT NULL
	,ResponseTypeID TINYINT NOT NULL
	,Price SMALLMONEY NULL
	,CreatedOn DATETIME NOT NULL
		CONSTRAINT DF_BookPrice_Response_CreatedOn DEFAULT(GETDATE())
	,CONSTRAINT FK_BookPrice_Response_RequestID
		FOREIGN KEY (RequestID)
		REFERENCES dbo.BookPrice_Request(RequestID)
	,CONSTRAINT FK_BookPrice_Response_ResponseTypeID
		FOREIGN KEY (ResponseTypeID)
		REFERENCES dbo.BookPrice_ResponseType(ResponseTypeID)
)

You can run the code above in a test database, and it’ll create three tables: a Request, a Response, and a ResponseType. When our internal service receives a request for a book price, it INSERTs into the Request table. When a reply is received from Amazon, we INSERT into the Response table, with (a) the RequestID we generated earlier, and (b) an indication of what type of response we received.

To recreate what was happening with TND’s service, run the following SQL:


-- Call #1
INSERT dbo.BookPrice_Request(ISBN)
	VALUES('9780575082076')
-- The calling code would get the newly-INSERTed RequestID from SCOPE_IDENTITY()
-- This is the first INSERT into the table, so the returned RequestID will be 1.
-- ... our service gets the response from Amazon...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(1, 10, 8.99)

-- Call #2
INSERT dbo.BookPrice_Request(ISBN)
	VALUES('9780575082076')
-- the RequestID returned is 2
-- ... our service gets the response 'Out of stock' from Amazon...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(2, 20, NULL)

-- Call #3
INSERT dbo.BookPrice_Request(ISBN)
	VALUES('9780575082076')
-- the RequestID returned is 3
-- ... our service call fails, there's a problem with the network...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(3, 30, NULL)

Let’s say a fourth service call was completely successful:


-- Call #4
INSERT dbo.BookPrice_Request(ISBN)
	VALUES('9780575082076')
-- the RequestID returned is 4
-- ... our service gets the response from Amazon...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(4, 10, 6.99) -- it's on offer at a reduced price

Then to get the latest price for a book (and when that price was obtained), the query is simply:


SELECT
	TOP 1
		rp.Price
		,rp.CreatedOn
	FROM dbo.BookPrice_Response rp
	JOIN dbo.BookPrice_Request rq
		ON rq.RequestID = rp.RequestID
	WHERE rq.ISBN = '9780575082076'
	AND rp.ResponseTypeID = 10
	ORDER BY rp.ResponseID DESC

Our Request and Response tables have a complete record of everything that has happened in our service, and it’s simple for the analyst to build reports around how efficient the service is, and if it’s working correctly.
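
For example, here are a couple of sketch queries the analyst might run over these tables:

-- 1. How are our calls faring, day by day, by outcome?
SELECT
		CAST(rq.CreatedOn AS DATE) AS RequestDate
		,rt.[Description]
		,COUNT(*) AS Calls
	FROM dbo.BookPrice_Request rq
	JOIN dbo.BookPrice_Response rp
		ON rp.RequestID = rq.RequestID
	JOIN dbo.BookPrice_ResponseType rt
		ON rt.ResponseTypeID = rp.ResponseTypeID
	GROUP BY CAST(rq.CreatedOn AS DATE), rt.[Description]
	ORDER BY RequestDate, rt.[Description]

-- 2. Which Requests never received a Response at all?
SELECT rq.*
	FROM dbo.BookPrice_Request rq
	LEFT JOIN dbo.BookPrice_Response rp
		ON rp.RequestID = rq.RequestID
	WHERE rp.ResponseID IS NULL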

Before I go, a quick ‘gotcha’ – it’s fresh in my mind because I noticed this in a particular service’s database a few days ago: every Request record had a Response. Now, while that’s certainly the perfect situation, it’s just not realistic; things go wrong. In this case, it turned out that the INSERT into the Request table was only being done at the end of the process, once a Response had been successfully received. So errors were being returned from the 3rd-party system, but the service code was bombing out – therefore we had no visibility of this in the database. It’s very important: the Request and Response have to be written back to the database as soon as they happen, without delay.
