Archive for category architecture
As I wrote about here, I like to pass structured data around using XML (where applicable). While I think it’s the most flexible way, it’s understandable that people might not want to learn XML-parsing syntax, if they don’t have to. Another way of passing data into stored procedures (sprocs) is to use user-defined types (UDTs).
Here’s a simple example:
-- If the UDT already exists, delete it: IF EXISTS ( SELECT * FROM sys.types WHERE is_table_type = 1 AND [name] = 'ListOfDates' AND [schema_id] = SCHEMA_ID('dbo') ) BEGIN DROP TYPE dbo.ListOfDates END GO -- Create our UDT: CREATE TYPE dbo.ListOfDates AS TABLE ( DateTypeID TINYINT NOT NULL ,[Date] DATE NOT NULL ) GO -- Let's test it out: SET DATEFORMAT YMD DECLARE @MyListOfDates AS dbo.ListOfDates INSERT @MyListOfDates(DateTypeID, [Date]) VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03') ,(2, '2016Oct10'),(2, '2017Jan01') SELECT * FROM @MyListOfDates GO DateTypeID Date ---------- ---------- 1 2016-01-01 1 2016-01-02 1 2016-01-03 2 2016-10-10 2 2017-01-01
Hopefully, that looks straightforward; the way we’ve used it here is not dissimilar to a table variable (e.g. DECLARE @MyTable AS TABLE...), but we can’t pass a table variable to a sproc. We’ll create a test sproc and pass our new user-defined type to it:
-- If the sproc already exists, delete it: IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL BEGIN DROP PROCEDURE dbo.usp_MySproc END GO -- Create our sproc: CREATE PROCEDURE dbo.usp_MySproc ( @ListOfDates AS dbo.ListOfDates READONLY ) AS BEGIN SELECT DateTypeID ,MIN([Date]) AS MinDate ,MAX([Date]) AS MaxDate FROM @ListOfDates GROUP BY DateTypeID ORDER BY DateTypeID END GO -- Test it out: DECLARE @MyListOfDates AS dbo.ListOfDates INSERT @MyListOfDates(DateTypeID, [Date]) VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03') ,(2, '2016Oct10'),(2, '2017Jan01') ,(3, '2017Feb01'),(3, '2017Feb03'),(3, '2017Feb28') EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates GO DateTypeID MinDate MaxDate ---------- ---------- ---------- 1 2016-01-01 2016-01-03 2 2016-10-10 2017-01-01 3 2017-02-01 2017-02-28
See the READONLY in the CREATE PROCEDURE? Here’s what happens when you omit it:
The table-valued parameter "@ListOfDates" must be declared with the READONLY option.
This is one of the limitations of user-defined types: they can’t be used to pass data back. So we couldn’t make any changes to @ListOfDates in the sproc, and see those changes reflected outside of the sproc.
There’s another limitation: UDTs can’t be used cross-database. Here’s what happens:
The type name 'MyOtherDatabase.dbo.ListOfDates' contains more than the maximum number of prefixes. The maximum is 1.
Even if you create the exact same type in another database, with the same name, it won’t work.
(Note that UDTs don’t have to be tables: go here [MSDN] for more info.)
Just for kicks, here’s how I’d replicate the above functionality, but using XML:
-- If the sproc already exists, delete it: IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL BEGIN DROP PROCEDURE dbo.usp_MySproc END GO -- Create the sproc: CREATE PROCEDURE dbo.usp_MySproc ( @ListOfDates XML ) AS BEGIN WITH cte_ExtractFromXML AS ( SELECT DateTypeID = d.i.value('(@type)', 'INT') ,[Date] = d.i.value('.', 'DATE') FROM @ListOfDates.nodes('//date') d(i) ) SELECT DateTypeID ,MIN([Date]) AS MinDate ,MAX([Date]) AS MaxDate FROM cte_ExtractFromXML GROUP BY DateTypeID ORDER BY DateTypeID END GO -- Test it with some XML data: DECLARE @MyListOfDates XML SET @MyListOfDates = ' <listOfDates> <date type="1">2016Jan01</date> <date type="1">2016Jan02</date> <date type="1">2016Jan03</date> <date type="2">2016Oct10</date> <date type="2">2017Jan01</date> <date type="3">2017Feb01</date> <date type="3">2017Feb03</date> <date type="3">2017Feb28</date> </listOfDates>' EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates GO DateTypeID MinDate MaxDate ----------- ---------- ---------- 1 2016-01-01 2016-01-03 2 2016-10-10 2017-01-01 3 2017-02-01 2017-02-28
If we wanted to, we could declare our @ListOfDates parameter as OUTPUT, and make changes to it in the sproc, e.g.:
Drop the sproc if it exists: IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL BEGIN DROP PROCEDURE dbo.usp_MySproc END GO -- Create our (new and improved) sproc: CREATE PROCEDURE dbo.usp_MySproc ( @ListOfDates XML OUTPUT ) AS BEGIN DECLARE @Aggregated XML ;WITH cte_ExtractFromXML AS ( SELECT DateTypeID = d.i.value('(@type)', 'INT') ,[Date] = d.i.value('.', 'DATE') FROM @ListOfDates.nodes('//date') d(i) ) SELECT @Aggregated = ( SELECT DateTypeID AS '@id' ,MIN([Date]) AS '@MinDate' ,MAX([Date]) AS '@MaxDate' FROM cte_ExtractFromXML GROUP BY DateTypeID ORDER BY DateTypeID FOR XML PATH('DataType'), ROOT('Aggregated') ) SELECT @ListOfDates = ( SELECT @ListOfDates, @Aggregated FOR XML PATH('Output'), TYPE ) END GO -- Run it: DECLARE @MyListOfDates XML SET @MyListOfDates = ' <listOfDates> <date type="1">2016Jan01</date> <date type="1">2016Jan02</date> <date type="1">2016Jan03</date> <date type="2">2016Oct10</date> <date type="2">2017Jan01</date> <date type="3">2017Feb01</date> <date type="3">2017Feb03</date> <date type="3">2017Feb28</date> </listOfDates>' EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates OUTPUT SELECT @MyListOfDates AS MyListOfDates GO <Output> <listOfDates> <date type="1">2016Jan01</date> <date type="1">2016Jan02</date> <date type="1">2016Jan03</date> <date type="2">2016Oct10</date> <date type="2">2017Jan01</date> <date type="3">2017Feb01</date> <date type="3">2017Feb03</date> <date type="3">2017Feb28</date> </listOfDates> <Aggregated> <DataType id="1" MinDate="2016-01-01" MaxDate="2016-01-03" /> <DataType id="2" MinDate="2016-10-10" MaxDate="2017-01-01" /> <DataType id="3" MinDate="2017-02-01" MaxDate="2017-02-28" /> </Aggregated> </Output>
As you can see, we’ve returned our original list, along with the aggregated data, both under a new parent node.
In my previous post, I wrote about GUIDs (aka UUIDs), marvellous datatypes that allow you to uniquely identify a piece of information. Because the probability of randomly generating two the same is so small, there’s a sense that merely possessing one as a key to a piece of data, means the data is somehow secured. This is false, in two senses — the first sense, not so critical; the second, very critical.
Before that, let’s look at how they might be used. I’ll use URLs / webpages as an example, because it’s a common application.
Let’s say I’ve written my own blogging system. The text of each article is stored in a database table, and the primary key of the table uniquely identifies my blog posts. When I list my blog posts so people can click on them, the URL might look like this:
The unique identifier — in this example, the number 17 — forms part of the URL, and someone can look at all my blog posts merely by changing the number after the equals sign. It’s ok, it’s only a blog; the worst they can do is put in a number of a post I haven’t published yet. Depending on how I’ve written my blogging engine, they’ll either see an unfinished post, or (preferably) a message saying there’s currently no live article with that id.
In the bad old days of internet website development, lesser developers used integer IDs to refer to data more sensitive than a blog post:
(Of course, I’m making these URLs up!) The website might ask you to log into some part of the site via a secure method, but once you were in, you’d be able to view the records of anyone on the system! Clearly a terrible thing, but this type of security hole has always existed; here’s an example from a few years ago, concerning Apple’s password reset process.
But it’s ok, we know about GUIDs now! GUIDs to the rescue! If the URL looks like this:
, and the website retrieves each patient’s private information via a GUID, then we’re perfectly safe, no?
No, not safe (sense 1)
In the sense that a GUID is ‘hard for a human to guess’, yes, you’re pretty safe. No-one is going to randomly type in some digits and hit upon a valid GUID. BUT:
1. What if it’s not someone typing? What if it’s a piece of software making thousands of guesses per second? What if it’s a distributed botnet, millions of pieces of software each making thousands of guesses per second? * And your site has millions of users? The probability of making at least one correct guess would not be negligible.
[* Presuming your server can cope with all those requests…]
2. What if your GUID isn’t all that random? There are people out there who, given enough examples, can derive information about your server’s random number generator, and make accurate guesses about historical and future GUIDs. The situation is worse still if you’ve used sequential GUIDs; and potentially catastrophic if the developer who’s written the code hasn’t really understood the point of them (see previous post).
Point 1 can be addressed by putting checks in place for incorrect guesses. Modern server set-ups will allow you to block IP addresses after unusual patterns of requests have been detected.
Point 2 is harder to address: how truly random is the code that generates your GUIDs? In practice, you just don’t know. The study of random number generation is an entire academic discipline of study, you could devote the rest of your life to assessing various random number generators!
But the above doesn’t matter because…
No, not safe (sense 2)
This is the crux of the matter: obscurity (hiding something) is not security. All it does is make something harder to locate, not impossible. Just because you can’t guess a GUID, it doesn’t mean there aren’t other ways of obtaining them, e.g.:
- Leaking (e.g. via accidental email forwarding)
- Social engineering
- Mistakes (e.g. devs ‘temporarily’ storing them in webpage HTML)
The point is that proper security needs to be applied on top, in all cases. Where sensitive information is concerned, people should be logging in securely, with as robust a system as it’s possible to use. So even if a bad person stumbles across information (e.g. GUIDs) they shouldn’t have, they can’t use it, because they don’t have the requisite access.
To sum up, using a GUID on its own is nowhere near adequate-enough security, there’s much more to be done. A final tip: Given that there’s no such thing as “100% secure”, your goal should be to record all user activity on your site / app, and dive into it regularly to check for breaches or unusual patterns. As a bonus, you get to see how users are really using your software — I promise there’ll be some surprises in there!
If you know what GUIDs are, please click here to skip to the part where I talk about using them in databases.
GUIDs (Globally Unique Identifiers, aka UUIDs) are simply a string of 32 random hexadecimal numbers (that is: characters 0-9 and A-F), separated into five groups by dashes.
Here’s a GUID:
All modern languages are able to generate them; here’s how I generated it in SQL:
SELECT NEWID() GO ------------------------------------ C2DFE25B-6C1B-46B3-9497-DA45EF76D994 (1 row(s) affected)
A GUID is simply a big random number, presented in a human-readable form. How big is ‘big’? With 32 hex digits, it means a GUID can take any of 16^32 = 2^128 values. (2^128 is approximately 3.4 x 10^38)
GUIDs are big. They’re so big, that you could label every atom in the universe using just 3 GUIDs. In fact, it’d be massive overkill: 3 GUIDs have a potential 2^384 values between them, which is equal to 3.940201 x 10^115; the number of atoms in the universe is estimated at 10^82, many orders of magnitude less.
Because GUIDs can take such an enormous range of values, the chances of generating a duplicate are minuscule. Quote:
“In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.” (wikipedia)
The ‘U’ in GUID basically means ‘unique for all practical purposes you’re likely to ever be involved with’ (unless you work for CERN, in which case I take it back).
So, that’s great: we have this construct that for all intents and purposes is unique (and I’ll treat it as such from here on), and we can generate one any time we want one. But how are they used?
The most common usage of GUIDs is as keys for referring to other pieces of information, especially a block of structured information. For example, when I request a customer’s credit file, there’s a GUID, right near the top of the file. If I need to refer to that credit file again (whether inside my organisation, or with the issuing bureau), I can refer to it by the GUID, and we all know exactly which file I mean — not just the customer/address it refers to, but the data as it stood at that point.
Now, database tables need a primary key to identify each row – and, by definition, the value of the key has to be unique. So it would seem a natural thing to want to have a GUID as a primary key. Even better: not only will we ensure that every row in our table will be unique, but every row in every table can be uniquely identified, in every database in the world! And you don’t even need to request a GUID from your database server when you create the data for a row, you can pre-generate primary keys in your C# code, and use them before they ever need to be stored on the server!
Sounds too good to be true, so what’s the catch?
First off, most developers, analysts (and even DBAs) talk about ‘primary keys’ when they mean clustering keys – often, they’re the same piece of information, but they absolutely don’t have to be. The primary key is the piece of data that uniquely identifies a row in a table. The clustering key is the piece of data that determines the order of the data when it’s stored (on disk). More often than not, a straightforward incrementing integer (1,2,3…) can do the job of both, but it’s an informed choice that the database developer should be making.
When the clustering key is an incrementing integer, organising the data on disk is easy: the data goes in the next available slot. But when it’s (effectively) a random number, where does it go? The database has to make guesses about how much data there’s likely to be; guesses that it’ll have to re-assess every time a new row needs to be INSERTed into the database – worst case, it’s re-organising the data on disk every few INSERTs. This is really inefficient, and causes unnecessary stress on your server.
Internally in SQL Server, GUIDs take up 16 bytes of space, compared to the 4 bytes of an INT, or 8 bytes of a BIGINT. That’s not a major issue, unless you have lots of indexes on your table: indexes on tables automatically contain the clustering key, so with a GUID clustering key, every single index defined on that table will also contain the GUID. Potentially lots of valuable space used up, if you’re not careful.
Ok, let’s list some bad points about GUIDs as clustering keys:
- They cause inefficiencies under the hood: the server can’t make it’s usual good guesses about where to store data. NB: There is such a thing as a SEQUENTIAL GUID (Info here at MSDN), which lessens the impact – personally, I still wouldn’t bother.
- They take up four times more space than traditional INTs, which could be a problem if you have lots of indexes.
- Table JOINs are slower; SQL Server is optimised for joining tables together via simple integers.
There’s another (very important) reason not to use them that people tend to overlook: it makes debugging and tracking down errors incredibly painful! Incrementing numbers are intuitive, easy to memorise (if they’re small enough), easy to compare (“x+z” happened after “x”)… but GUIDs are just a ‘blob’ of data, there’s nothing intuitive about them.
How to use GUIDs, pain-free
It’s simple: add a GUID as a normal column and index it!
ALTER TABLE dbo.Person ADD COLUMN PersonUID UNIQUEIDENTIFIER NULL GO -- ...UPDATE the table to fill PersonUID here ... CREATE NONCLUSTERED INDEX IX_Person_PersonUID ON dbo.Person(PersonUID) GO
That’s as complex as it needs to be.
Sad note: In the past, I’ve had developers add GUIDs that looked like this:
00000000-0000-0000-0000-000000000001 00000000-0000-0000-0000-000000000002 00000000-0000-0000-0000-000000000003 ...
, thus completely missing the point of using a GUID in the first place. If you’re not auto-generating the GUIDs in-database, make sure to check devs aren’t putting daft data in. (Obviously something you should do anyway!)
Finally, we have our highly unique piece of information, stored in the best way possible. So now we can use that in a URL on a public-facing website to reference people’s data securely, no? Sorry, that’s not quite right — a little issue of this thing called Security Through Obscurity, which I’ll write about next time.
Part 1: NULL
As a junior web developer, I remember other developers warning me about database NULLs: “you don’t really want to deal with them”, “code them out if you can”, “program around them”, “turn them into something else”. Dear reader, they were quite wrong.
For a piece of data, a NULL value can mean any of:
- We don’t know this.
- We don’t know this yet (but there’s an expectation we’ll know this at a later date).
- We don’t know this, because the question of “What is the value of X for object Y?” is not applicable here.
- We don’t know this, and the chances of us ever knowing it are practically zero.
- It is logically impossible for this data to exist.
Context usually clues us into which meaning of NULL we’re looking at; if a customer’s email address field is NULL, then in terms of the above:
- It hasn’t been asked for / provided / collected.
- It hasn’t been asked for / provided / collected yet, but we might be getting this data in the future.
- The customer doesn’t have an email address.
- The rest of the customer’s data is incorrect or missing, so we have no means of contacting them to find their email address.
- The ‘customer’ isn’t something capable of owning an email address(*).
Regardless, if a customer doesn’t have an email address in the system, any code that consumes customer data will have to cope in a sensible manner. If the code is a data entry form, then an empty text field will be displayed; but if the code does marketing mail-outs, then it’ll just have to skip that record.
(*) It could be that the table is ‘multi-use’, and the field makes no sense for some types of data.
Going back to meaning (5), a couple of better examples might be:
- ‘O’ level GCSE results (at age 16): My age cohort did GCEs, therefore it is logically impossible for anyone to ascertain my GCSE results.
- Date of last gynaecological exam: Clearly, this would never be applicable for anyone born genetically male.
(In multivariate analysis, these would be referred to as structural zeros, rather than the usual sampling zeros. “It was impossible for it to occur” vs. “We did not see it occur”.)
Despite NULL being the very embodiment of “no information”, sometimes “no information” is the information in itself! Trivially, e.g., a SQL query to find all customers without email addresses, will specifically look for the NULL value in that field. Data with NULLs in be indexed, same as any other data. You can even create a filtered index that goes straight to the NULL data:
CREATE INDEX IX_Customer_EmailIsNULL ON dbo.Customer(Email) WHERE Email IS NULL
So a NULL value is generally not something to be avoided, modified, or coded around. It is eminently useful, and a vital part of your data structures.
Part 2: Bad magic
Now, I started with part 1 because of a common pattern I see used in data capture, usually due to novice / junior / misguided developers. An example: I have a database table of addresses (called Address), with the usual fields. My company operates strictly within the UK, so in an effort to keep our data as clean as possible, we have a CHECK constraint on the Postcode field; not a foreign key to a table of all postcodes ever (who wants to maintain that??), but a simple check against the UK postcode format. The check will prevent entries like “unknown”, or mistakes like “SW1A IAA” (‘I’ instead of ‘1’). Also, the postcode is ‘NOT NULL’-able — because every address has a postcode, right?
It might look like this:
CREATE TABLE xyz.[Address] ( AddressID INT NOT NULL PRIMARY KEY ,Line1 VARCHAR(255) NOT NULL ,Line2 VARCHAR(255) NULL ,Postcode VARCHAR(10) NOT NULL CHECK (Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]' OR Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]') )
(Clearly the CHECK constraint isn’t exhaustive: as it is, it’ll reject SW1A 1AA, the postcode of Buckingham Palace. It’ll do for illustrating the point.)
If customer data is supplied without a postcode, then any INSERT will fail. What tends to happen, is that over time, you’ll see the Postcode field start to contain values like ZZ1 1ZZ; a value that passes our simple CHECK constraint rules, but is probably not a valid UK postcode.
So how did ZZ1 1ZZ get into the database?
The developer coding the application form tried to INSERT a record with no postcode, thus the operation failed with an error. So in the input form, they put some code to change a blank postcode to ZZ1 1ZZ when INSERT-ing.
The customer input form hooks up to an address validator; if the address cannot be validated, then the customer is asked to fill in all the fields themself, and can easily put in an invalid postcode which doesn’t make it past the simple check constraint on the Address table. The developer dutifully catches the error, changes the postcode to ZZ1 1ZZ and re-INSERTs.
A customer complained about being marketed to, and needs to be removed from the database as soon as possible. To do it properly would mean changing code in several systems; the quick hack is to change their postcode to ZZ1 1ZZ, then make sure the mail-out query ignores records with that postcode value. This is then adopted as semi-official practice: “To remove a customer from marketing, just set their postcode to ZZ1 1ZZ.”
There are multiple problems with having a postcode of ZZ1 1ZZ meaning ‘unknown’, ‘error’ or ‘do not contact’:
- It’s a ‘magic’ string; for it to have system-wide meaning, every single system must understand it, and what it represents. What if someone INSERT-ed ZZ2 2ZZ? It wouldn’t be similarly understood, it would be treated as a real postcode.
- Every new developer and analyst has to be told about the magic string. What if there’s a magic string for every piece of data? Ok, it could be solved by using VIEWs, but then that’s more code that has to be known about, and scrupulously maintained.
- What if, by some mistake, post is sent out to that postcode? (This will happen, I guarantee it.) One of your other systems is likely recording the fact that mail has been sent correctly, but the chances of it arriving are slim.
- The real postcode ZZ1 1ZZ might not exist now, but it may in the future: there are many examples of postcodes falling into and out of use. How will you know if your postcode is genuine, or a magic string? Take note: postcodes that start ZZ99 are real live NHS “pseudo-postcodes”…
As you’ve probably realised, my answer would be to make the postcode field NULL-able(*), and to INSERT a NULL in the case of missing or broken data, completely avoiding any magic strings. It needs no special handling, and contextually, it very probably has a limited range of well-understood meanings; e.g. if you see a field MiddleName that is NULL for some records, you would presume it to mean the Customer has no middle name.
Note this is why in the email example in Part 1, we shouldn’t use a blank string instead of a NULL – because a blank string is still a ‘magic’ string, just one that would happen to be widely understood. There will be cases when a blank string legitimately means something quite different to a NULL.
(*) I’ve heard people claim that fields with CHECK constraints can’t be NULL-able. In modern flavours of SQL Server, this is demonstrably false. If the field is NULL, the constraint just isn’t checked.
Part 3: Keeping track
By clearing up one locus of ambiguity, I’m afraid I’m going to introduce a new one. Presumably, we’re going to want to record why our postcode field is NULL. We can either:
(A) Create a new lookup table, ReasonForNull (say); add a new field, ReasonForNullID, to our Address table, add a suitable foreign key, and a CHECK constraint that says “if the Postcode is NULL, then ReasonForNullID must not be NULL – and vice versa”, e.g.:
ALTER TABLE xyz.[Address] ADD CONSTRAINT CK_Address_PostcodeOrReason CHECK( (Postcode IS NOT NULL AND ReasonForNullID IS NULL) OR (Postcode IS NULL AND ReasonForNullID IS NOT NULL) )
(B) Create our new lookup table (as above), but also create another new table, Address_ReasonForNull, like so:
CREATE TABLE xyz.Address_ReasonForNull ( AddressID INT NOT NULL CONSTRAINT PK_Address_ReasonForNull PRIMARY KEY CLUSTERED ,ReasonForNullID TINYINT NOT NULL ,CreatedOn DATETIME NOT NULL CONSTRAINT DF_Address_ReasonForNull_CreatedOn DEFAULT(GETDATE()) ,CONSTRAINT FK_Address_ReasonForNull_AddressID FOREIGN KEY (AddressID) REFERENCES xyz.Address(AddressID) ,CONSTRAINT FK_Address_ReasonForNull_ReasonForNullID FOREIGN KEY (ReasonForNullID) REFERENCES xyz.ReasonForNull(ReasonForNullID) )
and only INSERT into it when we have an invalid postcode.
Neither (A) nor (B) is a perfect solution. (A) will waste a byte per Address record (if ReasonForNullID is declared as a TINYINT) if the postcode is ok, but has the advantage of strictly maintaining integrity, thanks to the CHECK constraint. (B) wastes no space, but there is no simple way (that I know of) of enforcing that a child record must exist, given data in the parent record.
If we want to record, say, the postcode that was entered but not validated, then it’s no bother under scenario (B) to add a new field to our Address_ReasonForNull table:
ALTER TABLE xyz.Address_ReasonForNull ADD OriginalData VARCHAR(20) NULL
However, if we were doing (A), then we’d have to add this column to the main Address table (and change the CHECK constraint); potentially, we could waste a lot of space.
Personally, I’d favour (B), and would push for all data changes to be made via stored procedures (aka sprocs). That way, I can ensure that the data in my two tables is kept perfectly in sync.
Any thoughts or comments? Feel free to let us know!
“Yeah, yeah, but your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.” — Dr Ian Malcolm, Jurassic Park
(Forgive the over-used quote above, but it feels apt — I think you’ll see why.)
In our hypothetical situation, we have two types of Employee: a Manager and an Agent. So to begin with, we have Employee and EmployeeType tables:
CREATE TABLE dbo.EmployeeType ( EmployeeTypeID TINYINT NOT NULL CONSTRAINT PK_EmployeeType PRIMARY KEY CLUSTERED ,[Description] VARCHAR(50) NOT NULL ) GO INSERT dbo.EmployeeType(EmployeeTypeID, [Description]) VALUES(1, 'Agent'), (2, 'Manager') GO
We’ll say that Agents get a ‘terminal code’ (to access their terminals); Managers don’t need a terminal code, but they do get to use the staff car park, so we’ll need to store their car registration — but only for Managers, Agents aren’t allowed to park at work.
CREATE TABLE dbo.Employee ( EmployeeID INT NOT NULL CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED ,EmployeeTypeID TINYINT NOT NULL ,[Name] VARCHAR(50) NOT NULL ,StartDate DATE NOT NULL -- 1. Agent-specific ,TerminalCode VARCHAR(5) NULL -- 2. Manager-specific ,CarReg VARCHAR(10) NULL ,CONSTRAINT FK_Employee_EmployeeTypeID FOREIGN KEY (EmployeeTypeID) REFERENCES dbo.EmployeeType(EmployeeTypeID) ) GO
Let’s put some data in:
INSERT dbo.Employee ( EmployeeID ,EmployeeTypeID ,[Name] ,StartDate ,TerminalCode ,CarReg ) VALUES(1, 1, 'Alan Smith', '2001Jan01', 'AS111', NULL) ,(2, 1, 'Brenda Thomson', '2002Feb02', 'BT222', NULL) ,(3, 1, 'Colin Underwood', '2003Mar03', 'CU333', NULL) ,(4, 2, 'Diana Varney', '2004Apr04', NULL, 'DV65 QAZ') ,(5, 2, 'Edward Wilson', '2005May05', NULL, 'EW65 WSX') GO
So far, so straightforward. Now let’s muck things up:
UPDATE dbo.Employee SET CarReg = 'AB91 EDC' WHERE EmployeeID = 1 GO
which of course works, but now our mere Agent gets to park for free. (How disastrous!) We can fix this — we’ll put the NULL back in the CarReg field, and add a CHECK constraint to stop it happening again:
UPDATE dbo.Employee SET CarReg = NULL WHERE EmployeeID = 1 GO ALTER TABLE dbo.Employee ADD CONSTRAINT CK_Employee_CheckFields CHECK ( CASE WHEN EmployeeTypeID = 1 THEN CASE WHEN TerminalCode IS NOT NULL AND CarReg IS NULL THEN 1 ELSE 0 END WHEN EmployeeTypeID = 2 THEN CASE WHEN TerminalCode IS NULL AND CarReg IS NOT NULL THEN 1 ELSE 0 END END = 1 ) GO
If we try and set the CarReg for an Agent now, it’ll fail:
The UPDATE statement conflicted with the CHECK constraint "CK_Employee_CheckFields"
Well, that works! Job done, no? Not really… every time we need to add or remove Agent- or Manager-specific fields, we’ll have to change that CHECK constraint; what if there were 100 fields? The constraint could get pretty unwieldy. Not to mention the wasted column space: Agent fields will always be NULL for Managers, and vice versa.
Drop the CHECK constraint, the TerminalCode and CarReg fields in Employee; and then we’ll create two new tables to hold the type-specific data:
CREATE TABLE dbo.AgentData ( EmployeeID INT NOT NULL CONSTRAINT PK_AgentData PRIMARY KEY CLUSTERED ,TerminalCode VARCHAR(5) NULL ) GO INSERT dbo.AgentData(EmployeeID, TerminalCode) VALUES(1, 'AS111'), (2, 'BT222'), (3, 'CU333') GO CREATE TABLE dbo.ManagerData ( EmployeeID INT NOT NULL CONSTRAINT PK_ManagerData PRIMARY KEY CLUSTERED ,CarReg VARCHAR(10) NULL ) GO INSERT dbo.ManagerData(EmployeeID, CarReg) VALUES(4, 'DV65 QAZ'), (5, 'EW65 WSX') GO
(Foreign keys on EmployeeID removed for clarity.)
That’s better! Except how do we enforce that AgentData only holds EmployeeIDs that relate to Agents, and that ManagerData only holds EmployeeIDs that relate to Managers? We could use a trigger, absolutely. But for the purposes of this post, we have an irrational fear of triggers and don’t want to use them! Can we enforce our rules with just keys?
(And here’s where the quote at the beginning becomes relevant…) Yes, I think we can, but I don’t know that we should. The following solution seems to work, but I don’t know if I’d ever use it in practise, it feels over-the-top. (I should also add, I’ve not tested it in a production-like environment, it might be unsuitable for some reason I’ve not yet discerned.) Anyway, here it is.
Starting from scratch:
CREATE TABLE dbo.Employee ( EmployeeID INT NOT NULL CONSTRAINT UQ_Employee_EmployeeID UNIQUE ,EmployeeTypeID TINYINT NOT NULL ,[Name] VARCHAR(50) NOT NULL ,StartDate DATE NOT NULL ,CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED (EmployeeID, EmployeeTypeID) ,CONSTRAINT FK_Employee_EmployeeTypeID FOREIGN KEY (EmployeeTypeID) REFERENCES dbo.EmployeeType(EmployeeTypeID) )
We’ve given the Employee table a composite PRIMARY KEY, comprised of the EmployeeID, and the EmployeeTypeID. The EmployeeID also has a UNIQUE index on it, for two reasons: (1) obviously, to stop the same EmployeeID having multiple types, and (2) so that other tables can reference the EmployeeID in a FOREIGN KEY, without having to include the EmployeeTypeID as well. Without the UNIQUE constraint, we’d get the following error:
There are no primary or candidate keys in the referenced table 'dbo.Employee' that match the referencing column list in the foreign key
, but with it, we’re ok. We’ll INSERT the same employee details as before:
INSERT dbo.Employee ( EmployeeID ,EmployeeTypeID ,[Name] ,StartDate ) VALUES(1, 1, 'Alan Smith', '2001Jan01') ,(2, 1, 'Brenda Thomson', '2002Feb02') ,(3, 1, 'Colin Underwood', '2003Mar03') ,(4, 2, 'Diana Varney', '2004Apr04') ,(5, 2, 'Edward Wilson', '2005May05') GO
Here’s a slightly icky bit: we’re going to fix the EmployeeTypeID of 1 in the AgentData table:
CREATE TABLE dbo.AgentData ( EmployeeID INT NOT NULL CONSTRAINT PK_AgentData PRIMARY KEY CLUSTERED ,EmployeeTypeID AS CAST(1 AS TINYINT) PERSISTED ,TerminalCode VARCHAR(5) NULL ,CONSTRAINT FK_AgentData_EmployeeID_EmployeeTypeID FOREIGN KEY (EmployeeID, EmployeeTypeID) REFERENCES dbo.Employee(EmployeeID, EmployeeTypeID) ) GO
, and fix the EmployeeTypeID of 2 in the ManagerData table:
CREATE TABLE dbo.ManagerData ( EmployeeID INT NOT NULL CONSTRAINT PK_ManagerData PRIMARY KEY CLUSTERED ,EmployeeTypeID AS CAST(2 AS TINYINT) PERSISTED ,CarReg VARCHAR(10) NULL ,CONSTRAINT FK_ManagerData_EmployeeID_EmployeeTypeID FOREIGN KEY (EmployeeID, EmployeeTypeID) REFERENCES dbo.Employee(EmployeeID, EmployeeTypeID) ) GO
Where both tables have a FOREIGN KEY onto the Employee table that includes the EmployeeTypeID, there’s no way to get the types mixed up: an Agent has to be an Agent in both tables, likewise for Managers.
Let’s INSERT the data and check we can query it without any problems:
INSERT dbo.AgentData(EmployeeID, TerminalCode) VALUES(1, 'AS111'), (2, 'BT222'), (3, 'CU333') GO INSERT dbo.ManagerData(EmployeeID, CarReg) VALUES(4, 'DV65 QAZ'), (5, 'EW65 WSX') GO SELECT e.* ,ad.TerminalCode ,md.CarReg FROM dbo.Employee e LEFT JOIN dbo.AgentData ad ON ad.EmployeeID = e.EmployeeID LEFT JOIN dbo.ManagerData md ON md.EmployeeID = e.EmployeeID ORDER BY e.EmployeeID GO
EmployeeID EmployeeTypeID Name StartDate TerminalCode CarReg ----------- -------------- ----------------- ---------- ------------ ---------- 1 1 Alan Smith 2001-01-01 AS111 NULL 2 1 Brenda Thomson 2002-02-02 BT222 NULL 3 1 Colin Underwood 2003-03-03 CU333 NULL 4 2 Diana Varney 2004-04-04 NULL DV65 QAZ 5 2 Edward Wilson 2005-05-05 NULL EW65 WSX (5 row(s) affected)
We can also add FOREIGN KEY references from other tables, pointing at the EmployeeIDs on the AgentData and ManagerData tables, and get the expected behaviour, e.g.:
CREATE TABLE dbo.ManagerEmail ( EmployeeID INT NOT NULL CONSTRAINT PK_ManagerEmail PRIMARY KEY CLUSTERED ,EmailAddress VARCHAR(255) NOT NULL ,CONSTRAINT FK_ManagerEmail_EmployeeID FOREIGN KEY (EmployeeID) REFERENCES dbo.ManagerData(EmployeeID) ) GO INSERT dbo.ManagerEmail(EmployeeID, EmailAddress) VALUES(4, 'email@example.com') GO -- works fine, Diana is a Manager INSERT dbo.ManagerEmail(EmployeeID, EmailAddress) VALUES(3, 'firstname.lastname@example.org') GO -- Error: -- "The INSERT statement conflicted with the -- FOREIGN KEY constraint "FK_ManagerEmail_EmployeeID".
What do you think? Does this work for you, or would a trigger be more straightforward? As always, please let me know in the comments, I’d love to hear from you.