Archive for March, 2016

The meaning of NULL, and why magic data is bad

Part 1: NULL

As a junior web developer, I remember other developers warning me about database NULLs: “you don’t really want to deal with them”, “code them out if you can”, “program around them”, “turn them into something else”. Dear reader, they were quite wrong.

For a piece of data, a NULL value can mean any of:

  1. We don’t know this.
  2. We don’t know this yet (but there’s an expectation we’ll know this at a later date).
  3. We don’t know this, because the question of “What is the value of X for object Y?” is not applicable here.
  4. We don’t know this, and the chances of us ever knowing it are practically zero.
  5. It is logically impossible for this data to exist.

Context usually clues us into which meaning of NULL we’re looking at; if a customer’s email address field is NULL, then in terms of the above:

  1. It hasn’t been asked for / provided / collected.
  2. It hasn’t been asked for / provided / collected yet, but we might be getting this data in the future.
  3. The customer doesn’t have an email address.
  4. The rest of the customer’s data is incorrect or missing, so we have no means of contacting them to find their email address.
  5. The ‘customer’ isn’t something capable of owning an email address(*).

Regardless, if a customer doesn’t have an email address in the system, any code that consumes customer data will have to cope in a sensible manner. If the code is a data entry form, then an empty text field will be displayed; but if the code does marketing mail-outs, then it’ll just have to skip that record.

(*) It could be that the table is ‘multi-use’, and the field makes no sense for some types of data.

Going back to meaning (5), a couple of better examples might be:

  • ‘O’ level GCSE results (at age 16):   My age cohort did GCEs, therefore it is logically impossible for anyone to ascertain my GCSE results.
  • Date of last gynaecological exam:   Clearly, this would never be applicable for anyone born genetically male.

(In multivariate analysis, these would be referred to as structural zeros, rather than the usual sampling zeros. “It was impossible for it to occur” vs. “We did not see it occur”.)

Despite NULL being the very embodiment of “no information”, sometimes “no information” is the information in itself! Trivially, e.g., a SQL query to find all customers without email addresses, will specifically look for the NULL value in that field. Data with NULLs in be indexed, same as any other data. You can even create a filtered index that goes straight to the NULL data:

CREATE INDEX IX_Customer_EmailIsNULL
  ON dbo.Customer(Email)
  WHERE Email IS NULL

So a NULL value is generally not something to be avoided, modified, or coded around. It is eminently useful, and a vital part of your data structures.


Part 2: Bad magic

Now, I started with part 1 because of a common pattern I see used in data capture, usually due to novice / junior / misguided developers. An example: I have a database table of addresses (called Address), with the usual fields. My company operates strictly within the UK, so in an effort to keep our data as clean as possible, we have a CHECK constraint on the Postcode field; not a foreign key to a table of all postcodes ever (who wants to maintain that??), but a simple check against the UK postcode format. The check will prevent entries like “unknown”, or mistakes like “SW1A IAA” (‘I’ instead of ‘1’). Also, the postcode is ‘NOT NULL’-able — because every address has a postcode, right?

It might look like this:

CREATE TABLE xyz.[Address]
(
	AddressID INT NOT NULL PRIMARY KEY
	,Line1 VARCHAR(255) NOT NULL
	,Line2 VARCHAR(255) NULL
	,Postcode VARCHAR(10) NOT NULL
		CHECK (Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
		OR Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]')
)

(Clearly the CHECK constraint isn’t exhaustive: as it is, it’ll reject SW1A 1AA, the postcode of Buckingham Palace. It’ll do for illustrating the point.)

If customer data is supplied without a postcode, then any INSERT will fail. What tends to happen, is that over time, you’ll see the Postcode field start to contain values like ZZ1 1ZZ; a value that passes our simple CHECK constraint rules, but is probably not a valid UK postcode.

So how did ZZ1 1ZZ get into the database?

Scenario 1a:

The developer coding the application form tried to INSERT a record with no postcode, thus the operation failed with an error. So in the input form, they put some code to change a blank postcode to ZZ1 1ZZ when INSERT-ing.

Scenario 1b:

The customer input form hooks up to an address validator; if the address cannot be validated, then the customer is asked to fill in all the fields themself, and can easily put in an invalid postcode which doesn’t make it past the simple check constraint on the Address table. The developer dutifully catches the error, changes the postcode to ZZ1 1ZZ and re-INSERTs.

Scenario 2:

A customer complained about being marketed to, and needs to be removed from the database as soon as possible. To do it properly would mean changing code in several systems; the quick hack is to change their postcode to ZZ1 1ZZ, then make sure the mail-out query ignores records with that postcode value. This is then adopted as semi-official practice: “To remove a customer from marketing, just set their postcode to ZZ1 1ZZ.”

There are multiple problems with having a postcode of ZZ1 1ZZ meaning ‘unknown’, ‘error’ or ‘do not contact’:

  1. It’s a ‘magic’ string; for it to have system-wide meaning, every single system must understand it, and what it represents. What if someone INSERT-ed ZZ2 2ZZ? It wouldn’t be similarly understood, it would be treated as a real postcode.
  2. Every new developer and analyst has to be told about the magic string. What if there’s a magic string for every piece of data? Ok, it could be solved by using VIEWs, but then that’s more code that has to be known about, and scrupulously maintained.
  3. What if, by some mistake, post is sent out to that postcode? (This will happen, I guarantee it.) One of your other systems is likely recording the fact that mail has been sent correctly, but the chances of it arriving are slim.
  4. The real postcode ZZ1 1ZZ might not exist now, but it may in the future: there are many examples of postcodes falling into and out of use. How will you know if your postcode is genuine, or a magic string? Take note: postcodes that start ZZ99 are real live NHS “pseudo-postcodes”…

As you’ve probably realised, my answer would be to make the postcode field NULL-able(*), and to INSERT a NULL in the case of missing or broken data, completely avoiding any magic strings. It needs no special handling, and contextually, it very probably has a limited range of well-understood meanings; e.g. if you see a field MiddleName that is NULL for some records, you would presume it to mean the Customer has no middle name.

Note this is why in the email example in Part 1, we shouldn’t use a blank string instead of a NULL – because a blank string is still a ‘magic’ string, just one that would happen to be widely understood. There will be cases when a blank string legitimately means something quite different to a NULL.

(*) I’ve heard people claim that fields with CHECK constraints can’t be NULL-able. In modern flavours of SQL Server, this is demonstrably false. If the field is NULL, the constraint just isn’t checked.


Part 3: Keeping track

By clearing up one locus of ambiguity, I’m afraid I’m going to introduce a new one. Presumably, we’re going to want to record why our postcode field is NULL. We can either:

(A) Create a new lookup table, ReasonForNull (say); add a new field, ReasonForNullID, to our Address table, add a suitable foreign key, and a CHECK constraint that says “if the Postcode is NULL, then ReasonForNullID must not be NULL – and vice versa”, e.g.:

ALTER TABLE xyz.[Address]
ADD CONSTRAINT CK_Address_PostcodeOrReason 
CHECK( (Postcode IS NOT NULL AND ReasonForNullID IS NULL)
	OR (Postcode IS NULL AND ReasonForNullID IS NOT NULL)
)

or

(B) Create our new lookup table (as above), but also create another new table, Address_ReasonForNull, like so:

CREATE TABLE xyz.Address_ReasonForNull
(
	AddressID INT NOT NULL
		CONSTRAINT PK_Address_ReasonForNull
		PRIMARY KEY CLUSTERED
	,ReasonForNullID TINYINT NOT NULL
	,CreatedOn DATETIME NOT NULL
		CONSTRAINT DF_Address_ReasonForNull_CreatedOn
		DEFAULT(GETDATE())
	,CONSTRAINT FK_Address_ReasonForNull_AddressID
		FOREIGN KEY (AddressID)
		REFERENCES xyz.Address(AddressID)
	,CONSTRAINT FK_Address_ReasonForNull_ReasonForNullID
		FOREIGN KEY (ReasonForNullID)
		REFERENCES xyz.ReasonForNull(ReasonForNullID)
)

and only INSERT into it when we have an invalid postcode.

Neither (A) nor (B) is a perfect solution. (A) will waste a byte per Address record (if ReasonForNullID is declared as a TINYINT) if the postcode is ok, but has the advantage of strictly maintaining integrity, thanks to the CHECK constraint. (B) wastes no space, but there is no simple way (that I know of) of enforcing that a child record must exist, given data in the parent record.

If we want to record, say, the postcode that was entered but not validated, then it’s no bother under scenario (B) to add a new field to our Address_ReasonForNull table:

ALTER TABLE xyz.Address_ReasonForNull
	ADD OriginalData VARCHAR(20) NULL

However, if we were doing (A), then we’d have to add this column to the main Address table (and change the CHECK constraint); potentially, we could waste a lot of space.

Personally, I’d favour (B), and would push for all data changes to be made via stored procedures (aka sprocs). That way, I can ensure that the data in my two tables is kept perfectly in sync.

Any thoughts or comments? Feel free to let us know!

, , , ,

Leave a comment