Posts Tagged addresses

The meaning of NULL, and why magic data is bad

Part 1: NULL

As a junior web developer, I remember other developers warning me about database NULLs: “you don’t really want to deal with them”, “code them out if you can”, “program around them”, “turn them into something else”. Dear reader, they were quite wrong.

For a piece of data, a NULL value can mean any of:

  1. We don’t know this.
  2. We don’t know this yet (but there’s an expectation we’ll know this at a later date).
  3. We don’t know this, because the question of “What is the value of X for object Y?” is not applicable here.
  4. We don’t know this, and the chances of us ever knowing it are practically zero.
  5. It is logically impossible for this data to exist.

Context usually clues us into which meaning of NULL we’re looking at; if a customer’s email address field is NULL, then in terms of the above:

  1. It hasn’t been asked for / provided / collected.
  2. It hasn’t been asked for / provided / collected yet, but we might be getting this data in the future.
  3. The customer doesn’t have an email address.
  4. The rest of the customer’s data is incorrect or missing, so we have no means of contacting them to find their email address.
  5. The ‘customer’ isn’t something capable of owning an email address(*).

Regardless, if a customer doesn’t have an email address in the system, any code that consumes customer data will have to cope in a sensible manner. If the code is a data entry form, then an empty text field will be displayed; but if the code does marketing mail-outs, then it’ll just have to skip that record.

(*) It could be that the table is ‘multi-use’, and the field makes no sense for some types of data.

Going back to meaning (5), a couple of better examples might be:

  • ‘O’ level GCSE results (at age 16):   My age cohort did GCEs, therefore it is logically impossible for anyone to ascertain my GCSE results.
  • Date of last gynaecological exam:   Clearly, this would never be applicable for anyone born genetically male.

(In multivariate analysis, these would be referred to as structural zeros, rather than the usual sampling zeros. “It was impossible for it to occur” vs. “We did not see it occur”.)

Despite NULL being the very embodiment of “no information”, sometimes “no information” is the information in itself! Trivially, e.g., a SQL query to find all customers without email addresses, will specifically look for the NULL value in that field. Data with NULLs in be indexed, same as any other data. You can even create a filtered index that goes straight to the NULL data:

CREATE INDEX IX_Customer_EmailIsNULL
  ON dbo.Customer(Email)
  WHERE Email IS NULL

So a NULL value is generally not something to be avoided, modified, or coded around. It is eminently useful, and a vital part of your data structures.


Part 2: Bad magic

Now, I started with part 1 because of a common pattern I see used in data capture, usually due to novice / junior / misguided developers. An example: I have a database table of addresses (called Address), with the usual fields. My company operates strictly within the UK, so in an effort to keep our data as clean as possible, we have a CHECK constraint on the Postcode field; not a foreign key to a table of all postcodes ever (who wants to maintain that??), but a simple check against the UK postcode format. The check will prevent entries like “unknown”, or mistakes like “SW1A IAA” (‘I’ instead of ‘1’). Also, the postcode is ‘NOT NULL’-able — because every address has a postcode, right?

It might look like this:

CREATE TABLE xyz.[Address]
(
	AddressID INT NOT NULL PRIMARY KEY
	,Line1 VARCHAR(255) NOT NULL
	,Line2 VARCHAR(255) NULL
	,Postcode VARCHAR(10) NOT NULL
		CHECK (Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
		OR Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]')
)

(Clearly the CHECK constraint isn’t exhaustive: as it is, it’ll reject SW1A 1AA, the postcode of Buckingham Palace. It’ll do for illustrating the point.)

If customer data is supplied without a postcode, then any INSERT will fail. What tends to happen, is that over time, you’ll see the Postcode field start to contain values like ZZ1 1ZZ; a value that passes our simple CHECK constraint rules, but is probably not a valid UK postcode.

So how did ZZ1 1ZZ get into the database?

Scenario 1a:

The developer coding the application form tried to INSERT a record with no postcode, thus the operation failed with an error. So in the input form, they put some code to change a blank postcode to ZZ1 1ZZ when INSERT-ing.

Scenario 1b:

The customer input form hooks up to an address validator; if the address cannot be validated, then the customer is asked to fill in all the fields themself, and can easily put in an invalid postcode which doesn’t make it past the simple check constraint on the Address table. The developer dutifully catches the error, changes the postcode to ZZ1 1ZZ and re-INSERTs.

Scenario 2:

A customer complained about being marketed to, and needs to be removed from the database as soon as possible. To do it properly would mean changing code in several systems; the quick hack is to change their postcode to ZZ1 1ZZ, then make sure the mail-out query ignores records with that postcode value. This is then adopted as semi-official practice: “To remove a customer from marketing, just set their postcode to ZZ1 1ZZ.”

There are multiple problems with having a postcode of ZZ1 1ZZ meaning ‘unknown’, ‘error’ or ‘do not contact’:

  1. It’s a ‘magic’ string; for it to have system-wide meaning, every single system must understand it, and what it represents. What if someone INSERT-ed ZZ2 2ZZ? It wouldn’t be similarly understood, it would be treated as a real postcode.
  2. Every new developer and analyst has to be told about the magic string. What if there’s a magic string for every piece of data? Ok, it could be solved by using VIEWs, but then that’s more code that has to be known about, and scrupulously maintained.
  3. What if, by some mistake, post is sent out to that postcode? (This will happen, I guarantee it.) One of your other systems is likely recording the fact that mail has been sent correctly, but the chances of it arriving are slim.
  4. The real postcode ZZ1 1ZZ might not exist now, but it may in the future: there are many examples of postcodes falling into and out of use. How will you know if your postcode is genuine, or a magic string? Take note: postcodes that start ZZ99 are real live NHS “pseudo-postcodes”…

As you’ve probably realised, my answer would be to make the postcode field NULL-able(*), and to INSERT a NULL in the case of missing or broken data, completely avoiding any magic strings. It needs no special handling, and contextually, it very probably has a limited range of well-understood meanings; e.g. if you see a field MiddleName that is NULL for some records, you would presume it to mean the Customer has no middle name.

Note this is why in the email example in Part 1, we shouldn’t use a blank string instead of a NULL – because a blank string is still a ‘magic’ string, just one that would happen to be widely understood. There will be cases when a blank string legitimately means something quite different to a NULL.

(*) I’ve heard people claim that fields with CHECK constraints can’t be NULL-able. In modern flavours of SQL Server, this is demonstrably false. If the field is NULL, the constraint just isn’t checked.


Part 3: Keeping track

By clearing up one locus of ambiguity, I’m afraid I’m going to introduce a new one. Presumably, we’re going to want to record why our postcode field is NULL. We can either:

(A) Create a new lookup table, ReasonForNull (say); add a new field, ReasonForNullID, to our Address table, add a suitable foreign key, and a CHECK constraint that says “if the Postcode is NULL, then ReasonForNullID must not be NULL – and vice versa”, e.g.:

ALTER TABLE xyz.[Address]
ADD CONSTRAINT CK_Address_PostcodeOrReason 
CHECK( (Postcode IS NOT NULL AND ReasonForNullID IS NULL)
	OR (Postcode IS NULL AND ReasonForNullID IS NOT NULL)
)

or

(B) Create our new lookup table (as above), but also create another new table, Address_ReasonForNull, like so:

CREATE TABLE xyz.Address_ReasonForNull
(
	AddressID INT NOT NULL
		CONSTRAINT PK_Address_ReasonForNull
		PRIMARY KEY CLUSTERED
	,ReasonForNullID TINYINT NOT NULL
	,CreatedOn DATETIME NOT NULL
		CONSTRAINT DF_Address_ReasonForNull_CreatedOn
		DEFAULT(GETDATE())
	,CONSTRAINT FK_Address_ReasonForNull_AddressID
		FOREIGN KEY (AddressID)
		REFERENCES xyz.Address(AddressID)
	,CONSTRAINT FK_Address_ReasonForNull_ReasonForNullID
		FOREIGN KEY (ReasonForNullID)
		REFERENCES xyz.ReasonForNull(ReasonForNullID)
)

and only INSERT into it when we have an invalid postcode.

Neither (A) nor (B) is a perfect solution. (A) will waste a byte per Address record (if ReasonForNullID is declared as a TINYINT) if the postcode is ok, but has the advantage of strictly maintaining integrity, thanks to the CHECK constraint. (B) wastes no space, but there is no simple way (that I know of) of enforcing that a child record must exist, given data in the parent record.

If we want to record, say, the postcode that was entered but not validated, then it’s no bother under scenario (B) to add a new field to our Address_ReasonForNull table:

ALTER TABLE xyz.Address_ReasonForNull
	ADD OriginalData VARCHAR(20) NULL

However, if we were doing (A), then we’d have to add this column to the main Address table (and change the CHECK constraint); potentially, we could waste a lot of space.

Personally, I’d favour (B), and would push for all data changes to be made via stored procedures (aka sprocs). That way, I can ensure that the data in my two tables is kept perfectly in sync.

Any thoughts or comments? Feel free to let us know!

Advertisements

, , , ,

Leave a comment

People aren’t people: when matching them can be hard

I’m going to spend a few posts on a subject that is, frankly, the bane of my life: people matching. That is, given two sets of person-related details, do I believe they are the same person? It’s eminently useful for many things including keeping marketing costs down, improving customer service, and very importantly, preventing fraud.

If your dataset(s) contain a unique person key, e.g. Social Security Number in the USA, or National Insurance Number here in the UK, then the task is obviously pretty simple (barring errors in the data). If there’s no unique person key, you’ve got a great deal more work to do. I’d say it follows a 95 / 5 rule: to match the first 95% of your dataset takes 5% of the time, the 5% that’s left takes the remaining 95% of the time. (Hence why it causes me grief: you can end up writing reams of code to match a handful of details, in a never-ending quest for greater accuracy!)

Before I start discussing how I’d do people matching in a “perfect world” scenario, I’m going to list some of the problems I’ve encountered when trying to match data from UK sources.

Names

  • Shortened or alternative forms of the first name: e.g. Bill / William, Peggy / Margaret, Jack / John. And these days, Alfie probably isn’t short for Alfred, just as Harry probably isn’t short for Harold (or even a boy’s name).
  • As per the above, I wouldn’t ever assume a particular first name implies a gender; you’ll be wrong at some point, and an awkward conversation might ensue.
  • Using middle names as first names; famous examples include Hannah Dakota Fanning, William Bradley Pitt, Walter Bruce Willis, James Paul McCartney, Laura Jeanne Reese Witherspoon.
  • Married names, people taking their spouse’s last name, without any restrictions on gender.
  • Double-barrelling last names with spouse or partner.
  • Very common names – names like ‘George Smith’ and ‘Claire Wilson’ mean placing more reliance on other pieces of information when matching.

Titles

  • In my experience, Mr/Ms/Miss/Mrs etc. are rarely correct enough to rely on to indicate gender or married status*, even when the primary source is data the customer has entered themselves. Also, the gender-neutral Mx is becoming increasingly common.
  • Let’s not even get into the realms of Professor, Doctor, Lord/Lady, Reverend and assorted military titles…

* Using gender and married status purely as aids to matching people, nothing else.

Dates of birth

It’s very easy to get the date of birth wrong with mis-typing, or getting the month and day the wrong way round. Also, people (a) don’t like to give their birthdate out, so may give a dummy one (1st Jan 1970 is common), or (b) will lie about their age if they think it improves their chances of obtaining a product or service.

People with “non-traditionally British” names

  • People from other countries adopting a Western-style first name alongside their traditional birth-name (e.g. Chinese people).
  • First names / family names may not be in the expected order (again, e.g. Chinese).
  • Names that have more than one translation into English, e.g. Mohammed / Muhammad / Mohamed.
  • Different character sets! Greek, Cyrillic, Arabic, etc.

(“Non-traditionally British” is an ugly turn of phrase, there must be a better way of putting it…)

Family

  • Fathers and sons with exactly the same first, middle and last names. (Far more common than you’d think!)
  • Twins; especially twins with very similar first names (Mia/Mya, Ethan/Evan).
  • You can’t reliably infer relationships using only differences in age; two customers from the same family, 32 years apart, could potentially be siblings, parent/child, or even grandparent/grandchild.

Addresses

  • Living at more than one address; in particular, students living away from home.
  • Moving house, sometimes within the same postcode, or even next door.
  • Postcodes not existing yet on the Postcode Address File, although you may find them on Google Maps(!)
  • Postcodes becoming invalid / retired, e.g. postcodes in the districts BS12, BS17-19.
  • Postcodes becoming valid: the district E20 was previously used only for the fictional TV soap Eastenders, but postcodes in this district have now started to be allocated for real addresses.
  • Roads can be renamed [BBC]
  • Buildings can be split into flats.
  • Different naming conventions; flats in Scotland can be named by floor number / flat number, e.g. 2/1 (2nd floor, 1st flat).

Some address-related problems can be solved by using the Unique Property Reference Number (UPRN) or the Unique Delivery Point Reference Number (UDPRN) to represent the address, but neither of these has widespread adoption yet.

Email addresses

  • Having more than one email address.
  • Labels, e.g. fred.smith+SPAM@mailbox.com and fred.smith+NOTSPAM@mailbox.com. The canonical version of the email address would be fred.smith@mailbox.com, which may be more useful for matching purposes.
  • Temporary inboxes, e.g. Mailinator.
  • Format: Validating the syntax of 99% of email addresses is straightforward, getting the full 100% is almost impossible. See here [wikipedia] for a brief explanation about which characters are allowed in an email address.

Mobiles

Home phone numbers

  • Having more than one home phone number.
  • Not having a phone number, but entering one belonging to a friend or relative.
  • Not having a phone number, so using the number of a local taxi firm, public house, or fast-food restaurant (again, more common than you might think).

Bank accounts

  • Having more than one bank account
  • People not moving their bank accounts when they move house. (I live 80 miles away from my nominal branch.)
  • Sort codes changing, especially when banks merge or split.
  • Joint bank accounts
  • Business bank accounts

Debit and credit cards

You almost certainly shouldn’t be storing card details…! [www.theukcardsassociation.org.uk]

Incorrect details

  • Accidental mis-typing
  • Deliberate fraud – typically, the name and address might be real, but the mobile and email will be the fraudster’s.
  • System-testing : internal (dev or UAT environment) vs. external (penetration testing), manual/automated, regular (e.g. employees) / irregular (e.g. competitors testing capabilities; hackers!)
  • Details not existing: some people don’t have home telephone numbers (so put their mobile number in that field instead), whereas other people don’t have mobiles (so they put their home number instead).
  • People just messing around, possibly not-very-maliciously.

Other

  • Older people using younger family members’ email addresses and/or mobile numbers.
  • People who work overseas and have non-UK mobile number and address; they could be a valid customer, as per your policies, but with only non-UK contact details. Do your systems accept a phone number that doesn’t start +44?
  • Driving License / Passport : most existing systems only validate the format of the identifying numbers, which makes them a target for fraudsters. Newer systems can validate images of the documents.
  • Device IDs are great for fraud detection, but can present a problem when matching people; families often share devices, and what about public computers in libraries and internet cafes?
  • Electoral Roll: Being on the full electoral roll at an address is no guarantee that the person is living there, and the converse is also true.

Third-party services exist to validate/verify almost all the information above, singularly and together. However, none of the services are perfect, so matching person-level data comes down to cost (third party data and development time), and your tolerance for mistakes – how embarrassing might it be if you get it wrong?

If you have any examples of when matching personal details has proved trickier than you thought it was going to be, please let me know in the comments below!

, , , , , ,

Leave a comment