Archive for category systems

Security Through Obscurity

In my previous post, I wrote about GUIDs (aka UUIDs), marvellous datatypes that allow you to uniquely identify a piece of information. Because the probability of randomly generating two the same is so small, there’s a sense that merely possessing one as a key to a piece of data, means the data is somehow secured. This is false, in two senses — the first sense, not so critical; the second, very critical.

Before that, let’s look at how they might be used. I’ll use URLs / webpages as an example, because it’s a common application.

Let’s say I’ve written my own blogging system. The text of each article is stored in a database table, and the primary key of the table uniquely identifies my blog posts. When I list my blog posts so people can click on them, the URL might look like this:

The unique identifier — in this example, the number 17 — forms part of the URL, and someone can look at all my blog posts merely by changing the number after the equals sign. It’s ok, it’s only a blog; the worst they can do is put in a number of a post I haven’t published yet. Depending on how I’ve written my blogging engine, they’ll either see an unfinished post, or (preferably) a message saying there’s currently no live article with that id.

In the bad old days of internet website development, lesser developers used integer IDs to refer to data more sensitive than a blog post:

(Of course, I’m making these URLs up!) The website might ask you to log into some part of the site via a secure method, but once you were in, you’d be able to view the records of anyone on the system! Clearly a terrible thing, but this type of security hole has always existed; here’s an example from a few years ago, concerning Apple’s password reset process.

But it’s ok, we know about GUIDs now! GUIDs to the rescue! If the URL looks like this:

, and the website retrieves each patient’s private information via a GUID, then we’re perfectly safe, no?

No, not safe (sense 1)

In the sense that a GUID is ‘hard for a human to guess’, yes, you’re pretty safe. No-one is going to randomly type in some digits and hit upon a valid GUID. BUT:

1. What if it’s not someone typing? What if it’s a piece of software making thousands of guesses per second? What if it’s a distributed botnet, millions of pieces of software each making thousands of guesses per second? * And your site has millions of users? The probability of making at least one correct guess would not be negligible.

[* Presuming your server can cope with all those requests…]

2. What if your GUID isn’t all that random? There are people out there who, given enough examples, can derive information about your server’s random number generator, and make accurate guesses about historical and future GUIDs. The situation is worse still if you’ve used sequential GUIDs; and potentially catastrophic if the developer who’s written the code hasn’t really understood the point of them (see previous post).

Point 1 can be addressed by putting checks in place for incorrect guesses. Modern server set-ups will allow you to block IP addresses after unusual patterns of requests have been detected.

Point 2 is harder to address: how truly random is the code that generates your GUIDs? In practice, you just don’t know. The study of random number generation is an entire academic discipline of study, you could devote the rest of your life to assessing various random number generators!

But the above doesn’t matter because…

No, not safe (sense 2)

This is the crux of the matter: obscurity (hiding something) is not security. All it does is make something harder to locate, not impossible. Just because you can’t guess a GUID, it doesn’t mean there aren’t other ways of obtaining them, e.g.:

  • Hacking
  • Leaking (e.g. via accidental email forwarding)
  • Social engineering
  • Mistakes (e.g. devs ‘temporarily’ storing them in webpage HTML)

The point is that proper security needs to be applied on top, in all cases. Where sensitive information is concerned, people should be logging in securely, with as robust a system as it’s possible to use. So even if a bad person stumbles across information (e.g. GUIDs) they shouldn’t have, they can’t use it, because they don’t have the requisite access.

In conclusion

To sum up, using a GUID on its own is nowhere near adequate-enough security, there’s much more to be done. A final tip: Given that there’s no such thing as “100% secure”, your goal should be to record all user activity on your site / app, and dive into it regularly to check for breaches or unusual patterns. As a bonus, you get to see how users are really using your software — I promise there’ll be some surprises in there!


, ,

Leave a comment


I’m a big fan of passing data to and from stored procedures (sprocs) as XML, especially XML that represents a complete object, or a list or hierarchy of objects. For a start, XML is perfectly human-readable (if you’re doing it right), and nearly every system and language knows how to work with it, SQL Server / TSQL included. What makes it even better, is being able to validate the XML before you even begin to parse it, using an XSD (XML Schema Definition).

Here’s a complete example you can copy and run:

USE tempdb

-- If the XSD already exists, drop it:

    FROM sys.xml_schema_collections xsc

-- Create the schema:

<xsd:schema xmlns:xsd="">

  <xsd:simpleType name="ST_EmailAddress">
      <xsd:restriction base="xsd:string">
      <xsd:pattern value="[^@]*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9}"/>

  <xsd:simpleType name="ST_Usage">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="home"/>
      <xsd:enumeration value="work"/>
      <xsd:enumeration value="other"/>

  <xsd:complexType name="CT_EmailAndUsage">
      <xsd:extension base="ST_EmailAddress">
        <xsd:attribute name="usage" use="required" type="ST_Usage" />

  <xsd:element name="emailList">
        <xsd:element name="email" type="CT_EmailAndUsage" minOccurs="1" maxOccurs="3" />


-- Make some dummy data that conforms to the schema above:

DECLARE @testXML AS XML(TestSchema)

SET @testXML = '
  <email usage="home"></email>
  <email usage="work"></email>
  <email usage="other"></email>

-- Query it:

    id = ROW_NUMBER() OVER (ORDER BY e.i)
    ,EmailAddress = e.i.value('(.)[1]','VARCHAR(255)')
    ,Usage = e.i.value('(@usage)[1]', 'VARCHAR(20)')
  FROM @testXML.nodes('//email') AS e(i)

The result set is:

id   EmailAddress    Usage
---  --------------  ------
1   home
2   work
3  other

(3 row(s) affected)

Now, try messing around with the contents of the @testXML variable, e.g.:

  1. Set usage to a string that’s not ‘home’, ‘work’ or ‘other’
  2. Add a fourth email address
  3. Take the ‘@’ symbol out of an email address
  4. Put in some extra nodes that don’t belong

,then re-run the code. They all fail, because the XML has to conform to the XSD we’ve defined as TestSchema. So, SQL Server automatically rejects any input that fails data validation (e.g. format of email address) or breaks business logic (‘no more than three emails’); if the XML was being passed to a sproc, the call would fail, and no code inside would ever run.

Obviously, you may not want to automatically reject ‘broken’ XML, you’ll probably want to record this fact. That’s fine – your code (sproc) can accept a schema-less XML, and attempt the cast itself; and if it fails, you can respond however you like.

There’s certainly an overhead in learning the language of XSDs, but because it’s a generic technology, there are plenty of online resources, e.g. w3schools. When it comes to transferring complex objects around as data, I don’t know of a better way than using XML and XSD.


Because Microsoft haven’t got round to coding it yet, you can’t query the text value of a node that’s been defined as a type in XSD. That is, I’d ordinarily like to be able to query the email address itself like this:

EmailAddress = e.i.value('(./text())[1]','VARCHAR(255)')

, because directly accessing the node text is faster (presumably because it doesn’t have to do any more parsing). But sadly, it’ll just fail with an error. However, this is unlikely to cause practical problems, it’s just a mild annoyance that’s vastly outweighed by the benefits that come from validated XML.

, ,

Leave a comment

The meaning of NULL, and why magic data is bad

Part 1: NULL

As a junior web developer, I remember other developers warning me about database NULLs: “you don’t really want to deal with them”, “code them out if you can”, “program around them”, “turn them into something else”. Dear reader, they were quite wrong.

For a piece of data, a NULL value can mean any of:

  1. We don’t know this.
  2. We don’t know this yet (but there’s an expectation we’ll know this at a later date).
  3. We don’t know this, because the question of “What is the value of X for object Y?” is not applicable here.
  4. We don’t know this, and the chances of us ever knowing it are practically zero.
  5. It is logically impossible for this data to exist.

Context usually clues us into which meaning of NULL we’re looking at; if a customer’s email address field is NULL, then in terms of the above:

  1. It hasn’t been asked for / provided / collected.
  2. It hasn’t been asked for / provided / collected yet, but we might be getting this data in the future.
  3. The customer doesn’t have an email address.
  4. The rest of the customer’s data is incorrect or missing, so we have no means of contacting them to find their email address.
  5. The ‘customer’ isn’t something capable of owning an email address(*).

Regardless, if a customer doesn’t have an email address in the system, any code that consumes customer data will have to cope in a sensible manner. If the code is a data entry form, then an empty text field will be displayed; but if the code does marketing mail-outs, then it’ll just have to skip that record.

(*) It could be that the table is ‘multi-use’, and the field makes no sense for some types of data.

Going back to meaning (5), a couple of better examples might be:

  • ‘O’ level GCSE results (at age 16):   My age cohort did GCEs, therefore it is logically impossible for anyone to ascertain my GCSE results.
  • Date of last gynaecological exam:   Clearly, this would never be applicable for anyone born genetically male.

(In multivariate analysis, these would be referred to as structural zeros, rather than the usual sampling zeros. “It was impossible for it to occur” vs. “We did not see it occur”.)

Despite NULL being the very embodiment of “no information”, sometimes “no information” is the information in itself! Trivially, e.g., a SQL query to find all customers without email addresses, will specifically look for the NULL value in that field. Data with NULLs in be indexed, same as any other data. You can even create a filtered index that goes straight to the NULL data:

  ON dbo.Customer(Email)

So a NULL value is generally not something to be avoided, modified, or coded around. It is eminently useful, and a vital part of your data structures.

Part 2: Bad magic

Now, I started with part 1 because of a common pattern I see used in data capture, usually due to novice / junior / misguided developers. An example: I have a database table of addresses (called Address), with the usual fields. My company operates strictly within the UK, so in an effort to keep our data as clean as possible, we have a CHECK constraint on the Postcode field; not a foreign key to a table of all postcodes ever (who wants to maintain that??), but a simple check against the UK postcode format. The check will prevent entries like “unknown”, or mistakes like “SW1A IAA” (‘I’ instead of ‘1’). Also, the postcode is ‘NOT NULL’-able — because every address has a postcode, right?

It might look like this:

CREATE TABLE xyz.[Address]
	,Line2 VARCHAR(255) NULL
	,Postcode VARCHAR(10) NOT NULL
		CHECK (Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
		OR Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]')

(Clearly the CHECK constraint isn’t exhaustive: as it is, it’ll reject SW1A 1AA, the postcode of Buckingham Palace. It’ll do for illustrating the point.)

If customer data is supplied without a postcode, then any INSERT will fail. What tends to happen, is that over time, you’ll see the Postcode field start to contain values like ZZ1 1ZZ; a value that passes our simple CHECK constraint rules, but is probably not a valid UK postcode.

So how did ZZ1 1ZZ get into the database?

Scenario 1a:

The developer coding the application form tried to INSERT a record with no postcode, thus the operation failed with an error. So in the input form, they put some code to change a blank postcode to ZZ1 1ZZ when INSERT-ing.

Scenario 1b:

The customer input form hooks up to an address validator; if the address cannot be validated, then the customer is asked to fill in all the fields themself, and can easily put in an invalid postcode which doesn’t make it past the simple check constraint on the Address table. The developer dutifully catches the error, changes the postcode to ZZ1 1ZZ and re-INSERTs.

Scenario 2:

A customer complained about being marketed to, and needs to be removed from the database as soon as possible. To do it properly would mean changing code in several systems; the quick hack is to change their postcode to ZZ1 1ZZ, then make sure the mail-out query ignores records with that postcode value. This is then adopted as semi-official practice: “To remove a customer from marketing, just set their postcode to ZZ1 1ZZ.”

There are multiple problems with having a postcode of ZZ1 1ZZ meaning ‘unknown’, ‘error’ or ‘do not contact’:

  1. It’s a ‘magic’ string; for it to have system-wide meaning, every single system must understand it, and what it represents. What if someone INSERT-ed ZZ2 2ZZ? It wouldn’t be similarly understood, it would be treated as a real postcode.
  2. Every new developer and analyst has to be told about the magic string. What if there’s a magic string for every piece of data? Ok, it could be solved by using VIEWs, but then that’s more code that has to be known about, and scrupulously maintained.
  3. What if, by some mistake, post is sent out to that postcode? (This will happen, I guarantee it.) One of your other systems is likely recording the fact that mail has been sent correctly, but the chances of it arriving are slim.
  4. The real postcode ZZ1 1ZZ might not exist now, but it may in the future: there are many examples of postcodes falling into and out of use. How will you know if your postcode is genuine, or a magic string? Take note: postcodes that start ZZ99 are real live NHS “pseudo-postcodes”…

As you’ve probably realised, my answer would be to make the postcode field NULL-able(*), and to INSERT a NULL in the case of missing or broken data, completely avoiding any magic strings. It needs no special handling, and contextually, it very probably has a limited range of well-understood meanings; e.g. if you see a field MiddleName that is NULL for some records, you would presume it to mean the Customer has no middle name.

Note this is why in the email example in Part 1, we shouldn’t use a blank string instead of a NULL – because a blank string is still a ‘magic’ string, just one that would happen to be widely understood. There will be cases when a blank string legitimately means something quite different to a NULL.

(*) I’ve heard people claim that fields with CHECK constraints can’t be NULL-able. In modern flavours of SQL Server, this is demonstrably false. If the field is NULL, the constraint just isn’t checked.

Part 3: Keeping track

By clearing up one locus of ambiguity, I’m afraid I’m going to introduce a new one. Presumably, we’re going to want to record why our postcode field is NULL. We can either:

(A) Create a new lookup table, ReasonForNull (say); add a new field, ReasonForNullID, to our Address table, add a suitable foreign key, and a CHECK constraint that says “if the Postcode is NULL, then ReasonForNullID must not be NULL – and vice versa”, e.g.:

ALTER TABLE xyz.[Address]
ADD CONSTRAINT CK_Address_PostcodeOrReason 
	OR (Postcode IS NULL AND ReasonForNullID IS NOT NULL)


(B) Create our new lookup table (as above), but also create another new table, Address_ReasonForNull, like so:

CREATE TABLE xyz.Address_ReasonForNull
		CONSTRAINT PK_Address_ReasonForNull
		CONSTRAINT DF_Address_ReasonForNull_CreatedOn
	,CONSTRAINT FK_Address_ReasonForNull_AddressID
		REFERENCES xyz.Address(AddressID)
	,CONSTRAINT FK_Address_ReasonForNull_ReasonForNullID
		FOREIGN KEY (ReasonForNullID)
		REFERENCES xyz.ReasonForNull(ReasonForNullID)

and only INSERT into it when we have an invalid postcode.

Neither (A) nor (B) is a perfect solution. (A) will waste a byte per Address record (if ReasonForNullID is declared as a TINYINT) if the postcode is ok, but has the advantage of strictly maintaining integrity, thanks to the CHECK constraint. (B) wastes no space, but there is no simple way (that I know of) of enforcing that a child record must exist, given data in the parent record.

If we want to record, say, the postcode that was entered but not validated, then it’s no bother under scenario (B) to add a new field to our Address_ReasonForNull table:

ALTER TABLE xyz.Address_ReasonForNull
	ADD OriginalData VARCHAR(20) NULL

However, if we were doing (A), then we’d have to add this column to the main Address table (and change the CHECK constraint); potentially, we could waste a lot of space.

Personally, I’d favour (B), and would push for all data changes to be made via stored procedures (aka sprocs). That way, I can ensure that the data in my two tables is kept perfectly in sync.

Any thoughts or comments? Feel free to let us know!

, , , ,

Leave a comment

The Request / Response pattern

Here’s something I come across too often: how do we deal with sending data to and receiving data from external systems? I don’t mean the format of the data, or the transmission mechanism; but simply, how do we keep a record of what occurred? (And I use the word ‘external’ here in a vague sense; it could mean external to your company, external to the current app domain, or just something that you want to keep separate.)

Let’s pretend our system needs to call out to Amazon to query the best price for a book: we send the Amazon service an ISBN, and it returns a number that represents the price.

The Novice Developer (TND) has created a service, underneath which is a table that looks like this:

CREATE TABLE dbo.BookPrice

(TND probably uses Entity Framework or similar to talk to the database, but I’ll pretend that the usual SQL statements are being used.)

To get a price, the service sends the ISBN to Amazon, gets a response, and INSERTs into the table:

INSERT dbo.BookPrice(ISBN, Price)
	VALUES('9780575082076', 8.99)

, or does an UPDATE if the record already exists. So far, so good. Next time TND runs the service, there’s a problem; the book is out of stock, so Amazon doesn’t return a price. The code does this:

UPDATE dbo.BookPrice
	SET LatestPrice = NULL
	WHERE ISBN = '9780575082076'

An analyst looks at the data, and asks, “Well, why’s that price NULL?”. TND explains, and the analyst points out that there’s no way of telling what a NULL means. It could be any of:

  • The service never attempted to call out to Amazon
  • The service tried to make the call, but there was a network problem
  • The service made the call correctly, but got an error back from Amazon

A manager overhears the conversation, and asks the TND to create a scheduled task that retries calls that didn’t return a price. To facilitate this, the developer changes the table:

ALTER TABLE dbo.BookPrice
	ADD LatestStatus VARCHAR(255) NULL

Now, the UPDATE becomes:

UPDATE dbo.BookPrice 
	SET LatestPrice = NULL
		,LatestStatus = 'Out of stock'
		,LastOKPrice = 8.99
	WHERE ISBN = '9780575082076'

To get the latest price, TND tells the analyst to just run the following query:

SELECT ActualLatestPrice = ISNULL(LatestPrice, LastOKPrice)
	FROM dbo.BookPrice
	WHERE ISBN = '9780575082076'

Sometime later, TND’s service requests the latest price, and the network is down. The service notes this accordingly:

UPDATE dbo.BookPrice
	SET LatestStatus = 'Network problem'
	WHERE ISBN = '9780575082076'

The analyst then points out that the company has SLAs and needs to know if external service calls are failing. TND thinks, and adds a new column:

ALTER TABLE dbo.BookPrice
	ADD NetworkProblemCount INT NULL

and UPDATEs the table again:

UPDATE dbo.BookPrice
	SET NetworkProblemCount = 1
	WHERE ISBN = '9780575082076'

“But we need to know when it happened”, says the analyst. So TND thinks again, and adds another new column:

ALTER TABLE dbo.BookPrice
	ADD LastNetworkProblemDate DATETIME NULL

and UPDATEs the table again:

UPDATE dbo.BookPrice
	SET LastNetworkProblemDate = GETDATE()
	WHERE ISBN = '9780575082076'

Back comes the analyst, “But when was the last time we actually successfully retrieved a price?”, so TND adds another column:

ALTER TABLE dbo.BookPrice
	ADD LastSuccessfulPriceRetrievalDate DATETIME NULL

… and hopefully by this point, you can agree with me that continually adding columns is not the way we should be doing this!

The most straightforward way I know of to record interactions with other systems, is via Request and Response tables. Take a look at the following DDL:

CREATE TABLE dbo.BookPrice_Request
		CONSTRAINT DF_BookPrice_Request_CreatedOn DEFAULT(GETDATE())

CREATE TABLE dbo.BookPrice_ResponseType
	,[Description] VARCHAR(255) NOT NULL
		CONSTRAINT DF_BookPrice_ResponseType_CreatedOn DEFAULT(GETDATE())
INSERT dbo.BookPrice_ResponseType(ResponseTypeID, [Description])
	VALUES(10, 'Price obtained successfully')
		, (20, 'Out of stock')
		, (30, 'Network error')

CREATE TABLE dbo.BookPrice_Response
		CONSTRAINT DF_BookPrice_Response_CreatedOn DEFAULT(GETDATE())
	,CONSTRAINT FK_BookPrice_Response_RequestID
		REFERENCES dbo.BookPrice_Request(RequestID)
	,CONSTRAINT FK_BookPrice_Response_ResponseTypeID
		FOREIGN KEY (ResponseTypeID)
		REFERENCES dbo.BookPrice_ResponseType(ResponseTypeID)

You can run the code above in a test database, and it’ll create three tables: a Request, a Response, and a ResponseType. When our internal service receives a request for a book price, it INSERTs into the Request table. When a reply is received from Amazon, we INSERT into the Response table, with (a) the RequestID we generated earlier, and (b) an indication of what type of response we received.

To recreate what was happening with TND’s service, run the following SQL:

-- Call #1
INSERT dbo.BookPrice_Request(ISBN)
-- The calling code would get the newly-INSERTed RequestID from SCOPE_IDENTITY()
-- This is the first INSERT into the table, so the returned RequestID will be 1.
-- ... our service gets the response from Amazon...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(1, 10, 8.99)

-- Call #2
INSERT dbo.BookPrice_Request(ISBN)
-- the RequestID returned is 2
-- ... our service gets the response 'Out of stock' from Amazon...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(2, 20, NULL)

-- Call #3
INSERT dbo.BookPrice_Request(ISBN)
-- the RequestID returned is 3
-- ... our service call fails, there's a problem with the network...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(3, 30, NULL)

Let’s say a fourth service call was completely successful:

-- Call #4
INSERT dbo.BookPrice_Request(ISBN)
-- the RequestID returned is 4
-- ... our service gets the response from Amazon...
INSERT dbo.BookPrice_Response(RequestID,ResponseTypeID,Price)
	VALUES(4, 10, 6.99) -- it's on offer at a reduced price

Then to get the latest price for a book (and when that price was obtained), the query is simply:

	TOP 1
	FROM dbo.BookPrice_Response rp
	JOIN dbo.BookPrice_Request rq
		ON rq.RequestID = rp.RequestID
	WHERE rq.ISBN = '9780575082076'
	AND rp.ResponseTypeID = 10
	ORDER BY rp.ResponseID DESC

Our Request and Response tables have a complete record of everything that has happened in our service, and it’s simple for the analyst to build reports around how efficient the service is, and if it’s working correctly.

Before I go, a quick ‘gotcha’ – it’s fresh in my mind because I noticed this in a particular service’s database a few days ago: every Request record had a Response. Now, while that’s certainly the perfect situation, it’s just not realistic; things go wrong. In this case, it turns out that the INSERT into the Request table was only being done at the end of the process, only when a Response had been successfully received. So errors were being returned from the 3rd party system, but the service code was bombing out – therefore we had no visibility of this in the database. It’s very important: the Request and Response have to be written back to the database as soon as they happen, without delay.

, ,

Leave a comment

People are People: matching them anyway

Following on from the previous post, here’s my own recipe for a people-matching system. I don’t go into all the gory technical details – the post would be five times the length – but I hope there’s enough here to get the theme across.


Let’s assume a few things:

  1. We have a large number of Person records (millions?), all from the UK.
  2. We believe that a significant proportion of these records are duplicates or redundant, as they represent the same people; but the details may differ.
  3. We only want to identify the records that match; we’re not looking to build a single all-encompassing representation of a person, wherein data from one record ‘enriches’ another. (More to come on that subject in a future post.)
  4. The quality of data in each Person record varies, maybe markedly; some records may be customers we have a long-standing relationship with, and are very confident of their details. But other records might come from less trusted or unverified sources, e.g. a survey on the website that promises a voucher in return for answering some questions and entering a few personal details; these details would clearly be less reliable.
  5. Our records have the following details (‘data points’):
    • First name
    • Last name
    • Date of birth
    • Address
    • Mobile phone number
    • Email address

    , plus a unique PersonID. We might have other data points available, e.g. home telephone number, employment details, bank account number / sort code, and we could absolutely use them for matching. For now, we’ll concentrate on the six above.

  6. Every Person record only has one set of details, that is, just the six pieces of information that we’ll use for matching. This restriction isn’t that realistic – people can have more than one mobile number or email address, and they move house, and occasionally change their names. Taking this into account in a system doesn’t significantly alter the structure of what’s being proposed here.

Canonical data

For each one of these data points, we want to generate and store what is known as the canonical version: that is, the version that is as standard as we can possibly make it. For names, the simplest thing to do is remove spaces, hyphens and apostrophes, which will help us match “Peter O’Toole” against “Peter Otoole”, or “Andrew Lloyd Webber” against “Andrew Lloyd-Webber”. Dates of birth don’t need changing, as long as they’re being stored in a sensible format (so in SQL we should be using DATE, and not VARCHAR). Phone numbers might be accepted in our system as “+442077319333” or “020 7731 9333”, but we need to translate them into one format – I’d go for the former.

Email addresses: As mentioned in the the previous post, email domains can be synonyms of each other, e.g. and Depending on your data, making domains canonical might be too much effort. Also, with some providers (including gmail) the local part of the address (the left-hand side of the @ symbol) can contain labels, so you’ll need to strip those out to make the canonical version.

Addresses: Making canonical versions of bricks-and-mortar addresses can be the biggest headache, which is why you’ll need an address validator: a piece of software that takes an unformatted address, and turns it into a formatted address. Examples of address validators include PCA Predict (formerly Postcode Anywhere) and Experian Data Quality (formerly QAS). The way they work is that if we present, e.g.



10 Downing Street
City Of Westminster

to the validator, then we’ll get the same result back:

10 Downing Street

If you pay for the extra data, then you can get the UDPRN (Unique Delivery Point Reference Number), which maps every postal address into an 8-digit code. If not, you can make your own loose version of it by hashing the text of the validated address, which you can then use for comparing.

From this point on, I’ll assume (a) that all our data points are canonical, and (b) there are no known ‘test’ people within our data – if there are, we should remove them from any matching process.

Excluding common data points

Next, we’ll need to consider generating lists of common and/or dummy data. For example, what are the most popular mobile numbers in the data?

In SQL, our query would look something like this:

	TOP 20
		Mobile, COUNT(1) AS Total
	FROM dbo.Person
	GROUP BY Mobile

If your data comes to you via third parties (e.g. affiliates, introducers, brokers), I should imagine you’ll see something like:


with some surprisingly large counts against each. You clearly can’t use these mobile numbers for matching; in fact, you should look at all the associated Person records, to see if it’s worth excluding everyone with obvious dummy numbers like these. But you’ll probably just want to exclude this data point from the match, not exclude the entire Person record.

Similarly, you’ll need to do the same for email addresses, e.g.,, etc.; and home telephone numbers, if you capture them. (As mentioned previously, when faced with a home telephone number field, some people like to enter phone numbers for their local Chinese restaurants, taxi firms and pubs.)

It’s not as necessary to identify common names and addresses, although you might see a few Joe Bloggs, John Smiths and David Camerons. [Brief aside: it’s always useful to have a table containing the headline details of famous people, so you can stop them getting into your system at the point of entry. It’s unlikely you’ll end up with the Prime Minister’s mobile number or email address in your database – but his name and address are no secret!]

Hard matches

We’re at the point where our data is as standard (canonical) as we can make it, and we have lists (tables) of data points we know we’re NOT going to be matching.

The next step is to ask: Given two sets of details, by what criteria would we confidently say “yes, these two people are the same”? A complete match would be:

First name, Last name, Date of birth, Address*, Mobile, Email, Home phone 

, i.e. every single piece of information – but that’s the most obvious case, and there are certainly subsets of these details that we’d be happy to accept.

(* Address meaning either a UDPRN or a hashed version of the text data returned from the validator.)

If the details of Person A matched those of Person B, for any set of the following details, I’d consider the match to be a hard match:

  • Set 1: First name, Last name, Postcode, Mobile, Email
  • Set 2: First name, Last name, Address, Mobile
  • Set 3: First name, Last name, Address, Email
  • Set 4: Last name, Date of birth, Postcode, Mobile, Email
  • Set 5: First name, Last name, Date of birth, Address
  • Set 6: Date of birth, Address, Mobile
  • Set 7: Date of birth, Address, Email

where Set 1 is the best, and Set 7 the ‘least best’. (You, and your data, of course may disagree.) So, every single piece of information in a set has to be identical between person A and person B for us to consider it a match.

Note that no set of details is a subset of any other; the larger set of details would be redundant.

Remember that, if at all possible, our criteria shouldn’t match the following people:

  • Twins
  • Parents and children with the exact same name
  • Older relatives borrowing mobile numbers and email addresses from other family members

For each Person record, we’ll make (up to) 7 separate hashes, one for each set of details, and store them in a new table. An MD5 hash takes up 16 bytes, so if we had a million records, it’s a maximum of 150MB in space (if we include numeric IDs for the Person and for the set of details).

Finding hard matches between Person records then becomes as simple as finding hashes with more than one PersonID entry in our table – a simple GROUP BY, which if indexed properly, will be lightning quick.

(NB: If we’ve decided that a piece of information in a Person record isn’t suitable for matching, due to being too common or an obvious dummy, then we can’t generate any hashes that include that piece of information; we can’t just put a blank in, the match would be too loose. That’s why it’s “(up to) 7” – not “7” – hashes.)

What next?

What about the rest of our data? We’ve three options:

  1. Do nothing. Depending on our requirements, this might be ok; in the grand scheme of things, what’s a handful of duplicate emails / letters / phone calls?
  2. Manual matching. Again, depending on our situation, this might be perfectly feasible. If there’s any human intervention (for example, a phone call) in our end-to-end process, then it could be straightforward to give your customer-facing employees a screen that says “We’ve found these sets of details that might be the same person, do you agree?”, then let them approve or reject the match.
  3. Scoring. Give various amounts of points to each part of the match criteria, depending on the strength of match between data points. If you like, you can consider this a soft match (cf. the hard matching above).
  4. The second and third options require that you’ve some reason to believe a match might exist in the first place, e.g. an email address or mobile number in common; or a combination of date of birth and postcode. With the right indexes on the data, it’s trivial to generate these candidate lists quickly. (Of course, you don’t need to attempt to match Person records you’ve already matched against each other.)


    The goal is to come up with a scoring mechanism such that likely matches get more ‘points’ than unlikely ones. Outside of some Very Hard Computer Science (probably involving neural networks), I don’t know of any standard ways of generating scorecards where the outcome is unknown – if the outcome is known, then standard methods apply (Google: Building a scorecard.) However, for this particular application, common sense should get us where we want to go.

    For string data like names (first or last), we can make an educated guess at a scoring mechanism, e.g.

    • Full match: 10 points
    • One letter out: 9 points
    • Name A is contained within B (or vice-versa), e.g. “Carter” would match “Knowles-Carter”: 7 points
    • First n letters of A = First n letters of B, where 1 ≤ n ≤ 4: n points
    • Last n letters of A = Last n letters of B, where 2 ≤ n ≤ 5: n-1 points
    • Both names are longer than 5 letters, and have 2 letters incorrect: 2 points

    I don’t know that the above is optimal, but it feels like a good starting point.

    How do we assess how far apart strings are? We need a function that, given two strings, calculates some notion of ‘distance’ between them. Such functions already exist, and the Levenshtein distance (the number of insert/replace edits needed to turn one string into another) is a popular choice.

    For non-string data like dates of birth, we’d start with something like:

    • Full match: 8 points
    • One day out: 7 points
    • Month / day swapped: 6 points
    • One month out: 5 points
    • One year out : 4 points
    • Day and month correct: 3 points

    etc. I’m sure you get the idea: the point is we have lots of options!

    Hence, we build up a scorecard that takes two sets of details and returns a single number, a score. We can apply this scoring function to every pair of details in our candidate list, and generate a score for all matches. Inspecting these matches by eye, it becomes fairly obvious what a good threshold for acceptance would be, and what changes to our scorecard are needed. Within a few iterations, we should have a scorecard and threshold that give us the balance we need between matching correct sets of details (where correctness is assessed by eye, by manually checking a sample), and not matching incorrect sets of details.

    The scorecards can get as complicated as you like: you can award extra points for an uncommon domain in the email addresses, or for having the same post town, or the same home phone number area code; you can subtract points for too many repeated digits in the mobile numbers, or having 1st January as the date of birth, or having a common name (‘John Smith’). The law of diminishing returns applies in spades – you have to determine whether the extra effort is warranted.

    One very important thought: you MUST record exactly how a match between two sets of details was made. Was it a hard match? If so, which one? If the match was via a scorecard, which scorecard was it? (You’ll end up with more than one.) What was the score? What was the threshold that it passed? If you don’t record this information, you can’t hope to reliably improve on the matching process in future.


    I’ve not finished on this subject yet, but hopefully this article gives you some ideas about building your own person-matching system, should you require one. I’ve yet to cover:

    1. Generating single person objects from matches
    2. How you practically use such objects within your applications

    I’ll come back to these topics soon.

    As ever, if you’ve any comments, I’d love to hear them!

, ,

Leave a comment

People aren’t people: when matching them can be hard

I’m going to spend a few posts on a subject that is, frankly, the bane of my life: people matching. That is, given two sets of person-related details, do I believe they are the same person? It’s eminently useful for many things including keeping marketing costs down, improving customer service, and very importantly, preventing fraud.

If your dataset(s) contain a unique person key, e.g. Social Security Number in the USA, or National Insurance Number here in the UK, then the task is obviously pretty simple (barring errors in the data). If there’s no unique person key, you’ve got a great deal more work to do. I’d say it follows a 95 / 5 rule: to match the first 95% of your dataset takes 5% of the time, the 5% that’s left takes the remaining 95% of the time. (Hence why it causes me grief: you can end up writing reams of code to match a handful of details, in a never-ending quest for greater accuracy!)

Before I start discussing how I’d do people matching in a “perfect world” scenario, I’m going to list some of the problems I’ve encountered when trying to match data from UK sources.


  • Shortened or alternative forms of the first name: e.g. Bill / William, Peggy / Margaret, Jack / John. And these days, Alfie probably isn’t short for Alfred, just as Harry probably isn’t short for Harold (or even a boy’s name).
  • As per the above, I wouldn’t ever assume a particular first name implies a gender; you’ll be wrong at some point, and an awkward conversation might ensue.
  • Using middle names as first names; famous examples include Hannah Dakota Fanning, William Bradley Pitt, Walter Bruce Willis, James Paul McCartney, Laura Jeanne Reese Witherspoon.
  • Married names, people taking their spouse’s last name, without any restrictions on gender.
  • Double-barrelling last names with spouse or partner.
  • Very common names – names like ‘George Smith’ and ‘Claire Wilson’ mean placing more reliance on other pieces of information when matching.


  • In my experience, Mr/Ms/Miss/Mrs etc. are rarely correct enough to rely on to indicate gender or married status*, even when the primary source is data the customer has entered themselves. Also, the gender-neutral Mx is becoming increasingly common.
  • Let’s not even get into the realms of Professor, Doctor, Lord/Lady, Reverend and assorted military titles…

* Using gender and married status purely as aids to matching people, nothing else.

Dates of birth

It’s very easy to get the date of birth wrong with mis-typing, or getting the month and day the wrong way round. Also, people (a) don’t like to give their birthdate out, so may give a dummy one (1st Jan 1970 is common), or (b) will lie about their age if they think it improves their chances of obtaining a product or service.

People with “non-traditionally British” names

  • People from other countries adopting a Western-style first name alongside their traditional birth-name (e.g. Chinese people).
  • First names / family names may not be in the expected order (again, e.g. Chinese).
  • Names that have more than one translation into English, e.g. Mohammed / Muhammad / Mohamed.
  • Different character sets! Greek, Cyrillic, Arabic, etc.

(“Non-traditionally British” is an ugly turn of phrase, there must be a better way of putting it…)


  • Fathers and sons with exactly the same first, middle and last names. (Far more common than you’d think!)
  • Twins; especially twins with very similar first names (Mia/Mya, Ethan/Evan).
  • You can’t reliably infer relationships using only differences in age; two customers from the same family, 32 years apart, could potentially be siblings, parent/child, or even grandparent/grandchild.


  • Living at more than one address; in particular, students living away from home.
  • Moving house, sometimes within the same postcode, or even next door.
  • Postcodes not existing yet on the Postcode Address File, although you may find them on Google Maps(!)
  • Postcodes becoming invalid / retired, e.g. postcodes in the districts BS12, BS17-19.
  • Postcodes becoming valid: the district E20 was previously used only for the fictional TV soap Eastenders, but postcodes in this district have now started to be allocated for real addresses.
  • Roads can be renamed [BBC]
  • Buildings can be split into flats.
  • Different naming conventions; flats in Scotland can be named by floor number / flat number, e.g. 2/1 (2nd floor, 1st flat).

Some address-related problems can be solved by using the Unique Property Reference Number (UPRN) or the Unique Delivery Point Reference Number (UDPRN) to represent the address, but neither of these has widespread adoption yet.

Email addresses

  • Having more than one email address.
  • Labels, e.g. and The canonical version of the email address would be, which may be more useful for matching purposes.
  • Temporary inboxes, e.g. Mailinator.
  • Format: Validating the syntax of 99% of email addresses is straightforward, getting the full 100% is almost impossible. See here [wikipedia] for a brief explanation about which characters are allowed in an email address.


Home phone numbers

  • Having more than one home phone number.
  • Not having a phone number, but entering one belonging to a friend or relative.
  • Not having a phone number, so using the number of a local taxi firm, public house, or fast-food restaurant (again, more common than you might think).

Bank accounts

  • Having more than one bank account
  • People not moving their bank accounts when they move house. (I live 80 miles away from my nominal branch.)
  • Sort codes changing, especially when banks merge or split.
  • Joint bank accounts
  • Business bank accounts

Debit and credit cards

You almost certainly shouldn’t be storing card details…! []

Incorrect details

  • Accidental mis-typing
  • Deliberate fraud – typically, the name and address might be real, but the mobile and email will be the fraudster’s.
  • System-testing : internal (dev or UAT environment) vs. external (penetration testing), manual/automated, regular (e.g. employees) / irregular (e.g. competitors testing capabilities; hackers!)
  • Details not existing: some people don’t have home telephone numbers (so put their mobile number in that field instead), whereas other people don’t have mobiles (so they put their home number instead).
  • People just messing around, possibly not-very-maliciously.


  • Older people using younger family members’ email addresses and/or mobile numbers.
  • People who work overseas and have non-UK mobile number and address; they could be a valid customer, as per your policies, but with only non-UK contact details. Do your systems accept a phone number that doesn’t start +44?
  • Driving License / Passport : most existing systems only validate the format of the identifying numbers, which makes them a target for fraudsters. Newer systems can validate images of the documents.
  • Device IDs are great for fraud detection, but can present a problem when matching people; families often share devices, and what about public computers in libraries and internet cafes?
  • Electoral Roll: Being on the full electoral roll at an address is no guarantee that the person is living there, and the converse is also true.

Third-party services exist to validate/verify almost all the information above, singularly and together. However, none of the services are perfect, so matching person-level data comes down to cost (third party data and development time), and your tolerance for mistakes – how embarrassing might it be if you get it wrong?

If you have any examples of when matching personal details has proved trickier than you thought it was going to be, please let me know in the comments below!

, , , , , ,

Leave a comment