Archive for category process
In my previous post, I wrote about GUIDs (aka UUIDs), marvellous datatypes that allow you to uniquely identify a piece of information. Because the probability of randomly generating two the same is so small, there’s a sense that merely possessing one as a key to a piece of data, means the data is somehow secured. This is false, in two senses — the first sense, not so critical; the second, very critical.
Before that, let’s look at how they might be used. I’ll use URLs / webpages as an example, because it’s a common application.
Let’s say I’ve written my own blogging system. The text of each article is stored in a database table, and the primary key of the table uniquely identifies my blog posts. When I list my blog posts so people can click on them, the URL might look like this:
The unique identifier — in this example, the number 17 — forms part of the URL, and someone can look at all my blog posts merely by changing the number after the equals sign. It’s ok, it’s only a blog; the worst they can do is put in a number of a post I haven’t published yet. Depending on how I’ve written my blogging engine, they’ll either see an unfinished post, or (preferably) a message saying there’s currently no live article with that id.
In the bad old days of internet website development, lesser developers used integer IDs to refer to data more sensitive than a blog post:
(Of course, I’m making these URLs up!) The website might ask you to log into some part of the site via a secure method, but once you were in, you’d be able to view the records of anyone on the system! Clearly a terrible thing, but this type of security hole has always existed; here’s an example from a few years ago, concerning Apple’s password reset process.
But it’s ok, we know about GUIDs now! GUIDs to the rescue! If the URL looks like this:
, and the website retrieves each patient’s private information via a GUID, then we’re perfectly safe, no?
No, not safe (sense 1)
In the sense that a GUID is ‘hard for a human to guess’, yes, you’re pretty safe. No-one is going to randomly type in some digits and hit upon a valid GUID. BUT:
1. What if it’s not someone typing? What if it’s a piece of software making thousands of guesses per second? What if it’s a distributed botnet, millions of pieces of software each making thousands of guesses per second? * And your site has millions of users? The probability of making at least one correct guess would not be negligible.
[* Presuming your server can cope with all those requests…]
2. What if your GUID isn’t all that random? There are people out there who, given enough examples, can derive information about your server’s random number generator, and make accurate guesses about historical and future GUIDs. The situation is worse still if you’ve used sequential GUIDs; and potentially catastrophic if the developer who’s written the code hasn’t really understood the point of them (see previous post).
Point 1 can be addressed by putting checks in place for incorrect guesses. Modern server set-ups will allow you to block IP addresses after unusual patterns of requests have been detected.
Point 2 is harder to address: how truly random is the code that generates your GUIDs? In practice, you just don’t know. The study of random number generation is an entire academic discipline of study, you could devote the rest of your life to assessing various random number generators!
But the above doesn’t matter because…
No, not safe (sense 2)
This is the crux of the matter: obscurity (hiding something) is not security. All it does is make something harder to locate, not impossible. Just because you can’t guess a GUID, it doesn’t mean there aren’t other ways of obtaining them, e.g.:
- Leaking (e.g. via accidental email forwarding)
- Social engineering
- Mistakes (e.g. devs ‘temporarily’ storing them in webpage HTML)
The point is that proper security needs to be applied on top, in all cases. Where sensitive information is concerned, people should be logging in securely, with as robust a system as it’s possible to use. So even if a bad person stumbles across information (e.g. GUIDs) they shouldn’t have, they can’t use it, because they don’t have the requisite access.
To sum up, using a GUID on its own is nowhere near adequate-enough security, there’s much more to be done. A final tip: Given that there’s no such thing as “100% secure”, your goal should be to record all user activity on your site / app, and dive into it regularly to check for breaches or unusual patterns. As a bonus, you get to see how users are really using your software — I promise there’ll be some surprises in there!
In consumer lending, there are many constraints that have to be satisfied before any money is paid out to a customer. Off the top of my head, the customer must be:
- In employment (not always, but usually)
- Over 18 (and under some policy-defined upper age limit)
- Not bankrupt or otherwise insolvent
- Accepted by the credit risk scorecard
- Not a PEP (Politically Exposed Person)
- Not an SDN (Specially Designated National), someone the government doesn’t want us to do business with
- Not suffering from mental health problems, or otherwise vulnerable
- Not matched by our own ‘do not lend’ list – i.e. they don’t match (or have links to) previous frauds or undesirable customers
- Accepted by the fraud risk scorecard
- Not suffering from an addiction to gambling
- Not currently in prison
- Thought to be telling the truth about their financial situation
(The exact list isn’t important here.)
Some of these statuses are easier to determine than others, but it’s impossible to get any of them 100% correct. Credit files are certainly imperfect, and the data they contain is always going to be out-of-date to some extent; PEP/SDN lists are pretty fuzzy; and of course, some people lie about their financial situation in order to get a loan paid out.
For the sake of demonstration, let’s assume we’re 99% accurate for each one: we get it wrong one time in a hundred. Given that error rate, what’s the probability of getting the overall decision wrong — that is, not satisfying every constraint — and paying out to someone we shouldn’t? Simple probability tells us it’s one minus the probability of getting all of them right, so for the 13 constraints above, it’s 1 – (0.99 ^ 13) = 12.2%. In other words, we’ve a one in eight chance of paying out when we shouldn’t; the consequences of which could be loss of money, regulatory or legal issues, damage to reputation, etc.
You could break some of those constraints into smaller parts: e.g. affordability is a calculation that relies on both income and expenditure being calculated correctly. Hence, satisfying those 13 constraints is down to many more smaller decisions (a scorecard involves hundreds of calculations). In which case, our situation is even worse: at our 99% error rate, it only takes 69 decisions to make it more likely that our overall decision is wrong! (1 – (0.99 ^ 69) = 50.02%.)
Luckily, the real-life situation isn’t this bad:
- For some decisions we’re far better than 99% accurate, e.g. “aged over 18” is reasonably simple to get right
- We test our scorecards thoroughly (yes, we do).*
- We corroborate information by obtaining it from several sources
- The various decisions aren’t all independent (our calculation assumes they are)
* Note that we’re not concerned here with scorecard predicting correctly – just functioning correctly, in accordance with how it was built.
To be honest, I’ve never estimated what the actual “incorrect lending decision” rate might be (my gut feeling is it’s satisfactorily low enough), but I’ve put it on my list of things to consider. And, as ever, I know that I won’t have a hope of getting an accurate answer if the data isn’t readily available and undistorted.
Following on from the previous post, here’s my own recipe for a people-matching system. I don’t go into all the gory technical details – the post would be five times the length – but I hope there’s enough here to get the theme across.
Let’s assume a few things:
- We have a large number of Person records (millions?), all from the UK.
- We believe that a significant proportion of these records are duplicates or redundant, as they represent the same people; but the details may differ.
- We only want to identify the records that match; we’re not looking to build a single all-encompassing representation of a person, wherein data from one record ‘enriches’ another. (More to come on that subject in a future post.)
- The quality of data in each Person record varies, maybe markedly; some records may be customers we have a long-standing relationship with, and are very confident of their details. But other records might come from less trusted or unverified sources, e.g. a survey on the website that promises a voucher in return for answering some questions and entering a few personal details; these details would clearly be less reliable.
- Our records have the following details (‘data points’):
- First name
- Last name
- Date of birth
- Mobile phone number
- Email address
, plus a unique PersonID. We might have other data points available, e.g. home telephone number, employment details, bank account number / sort code, and we could absolutely use them for matching. For now, we’ll concentrate on the six above.
- Every Person record only has one set of details, that is, just the six pieces of information that we’ll use for matching. This restriction isn’t that realistic – people can have more than one mobile number or email address, and they move house, and occasionally change their names. Taking this into account in a system doesn’t significantly alter the structure of what’s being proposed here.
For each one of these data points, we want to generate and store what is known as the canonical version: that is, the version that is as standard as we can possibly make it. For names, the simplest thing to do is remove spaces, hyphens and apostrophes, which will help us match “Peter O’Toole” against “Peter Otoole”, or “Andrew Lloyd Webber” against “Andrew Lloyd-Webber”. Dates of birth don’t need changing, as long as they’re being stored in a sensible format (so in SQL we should be using DATE, and not VARCHAR). Phone numbers might be accepted in our system as “+442077319333” or “020 7731 9333”, but we need to translate them into one format – I’d go for the former.
Email addresses: As mentioned in the the previous post, email domains can be synonyms of each other, e.g. gmail.com and googlemail.com. Depending on your data, making domains canonical might be too much effort. Also, with some providers (including gmail) the local part of the address (the left-hand side of the @ symbol) can contain labels, so you’ll need to strip those out to make the canonical version.
Addresses: Making canonical versions of bricks-and-mortar addresses can be the biggest headache, which is why you’ll need an address validator: a piece of software that takes an unformatted address, and turns it into a formatted address. Examples of address validators include PCA Predict (formerly Postcode Anywhere) and Experian Data Quality (formerly QAS). The way they work is that if we present, e.g.
10 SW1A 2AA
10 Downing Street City Of Westminster LONDON
to the validator, then we’ll get the same result back:
10 Downing Street LONDON SW1A 2AA
If you pay for the extra data, then you can get the UDPRN (Unique Delivery Point Reference Number), which maps every postal address into an 8-digit code. If not, you can make your own loose version of it by hashing the text of the validated address, which you can then use for comparing.
From this point on, I’ll assume (a) that all our data points are canonical, and (b) there are no known ‘test’ people within our data – if there are, we should remove them from any matching process.
Excluding common data points
Next, we’ll need to consider generating lists of common and/or dummy data. For example, what are the most popular mobile numbers in the data?
In SQL, our query would look something like this:
SELECT TOP 20 Mobile, COUNT(1) AS Total FROM dbo.Person WHERE Mobile IS NOT NULL GROUP BY Mobile HAVING COUNT(1) > 1 ORDER BY COUNT(1) DESC
If your data comes to you via third parties (e.g. affiliates, introducers, brokers), I should imagine you’ll see something like:
+447000000000 +447777777777 +447123456789 +447111111111 ...
with some surprisingly large counts against each. You clearly can’t use these mobile numbers for matching; in fact, you should look at all the associated Person records, to see if it’s worth excluding everyone with obvious dummy numbers like these. But you’ll probably just want to exclude this data point from the match, not exclude the entire Person record.
Similarly, you’ll need to do the same for email addresses, e.g. email@example.com, firstname.lastname@example.org, etc.; and home telephone numbers, if you capture them. (As mentioned previously, when faced with a home telephone number field, some people like to enter phone numbers for their local Chinese restaurants, taxi firms and pubs.)
It’s not as necessary to identify common names and addresses, although you might see a few Joe Bloggs, John Smiths and David Camerons. [Brief aside: it’s always useful to have a table containing the headline details of famous people, so you can stop them getting into your system at the point of entry. It’s unlikely you’ll end up with the Prime Minister’s mobile number or email address in your database – but his name and address are no secret!]
We’re at the point where our data is as standard (canonical) as we can make it, and we have lists (tables) of data points we know we’re NOT going to be matching.
The next step is to ask: Given two sets of details, by what criteria would we confidently say “yes, these two people are the same”? A complete match would be:
First name, Last name, Date of birth, Address*, Mobile, Email, Home phone
, i.e. every single piece of information – but that’s the most obvious case, and there are certainly subsets of these details that we’d be happy to accept.
(* Address meaning either a UDPRN or a hashed version of the text data returned from the validator.)
If the details of Person A matched those of Person B, for any set of the following details, I’d consider the match to be a hard match:
- Set 1: First name, Last name, Postcode, Mobile, Email
- Set 2: First name, Last name, Address, Mobile
- Set 3: First name, Last name, Address, Email
- Set 4: Last name, Date of birth, Postcode, Mobile, Email
- Set 5: First name, Last name, Date of birth, Address
- Set 6: Date of birth, Address, Mobile
- Set 7: Date of birth, Address, Email
where Set 1 is the best, and Set 7 the ‘least best’. (You, and your data, of course may disagree.) So, every single piece of information in a set has to be identical between person A and person B for us to consider it a match.
Note that no set of details is a subset of any other; the larger set of details would be redundant.
Remember that, if at all possible, our criteria shouldn’t match the following people:
- Parents and children with the exact same name
- Older relatives borrowing mobile numbers and email addresses from other family members
For each Person record, we’ll make (up to) 7 separate hashes, one for each set of details, and store them in a new table. An MD5 hash takes up 16 bytes, so if we had a million records, it’s a maximum of 150MB in space (if we include numeric IDs for the Person and for the set of details).
Finding hard matches between Person records then becomes as simple as finding hashes with more than one PersonID entry in our table – a simple GROUP BY, which if indexed properly, will be lightning quick.
(NB: If we’ve decided that a piece of information in a Person record isn’t suitable for matching, due to being too common or an obvious dummy, then we can’t generate any hashes that include that piece of information; we can’t just put a blank in, the match would be too loose. That’s why it’s “(up to) 7” – not “7” – hashes.)
What about the rest of our data? We’ve three options:
- Do nothing. Depending on our requirements, this might be ok; in the grand scheme of things, what’s a handful of duplicate emails / letters / phone calls?
- Manual matching. Again, depending on our situation, this might be perfectly feasible. If there’s any human intervention (for example, a phone call) in our end-to-end process, then it could be straightforward to give your customer-facing employees a screen that says “We’ve found these sets of details that might be the same person, do you agree?”, then let them approve or reject the match.
- Scoring. Give various amounts of points to each part of the match criteria, depending on the strength of match between data points. If you like, you can consider this a soft match (cf. the hard matching above).
The second and third options require that you’ve some reason to believe a match might exist in the first place, e.g. an email address or mobile number in common; or a combination of date of birth and postcode. With the right indexes on the data, it’s trivial to generate these candidate lists quickly. (Of course, you don’t need to attempt to match Person records you’ve already matched against each other.)
The goal is to come up with a scoring mechanism such that likely matches get more ‘points’ than unlikely ones. Outside of some Very Hard Computer Science (probably involving neural networks), I don’t know of any standard ways of generating scorecards where the outcome is unknown – if the outcome is known, then standard methods apply (Google: Building a scorecard.) However, for this particular application, common sense should get us where we want to go.
For string data like names (first or last), we can make an educated guess at a scoring mechanism, e.g.
- Full match: 10 points
- One letter out: 9 points
- Name A is contained within B (or vice-versa), e.g. “Carter” would match “Knowles-Carter”: 7 points
- First n letters of A = First n letters of B, where 1 ≤ n ≤ 4: n points
- Last n letters of A = Last n letters of B, where 2 ≤ n ≤ 5: n-1 points
- Both names are longer than 5 letters, and have 2 letters incorrect: 2 points
I don’t know that the above is optimal, but it feels like a good starting point.
How do we assess how far apart strings are? We need a function that, given two strings, calculates some notion of ‘distance’ between them. Such functions already exist, and the Levenshtein distance (the number of insert/replace edits needed to turn one string into another) is a popular choice.
For non-string data like dates of birth, we’d start with something like:
- Full match: 8 points
- One day out: 7 points
- Month / day swapped: 6 points
- One month out: 5 points
- One year out : 4 points
- Day and month correct: 3 points
etc. I’m sure you get the idea: the point is we have lots of options!
Hence, we build up a scorecard that takes two sets of details and returns a single number, a score. We can apply this scoring function to every pair of details in our candidate list, and generate a score for all matches. Inspecting these matches by eye, it becomes fairly obvious what a good threshold for acceptance would be, and what changes to our scorecard are needed. Within a few iterations, we should have a scorecard and threshold that give us the balance we need between matching correct sets of details (where correctness is assessed by eye, by manually checking a sample), and not matching incorrect sets of details.
The scorecards can get as complicated as you like: you can award extra points for an uncommon domain in the email addresses, or for having the same post town, or the same home phone number area code; you can subtract points for too many repeated digits in the mobile numbers, or having 1st January as the date of birth, or having a common name (‘John Smith’). The law of diminishing returns applies in spades – you have to determine whether the extra effort is warranted.
One very important thought: you MUST record exactly how a match between two sets of details was made. Was it a hard match? If so, which one? If the match was via a scorecard, which scorecard was it? (You’ll end up with more than one.) What was the score? What was the threshold that it passed? If you don’t record this information, you can’t hope to reliably improve on the matching process in future.
I’ve not finished on this subject yet, but hopefully this article gives you some ideas about building your own person-matching system, should you require one. I’ve yet to cover:
- Generating single person objects from matches
- How you practically use such objects within your applications
I’ll come back to these topics soon.
As ever, if you’ve any comments, I’d love to hear them!