I’m going to spend a few posts on a subject that is, frankly, the bane of my life: people matching. That is, given two sets of person-related details, do I believe they are the same person? It’s eminently useful for many things including keeping marketing costs down, improving customer service, and very importantly, preventing fraud.
If your dataset(s) contain a unique person key, e.g. Social Security Number in the USA, or National Insurance Number here in the UK, then the task is obviously pretty simple (barring errors in the data). If there’s no unique person key, you’ve got a great deal more work to do. I’d say it follows a 95/5 rule: matching the first 95% of your dataset takes 5% of the time, and the 5% that’s left takes the remaining 95% of the time. (That’s why it causes me grief: you can end up writing reams of code to match a handful of details, in a never-ending quest for greater accuracy!)
Before I start discussing how I’d do people matching in a “perfect world” scenario, I’m going to list some of the problems I’ve encountered when trying to match data from UK sources.
- Shortened or alternative forms of the first name: e.g. Bill / William, Peggy / Margaret, Jack / John. And these days, Alfie probably isn’t short for Alfred, just as Harry probably isn’t short for Harold (or even a boy’s name).
- As per the above, I wouldn’t ever assume a particular first name implies a gender; you’ll be wrong at some point, and an awkward conversation might ensue.
- Using middle names as first names; famous examples include Hannah Dakota Fanning, William Bradley Pitt, Walter Bruce Willis, James Paul McCartney, Laura Jeanne Reese Witherspoon.
- Married names: people taking their spouse’s last name, without any restrictions on gender.
- Double-barrelling last names with spouse or partner.
- Very common names – names like ‘George Smith’ and ‘Claire Wilson’ mean placing more reliance on other pieces of information when matching.
- In my experience, Mr/Ms/Miss/Mrs etc. are rarely correct enough to rely on to indicate gender or married status*, even when the primary source is data the customer has entered themselves. Also, the gender-neutral Mx is becoming increasingly common.
- Let’s not even get into the realms of Professor, Doctor, Lord/Lady, Reverend and assorted military titles…
* Using gender and married status purely as aids to matching people, nothing else.
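To make the shortened-name problem a bit more concrete, here’s a minimal Python sketch of a nickname lookup. The `NICKNAMES` table and both function names are my own inventions for illustration, and the table only holds the three examples from the list above; a real one would be far larger and would still miss cases (Alfie and Harry being the obvious warnings).

```python
# Hypothetical, deliberately tiny lookup table; a real one would be far larger.
NICKNAMES = {
    "bill": {"william"},
    "peggy": {"margaret"},
    "jack": {"john"},
}


def first_name_candidates(name: str) -> set:
    """Return the cleaned name plus any formal names it might stand for."""
    key = name.strip().lower()
    return {key} | NICKNAMES.get(key, set())


def first_names_may_match(a: str, b: str) -> bool:
    """True if the two first names share at least one candidate form."""
    return bool(first_name_candidates(a) & first_name_candidates(b))
```

A `True` here should only ever count as weak evidence towards a match, to be weighed against everything else you know about the two records.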
Dates of birth
It’s very easy to get the date of birth wrong with mis-typing, or getting the month and day the wrong way round. Also, people (a) don’t like to give their birthdate out, so may give a dummy one (1st Jan 1970 is common), or (b) will lie about their age if they think it improves their chances of obtaining a product or service.
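As a rough illustration, the sketch below compares two dates of birth and flags the two problems just described: a known dummy date, and the day and month being transposed. The `DUMMY_DATES` set and the result labels are assumptions of mine, not any kind of standard.

```python
from datetime import date

# Assumed placeholder list; extend it with whatever dummies your own data shows.
DUMMY_DATES = {date(1970, 1, 1)}


def compare_dob(a: date, b: date) -> str:
    """Classify how two dates of birth relate, for use as one matching signal."""
    if a in DUMMY_DATES or b in DUMMY_DATES:
        return "unreliable"
    if a == b:
        return "exact"
    # Same year, with day and month the wrong way round (3 April vs 4 March).
    if a.year == b.year and a.month == b.day and a.day == b.month:
        return "day_month_swapped"
    return "different"
```

The swap check can only fire when both day values are 12 or lower, since otherwise one of the two dates couldn’t have been valid in the first place.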
People with “non-traditionally British” names
- People from other countries adopting a Western-style first name alongside their traditional birth-name (e.g. Chinese people).
- First names / family names may not be in the expected order (again, e.g. Chinese).
- Names that have more than one translation into English, e.g. Mohammed / Muhammad / Mohamed.
- Different character sets! Greek, Cyrillic, Arabic, etc.
(“Non-traditionally British” is an ugly turn of phrase; there must be a better way of putting it…)
- Fathers and sons with exactly the same first, middle and last names. (Far more common than you’d think!)
- Twins; especially twins with very similar first names (Mia/Mya, Ethan/Evan).
- You can’t reliably infer relationships using only differences in age; two customers from the same family, 32 years apart, could potentially be siblings, parent/child, or even grandparent/grandchild.
- Living at more than one address; in particular, students living away from home.
- Moving house, sometimes within the same postcode, or even next door.
- Postcodes not existing yet on the Postcode Address File, although you may find them on Google Maps(!)
- Postcodes becoming invalid / retired, e.g. postcodes in the districts BS12, BS17-19.
- Postcodes becoming valid: the district E20 was previously used only for the fictional TV soap EastEnders, but postcodes in this district have now started to be allocated for real addresses.
- Roads can be renamed [BBC]
- Buildings can be split into flats.
- Different naming conventions; flats in Scotland can be named by floor number / flat number, e.g. 2/1 (2nd floor, 1st flat).
Some address-related problems can be solved by using the Unique Property Reference Number (UPRN) or the Unique Delivery Point Reference Number (UDPRN) to represent the address, but neither of these has widespread adoption yet.
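Before comparing postcodes at all, it helps to get them into a single layout. Here’s a small sketch that does just that; `normalise_postcode` is a name I’ve made up, and it only tidies the formatting. It says nothing about whether the postcode exists, has been retired, or is on the Postcode Address File.

```python
import re
from typing import Optional


def normalise_postcode(raw: str) -> Optional[str]:
    """Upper-case the postcode, drop punctuation, and re-insert the one space."""
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    # Without the space, a UK postcode is 5 to 7 characters long, and the
    # inward part (after the space) is always the last three characters.
    if not 5 <= len(compact) <= 7:
        return None
    return f"{compact[:-3]} {compact[-3:]}"
```

With this, `"e201aa"` and `" E20  1AA "` both come out as `"E20 1AA"`, so they can at least be compared as equal strings.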
- Having more than one email address.
- Labels, e.g. fred.smith+SPAM@mailbox.com and fred.smith+NOTSPAM@mailbox.com. The canonical version of the email address would be fred.smith@mailbox.com, which may be more useful for matching purposes.
- Temporary inboxes, e.g. Mailinator.
- Format: validating the syntax of 99% of email addresses is straightforward, but getting the full 100% is almost impossible. See here [wikipedia] for a brief explanation about which characters are allowed in an email address.
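For the labels problem in particular, a canonical form is easy enough to sketch. The function below is my own illustration and rests on two assumptions: that the mail provider treats everything after a `+` in the local part as a label, and that lower-casing the whole address is safe (strictly speaking, the local part can be case-sensitive).

```python
def canonical_email(address: str) -> str:
    """Strip any +label from the local part and lower-case the address."""
    cleaned = address.strip()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No @ at all, so it isn't an email address; hand it back lower-cased.
        return cleaned.lower()
    # Assumption: the provider ignores everything from the first + onwards.
    local = local.split("+", 1)[0]
    return f"{local}@{domain}".lower()
```

I’d store the canonical form alongside the address the customer actually typed, rather than instead of it; the original is still what you need for sending mail.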
- Having more than one mobile number.
- 070 ‘personal’ numbers
Home phone numbers
- Having more than one home phone number.
- Not having a phone number, but entering one belonging to a friend or relative.
- Not having a phone number, so using the number of a local taxi firm, public house, or fast-food restaurant (again, more common than you might think).
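Whatever the story behind a phone number, mobile or home, you can’t compare two of them until they’re written the same way. Below is a rough sketch that rewrites UK numbers into a +44 form; `normalise_uk_phone` is a hypothetical helper, it assumes any number starting with a single 0 is a UK national number, and it doesn’t check lengths or whether the number is actually in use.

```python
import re


def normalise_uk_phone(raw: str) -> str:
    """Rewrite a UK phone number as +44...; leave other numbers as typed digits."""
    digits = re.sub(r"[^\d+]", "", raw)  # keep only digits and plus signs
    if digits.startswith("00"):
        # 00 is the international dialling prefix, equivalent to +.
        digits = "+" + digits[2:]
    if digits.startswith("+44"):
        rest = digits[3:]
        # People often write +44 (0)7700..., so drop the stray trunk zero.
        if rest.startswith("0"):
            rest = rest[1:]
        return "+44" + rest
    if digits.startswith("0"):
        # Assumption: a leading single 0 means a UK national-format number.
        return "+44" + digits[1:]
    return digits
```

So `"07700 900123"`, `"+44 (0)7700 900123"` and `"0044 7700 900123"` all end up as the same string, while a non-UK number such as `"+1 202 555 0100"` just has its spaces removed.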
- Having more than one bank account
- People not moving their bank accounts when they move house. (I live 80 miles away from my nominal branch.)
- Sort codes changing, especially when banks merge or split.
- Joint bank accounts
- Business bank accounts
Debit and credit cards
You almost certainly shouldn’t be storing card details…! [www.theukcardsassociation.org.uk]
- Accidental mis-typing
- Deliberate fraud – typically, the name and address might be real, but the mobile and email will be the fraudster’s.
- System testing: internal (dev or UAT environment) vs. external (penetration testing), manual vs. automated, regular (e.g. employees) vs. irregular (e.g. competitors testing capabilities; hackers!)
- Details not existing: some people don’t have home telephone numbers (so put their mobile number in that field instead), whereas other people don’t have mobiles (so they put their home number instead).
- People just messing around, possibly not-very-maliciously.
- Older people using younger family members’ email addresses and/or mobile numbers.
- People who work overseas and have non-UK mobile number and address; they could be a valid customer, as per your policies, but with only non-UK contact details. Do your systems accept a phone number that doesn’t start +44?
- Driving Licence / Passport: most existing systems only validate the format of the identifying numbers, which makes them a target for fraudsters. Newer systems can validate images of the documents.
- Device IDs are great for fraud detection, but can present a problem when matching people; families often share devices, and what about public computers in libraries and internet cafes?
- Electoral Roll: Being on the full electoral roll at an address is no guarantee that the person is living there, and the converse is also true.
Third-party services exist to validate/verify almost all the information above, singly and together. However, none of the services are perfect, so matching person-level data comes down to cost (third-party data and development time), and your tolerance for mistakes: how embarrassing might it be if you get it wrong?
If you have any examples of when matching personal details has proved trickier than you thought it was going to be, please let me know in the comments below!