Posts Tagged coding
As I wrote about here, I like to pass structured data around using XML (where applicable). While I think it’s the most flexible way, it’s understandable that people might not want to learn XML-parsing syntax, if they don’t have to. Another way of passing data into stored procedures (sprocs) is to use user-defined types (UDTs).
Here’s a simple example:
-- If the UDT already exists, delete it:
IF EXISTS (
    SELECT *
    FROM sys.types
    WHERE is_table_type = 1
    AND [name] = 'ListOfDates'
    AND [schema_id] = SCHEMA_ID('dbo')
)
BEGIN
    DROP TYPE dbo.ListOfDates
END
GO

-- Create our UDT:
CREATE TYPE dbo.ListOfDates AS TABLE
(
    DateTypeID TINYINT NOT NULL
    ,[Date] DATE NOT NULL
)
GO

-- Let's test it out:
SET DATEFORMAT YMD
DECLARE @MyListOfDates AS dbo.ListOfDates
INSERT @MyListOfDates(DateTypeID, [Date])
    VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03')
        ,(2, '2016Oct10'),(2, '2017Jan01')
SELECT * FROM @MyListOfDates
GO

DateTypeID Date
---------- ----------
1          2016-01-01
1          2016-01-02
1          2016-01-03
2          2016-10-10
2          2017-01-01
Hopefully, that looks straightforward; the way we’ve used it here is not dissimilar to a table variable (e.g. DECLARE @MyTable AS TABLE...), but we can’t pass a table variable to a sproc. We’ll create a test sproc and pass our new user-defined type to it:
-- If the sproc already exists, delete it:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
    DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create our sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
    @ListOfDates AS dbo.ListOfDates READONLY
)
AS
BEGIN
    SELECT DateTypeID
        ,MIN([Date]) AS MinDate
        ,MAX([Date]) AS MaxDate
    FROM @ListOfDates
    GROUP BY DateTypeID
    ORDER BY DateTypeID
END
GO

-- Test it out:
DECLARE @MyListOfDates AS dbo.ListOfDates
INSERT @MyListOfDates(DateTypeID, [Date])
    VALUES(1, '2016Jan01'),(1, '2016Jan02'),(1, '2016Jan03')
        ,(2, '2016Oct10'),(2, '2017Jan01')
        ,(3, '2017Feb01'),(3, '2017Feb03'),(3, '2017Feb28')
EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates
GO

DateTypeID MinDate    MaxDate
---------- ---------- ----------
1          2016-01-01 2016-01-03
2          2016-10-10 2017-01-01
3          2017-02-01 2017-02-28
See the READONLY in the CREATE PROCEDURE? Here’s what happens when you omit it:
The table-valued parameter "@ListOfDates" must be declared with the READONLY option.
This is one of the limitations of user-defined types: they can’t be used to pass data back. So we couldn’t make any changes to @ListOfDates in the sproc, and see those changes reflected outside of the sproc.
There’s another limitation: UDTs can’t be used cross-database. Here’s what happens:
The type name 'MyOtherDatabase.dbo.ListOfDates' contains more than the maximum number of prefixes. The maximum is 1.
Even if you create the exact same type in another database, with the same name, it won’t work.
(Note that UDTs don’t have to be tables: go here [MSDN] for more info.)
Just for kicks, here’s how I’d replicate the above functionality, but using XML:
-- If the sproc already exists, delete it:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
    DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create the sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
    @ListOfDates XML
)
AS
BEGIN
    WITH cte_ExtractFromXML AS
    (
        SELECT DateTypeID = d.i.value('(@type)', 'INT')
            ,[Date] = d.i.value('.', 'DATE')
        FROM @ListOfDates.nodes('//date') d(i)
    )
    SELECT DateTypeID
        ,MIN([Date]) AS MinDate
        ,MAX([Date]) AS MaxDate
    FROM cte_ExtractFromXML
    GROUP BY DateTypeID
    ORDER BY DateTypeID
END
GO

-- Test it with some XML data:
DECLARE @MyListOfDates XML
SET @MyListOfDates = '
<listOfDates>
    <date type="1">2016Jan01</date>
    <date type="1">2016Jan02</date>
    <date type="1">2016Jan03</date>
    <date type="2">2016Oct10</date>
    <date type="2">2017Jan01</date>
    <date type="3">2017Feb01</date>
    <date type="3">2017Feb03</date>
    <date type="3">2017Feb28</date>
</listOfDates>'
EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates
GO

DateTypeID  MinDate    MaxDate
----------- ---------- ----------
1           2016-01-01 2016-01-03
2           2016-10-10 2017-01-01
3           2017-02-01 2017-02-28
If we wanted to, we could declare our @ListOfDates parameter as OUTPUT, and make changes to it in the sproc, e.g.:
-- Drop the sproc if it exists:
IF OBJECT_ID('dbo.usp_MySproc') IS NOT NULL
BEGIN
    DROP PROCEDURE dbo.usp_MySproc
END
GO

-- Create our (new and improved) sproc:
CREATE PROCEDURE dbo.usp_MySproc
(
    @ListOfDates XML OUTPUT
)
AS
BEGIN

    DECLARE @Aggregated XML

    ;WITH cte_ExtractFromXML AS
    (
        SELECT DateTypeID = d.i.value('(@type)', 'INT')
            ,[Date] = d.i.value('.', 'DATE')
        FROM @ListOfDates.nodes('//date') d(i)
    )
    SELECT @Aggregated =
    (
        SELECT DateTypeID AS '@id'
            ,MIN([Date]) AS '@MinDate'
            ,MAX([Date]) AS '@MaxDate'
        FROM cte_ExtractFromXML
        GROUP BY DateTypeID
        ORDER BY DateTypeID
        FOR XML PATH('DataType'), ROOT('Aggregated')
    )

    SELECT @ListOfDates =
    (
        SELECT @ListOfDates, @Aggregated
        FOR XML PATH('Output'), TYPE
    )

END
GO

-- Run it:
DECLARE @MyListOfDates XML
SET @MyListOfDates = '
<listOfDates>
    <date type="1">2016Jan01</date>
    <date type="1">2016Jan02</date>
    <date type="1">2016Jan03</date>
    <date type="2">2016Oct10</date>
    <date type="2">2017Jan01</date>
    <date type="3">2017Feb01</date>
    <date type="3">2017Feb03</date>
    <date type="3">2017Feb28</date>
</listOfDates>'
EXEC dbo.usp_MySproc @ListOfDates = @MyListOfDates OUTPUT
SELECT @MyListOfDates AS MyListOfDates
GO

<Output>
  <listOfDates>
    <date type="1">2016Jan01</date>
    <date type="1">2016Jan02</date>
    <date type="1">2016Jan03</date>
    <date type="2">2016Oct10</date>
    <date type="2">2017Jan01</date>
    <date type="3">2017Feb01</date>
    <date type="3">2017Feb03</date>
    <date type="3">2017Feb28</date>
  </listOfDates>
  <Aggregated>
    <DataType id="1" MinDate="2016-01-01" MaxDate="2016-01-03" />
    <DataType id="2" MinDate="2016-10-10" MaxDate="2017-01-01" />
    <DataType id="3" MinDate="2017-02-01" MaxDate="2017-02-28" />
  </Aggregated>
</Output>
As you can see, we’ve returned our original list, along with the aggregated data, both under a new parent node.
In my previous post, I wrote about GUIDs (aka UUIDs), marvellous datatypes that allow you to uniquely identify a piece of information. Because the probability of randomly generating two the same is so small, there’s a sense that merely possessing one as a key to a piece of data means the data is somehow secured. This is false, in two senses — the first sense, not so critical; the second, very critical.
Before that, let’s look at how they might be used. I’ll use URLs / webpages as an example, because it’s a common application.
Let’s say I’ve written my own blogging system. The text of each article is stored in a database table, and the primary key of the table uniquely identifies my blog posts. When I list my blog posts so people can click on them, the URL might look something like this:

http://www.example.com/ShowArticle?articleId=17

The unique identifier — in this example, the number 17 — forms part of the URL, and someone can look at all my blog posts merely by changing the number after the equals sign. It’s ok, it’s only a blog; the worst they can do is put in a number of a post I haven’t published yet. Depending on how I’ve written my blogging engine, they’ll either see an unfinished post, or (preferably) a message saying there’s currently no live article with that id.
In the bad old days of internet website development, lesser developers used integer IDs to refer to data more sensitive than a blog post:

http://www.example.com/PatientRecord?patientId=17

(Of course, I’m making these URLs up!) The website might ask you to log into some part of the site via a secure method, but once you were in, you’d be able to view the records of anyone on the system! Clearly a terrible thing, but this type of security hole has always existed; here’s an example from a few years ago, concerning Apple’s password reset process.
But it’s ok, we know about GUIDs now! GUIDs to the rescue! If the URL looks like this:

http://www.example.com/PatientRecord?patientId=C2DFE25B-6C1B-46B3-9497-DA45EF76D994

, and the website retrieves each patient’s private information via a GUID, then we’re perfectly safe, no?
No, not safe (sense 1)
In the sense that a GUID is ‘hard for a human to guess’, yes, you’re pretty safe. No-one is going to randomly type in some digits and hit upon a valid GUID. BUT:
1. What if it’s not someone typing? What if it’s a piece of software making thousands of guesses per second? What if it’s a distributed botnet, millions of pieces of software each making thousands of guesses per second? * And your site has millions of users? The probability of making at least one correct guess would not be negligible.
[* Presuming your server can cope with all those requests…]
2. What if your GUID isn’t all that random? There are people out there who, given enough examples, can derive information about your server’s random number generator, and make accurate guesses about historical and future GUIDs. The situation is worse still if you’ve used sequential GUIDs; and potentially catastrophic if the developer who’s written the code hasn’t really understood the point of them (see previous post).
Point 1 can be addressed by putting checks in place for incorrect guesses. Modern server set-ups will allow you to block IP addresses after unusual patterns of requests have been detected.
Point 2 is harder to address: how truly random is the code that generates your GUIDs? In practice, you just don’t know. The study of random number generation is an entire academic discipline; you could devote the rest of your life to assessing various random number generators!
But the above doesn’t matter because…
No, not safe (sense 2)
This is the crux of the matter: obscurity (hiding something) is not security. All it does is make something harder to locate, not impossible. Just because you can’t guess a GUID, it doesn’t mean there aren’t other ways of obtaining them, e.g.:
- Leaking (e.g. via accidental email forwarding)
- Social engineering
- Mistakes (e.g. devs ‘temporarily’ storing them in webpage HTML)
The point is that proper security needs to be applied on top, in all cases. Where sensitive information is concerned, people should be logging in securely, with as robust a system as it’s possible to use. So even if a bad person stumbles across information (e.g. GUIDs) they shouldn’t have, they can’t use it, because they don’t have the requisite access.
To sum up, using a GUID on its own is nowhere near adequate security; there’s much more to be done. A final tip: given that there’s no such thing as “100% secure”, your goal should be to record all user activity on your site / app, and dive into it regularly to check for breaches or unusual patterns. As a bonus, you get to see how users are really using your software — I promise there’ll be some surprises in there!
If you know what GUIDs are, please click here to skip to the part where I talk about using them in databases.
GUIDs (Globally Unique Identifiers, aka UUIDs) are simply a string of 32 random hexadecimal digits (that is: characters 0-9 and A-F), separated by dashes into five groups of 8, 4, 4, 4 and 12 digits.
Here’s a GUID:

C2DFE25B-6C1B-46B3-9497-DA45EF76D994
All modern languages are able to generate them; here’s how I generated it in SQL:
SELECT NEWID()
GO

------------------------------------
C2DFE25B-6C1B-46B3-9497-DA45EF76D994

(1 row(s) affected)
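For comparison, here’s a quick sketch of the same thing in Python, using the standard-library uuid module (uuid4 gives you the same kind of random GUID that NEWID() does):

```python
import re
import uuid

# Generate a version-4 (random) GUID:
guid = uuid.uuid4()
print(guid)  # e.g. c2dfe25b-6c1b-46b3-9497-da45ef76d994 (yours will differ)

# It's 32 hex digits in five dash-separated groups of 8-4-4-4-12:
assert re.fullmatch(r'[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}', str(guid))

# Under the hood, it's just 16 bytes (128 bits):
assert len(guid.bytes) == 16
```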
A GUID is simply a big random number, presented in a human-readable form. How big is ‘big’? With 32 hex digits, it means a GUID can take any of 16^32 = 2^128 values. (2^128 is approximately 3.4 x 10^38)
GUIDs are big. They’re so big that you could label every atom in the universe using just 3 GUIDs. In fact, it’d be massive overkill: 3 GUIDs have a potential 2^384 values between them, which is approximately 3.94 x 10^115; the number of atoms in the universe is estimated at 10^82, many orders of magnitude less.
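That arithmetic is easy to verify; a quick check in Python:

```python
# One GUID: 32 hex digits = 16^32 = 2^128 possible values
assert 16 ** 32 == 2 ** 128
print(f"{2 ** 128:.3e}")            # 3.403e+38

# Three GUIDs side by side: (2^128)^3 = 2^384 possible labels...
assert (2 ** 128) ** 3 == 2 ** 384
print(f"{float(2 ** 384):.3e}")     # 3.940e+115

# ...comfortably more than the ~10^82 atoms in the observable universe:
assert 2 ** 384 > 10 ** 82
```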
Because GUIDs can take such an enormous range of values, the chances of generating a duplicate are minuscule. Quote:
“In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.” (wikipedia)
The ‘U’ in GUID basically means ‘unique for all practical purposes you’re likely to ever be involved with’ (unless you work for CERN, in which case I take it back).
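The quoted figure can be sanity-checked with the standard birthday-problem approximation, p ~ 1 - exp(-n^2 / (2d)). A sketch, assuming version-4 GUIDs (which have 122 truly random bits, because 6 of the 128 bits are fixed version/variant markers):

```python
import math

d = 2 ** 122                    # number of distinct random version-4 GUIDs
per_second = 1_000_000_000      # 1 billion GUIDs generated every second...
n = per_second * 100 * 365.25 * 24 * 3600   # ...for 100 years

# Birthday-problem approximation for P(at least one duplicate):
p = 1 - math.exp(-(n * n) / (2 * d))
print(round(p, 2))  # 0.61 - the same ballpark as the quoted "about 50%"
```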
So, that’s great: we have this construct that for all intents and purposes is unique (and I’ll treat it as such from here on), and we can generate one any time we want one. But how are they used?
The most common usage of GUIDs is as keys for referring to other pieces of information, especially a block of structured information. For example, when I request a customer’s credit file, there’s a GUID, right near the top of the file. If I need to refer to that credit file again (whether inside my organisation, or with the issuing bureau), I can refer to it by the GUID, and we all know exactly which file I mean — not just the customer/address it refers to, but the data as it stood at that point.
Now, database tables need a primary key to identify each row – and, by definition, the value of the key has to be unique. So it would seem a natural thing to want to have a GUID as a primary key. Even better: not only will we ensure that every row in our table will be unique, but every row in every table can be uniquely identified, in every database in the world! And you don’t even need to request a GUID from your database server when you create the data for a row, you can pre-generate primary keys in your C# code, and use them before they ever need to be stored on the server!
Sounds too good to be true, so what’s the catch?
First off, most developers, analysts (and even DBAs) talk about ‘primary keys’ when they mean clustering keys – often, they’re the same piece of information, but they absolutely don’t have to be. The primary key is the piece of data that uniquely identifies a row in a table. The clustering key is the piece of data that determines the order of the data when it’s stored (on disk). More often than not, a straightforward incrementing integer (1,2,3…) can do the job of both, but it’s an informed choice that the database developer should be making.
When the clustering key is an incrementing integer, organising the data on disk is easy: the data goes in the next available slot. But when it’s (effectively) a random number, where does it go? The database has to make guesses about how much data there’s likely to be; guesses that it’ll have to re-assess every time a new row needs to be INSERTed into the database – worst case, it’s re-organising the data on disk (page splits) every few INSERTs. This is really inefficient, and causes unnecessary stress on your server.
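You can see the difference without a database. A hypothetical sketch: treat a sorted list as a stand-in for the clustered index, and count how often a new key can simply be appended at the end (the cheap case) versus inserted somewhere in the middle (the case that forces the reshuffling described above):

```python
import bisect
import uuid

def appended_fraction(keys):
    """Fraction of keys that land at the end of the sorted 'index'."""
    index, appends = [], 0
    for key in keys:
        pos = bisect.bisect(index, key)
        if pos == len(index):        # cheap: goes in the next available slot
            appends += 1
        index.insert(pos, key)       # expensive when pos is mid-list
    return appends / len(index)

n = 2000
ints = list(range(n))                           # incrementing integer keys
guids = [str(uuid.uuid4()) for _ in range(n)]   # random GUID keys

print(appended_fraction(ints))    # 1.0 - every insert is an append
print(appended_fraction(guids))   # close to 0 - almost every insert lands mid-list
```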
Internally in SQL Server, GUIDs take up 16 bytes of space, compared to the 4 bytes of an INT, or 8 bytes of a BIGINT. That’s not a major issue, unless you have lots of indexes on your table: indexes on tables automatically contain the clustering key, so with a GUID clustering key, every single index defined on that table will also contain the GUID. Potentially lots of valuable space used up, if you’re not careful.
Ok, let’s list some bad points about GUIDs as clustering keys:
- They cause inefficiencies under the hood: the server can’t make its usual good guesses about where to store data. NB: There is such a thing as a SEQUENTIAL GUID (Info here at MSDN), which lessens the impact – personally, I still wouldn’t bother.
- They take up four times the space of traditional INTs, which could be a problem if you have lots of indexes.
- Table JOINs are slower; SQL Server is optimised for joining tables together via simple integers.
There’s another (very important) reason not to use them that people tend to overlook: it makes debugging and tracking down errors incredibly painful! Incrementing numbers are intuitive, easy to memorise (if they’re small enough), easy to compare (“x+z” happened after “x”)… but GUIDs are just a ‘blob’ of data, there’s nothing intuitive about them.
How to use GUIDs, pain-free
It’s simple: add a GUID as a normal column and index it!
ALTER TABLE dbo.Person
    ADD PersonUID UNIQUEIDENTIFIER NULL
GO

-- ...UPDATE the table to fill PersonUID here...

CREATE NONCLUSTERED INDEX IX_Person_PersonUID
    ON dbo.Person(PersonUID)
GO
That’s as complex as it needs to be.
Sad note: In the past, I’ve had developers add GUIDs that looked like this:
00000000-0000-0000-0000-000000000001
00000000-0000-0000-0000-000000000002
00000000-0000-0000-0000-000000000003
...
, thus completely missing the point of using a GUID in the first place. If you’re not auto-generating the GUIDs in-database, make sure to check devs aren’t putting daft data in. (Obviously something you should do anyway!)
Finally, we have our highly unique piece of information, stored in the best way possible. So now we can use that in a URL on a public-facing website to reference people’s data securely, no? Sorry, that’s not quite right — a little issue of this thing called Security Through Obscurity, which I’ll write about next time.
I’m going to spend a few posts on a subject that is, frankly, the bane of my life: people matching. That is, given two sets of person-related details, do I believe they are the same person? It’s eminently useful for many things including keeping marketing costs down, improving customer service, and very importantly, preventing fraud.
If your dataset(s) contain a unique person key, e.g. Social Security Number in the USA, or National Insurance Number here in the UK, then the task is obviously pretty simple (barring errors in the data). If there’s no unique person key, you’ve got a great deal more work to do. I’d say it follows a 95 / 5 rule: to match the first 95% of your dataset takes 5% of the time, the 5% that’s left takes the remaining 95% of the time. (Hence why it causes me grief: you can end up writing reams of code to match a handful of details, in a never-ending quest for greater accuracy!)
Before I start discussing how I’d do people matching in a “perfect world” scenario, I’m going to list some of the problems I’ve encountered when trying to match data from UK sources.
Names

- Shortened or alternative forms of the first name: e.g. Bill / William, Peggy / Margaret, Jack / John. And these days, Alfie probably isn’t short for Alfred, just as Harry probably isn’t short for Harold (or even a boy’s name).
- As per the above, I wouldn’t ever assume a particular first name implies a gender; you’ll be wrong at some point, and an awkward conversation might ensue.
- Using middle names as first names; famous examples include Hannah Dakota Fanning, William Bradley Pitt, Walter Bruce Willis, James Paul McCartney, Laura Jeanne Reese Witherspoon.
- Married names, people taking their spouse’s last name, without any restrictions on gender.
- Double-barrelling last names with spouse or partner.
- Very common names – names like ‘George Smith’ and ‘Claire Wilson’ mean placing more reliance on other pieces of information when matching.
- In my experience, Mr/Ms/Miss/Mrs etc. are rarely correct enough to rely on to indicate gender or married status*, even when the primary source is data the customer has entered themselves. Also, the gender-neutral Mx is becoming increasingly common.
- Let’s not even get into the realms of Professor, Doctor, Lord/Lady, Reverend and assorted military titles…
* Using gender and married status purely as aids to matching people, nothing else.
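To give a flavour of what handling just the first bullet point above involves, here’s a minimal sketch of canonicalising first names. The alias table is entirely made up; a real one would hold thousands of entries, and even then the result is only a weak ‘possible match’ signal, never a certainty:

```python
# Hypothetical, tiny alias table, purely to illustrate the approach:
ALIASES = {
    "bill": "william", "will": "william", "billy": "william",
    "peggy": "margaret", "meg": "margaret",
    "jack": "john", "jock": "john",
}

def canonical_first_name(name: str) -> str:
    key = name.strip().lower()
    return ALIASES.get(key, key)

def first_names_may_match(a: str, b: str) -> bool:
    # A weak signal only: 'Bill' ~ 'William', but a Jack might really be a Jack.
    return canonical_first_name(a) == canonical_first_name(b)

print(first_names_may_match("Bill", "WILLIAM"))  # True
print(first_names_may_match("Peggy", "Jack"))    # False
```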
Dates of birth
It’s very easy to get the date of birth wrong with mis-typing, or getting the month and day the wrong way round. Also, people (a) don’t like to give their birthdate out, so may give a dummy one (1st Jan 1970 is common), or (b) will lie about their age if they think it improves their chances of obtaining a product or service.
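One cheap check for the month/day transposition: does swapping day and month in one date produce the other? A hypothetical helper (the dates are illustrative):

```python
from datetime import date

def dob_swapped_match(a: date, b: date) -> bool:
    """True if b is a with day and month transposed (a common keying error)."""
    if a == b:
        return False
    try:
        return date(a.year, a.day, a.month) == b
    except ValueError:        # e.g. a day of 25 can't be a month
        return False

print(dob_swapped_match(date(1980, 3, 4), date(1980, 4, 3)))    # True
print(dob_swapped_match(date(1980, 3, 25), date(1980, 3, 25)))  # False
```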
People with “non-traditionally British” names
- People from other countries adopting a Western-style first name alongside their traditional birth-name (e.g. Chinese people).
- First names / family names may not be in the expected order (again, e.g. Chinese).
- Names that have more than one translation into English, e.g. Mohammed / Muhammad / Mohamed.
- Different character sets! Greek, Cyrillic, Arabic, etc.
(“Non-traditionally British” is an ugly turn of phrase, there must be a better way of putting it…)
Families

- Fathers and sons with exactly the same first, middle and last names. (Far more common than you’d think!)
- Twins; especially twins with very similar first names (Mia/Mya, Ethan/Evan).
- You can’t reliably infer relationships using only differences in age; two customers from the same family, 32 years apart, could potentially be siblings, parent/child, or even grandparent/grandchild.
Addresses

- Living at more than one address; in particular, students living away from home.
- Moving house, sometimes within the same postcode, or even next door.
- Postcodes not existing yet on the Postcode Address File, although you may find them on Google Maps(!)
- Postcodes becoming invalid / retired, e.g. postcodes in the districts BS12, BS17-19.
- Postcodes becoming valid: the district E20 was previously used only for the fictional TV soap Eastenders, but postcodes in this district have now started to be allocated for real addresses.
- Roads can be renamed [BBC]
- Buildings can be split into flats.
- Different naming conventions; flats in Scotland can be named by floor number / flat number, e.g. 2/1 (2nd floor, 1st flat).
Some address-related problems can be solved by using the Unique Property Reference Number (UPRN) or the Unique Delivery Point Reference Number (UDPRN) to represent the address, but neither of these has widespread adoption yet.
Email addresses

- Having more than one email address.
- Labels, e.g. fred.smith+SPAM@mailbox.com and fred.smith+NOTSPAM@mailbox.com. The canonical version of the email address would be fred.smith@mailbox.com, which may be more useful for matching purposes.
- Temporary inboxes, e.g. Mailinator.
- Format: Validating the syntax of 99% of email addresses is straightforward, getting the full 100% is almost impossible. See here [wikipedia] for a brief explanation about which characters are allowed in an email address.
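Here’s a sketch of deriving that canonical form. It assumes the common convention that everything from ‘+’ onwards in the local part is a label; providers differ, so in practice this would be per-domain logic, and the result should be treated as a matching key rather than a deliverable address:

```python
def canonical_email(email: str) -> str:
    """Strip '+label' from the local part and lowercase the whole address."""
    local, _, domain = email.strip().partition("@")
    local = local.split("+", 1)[0]   # drop the label, if any
    return f"{local}@{domain}".lower()

print(canonical_email("fred.smith+SPAM@mailbox.com"))     # fred.smith@mailbox.com
print(canonical_email("Fred.Smith+NOTSPAM@mailbox.com"))  # fred.smith@mailbox.com
```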
Mobile phone numbers

- Having more than one mobile number.
- 070 ‘personal’ numbers, which look like mobile numbers but aren’t.
Home phone numbers
- Having more than one home phone number.
- Not having a phone number, but entering one belonging to a friend or relative.
- Not having a phone number, so using the number of a local taxi firm, public house, or fast-food restaurant (again, more common than you might think).
Bank accounts

- Having more than one bank account
- People not moving their bank accounts when they move house. (I live 80 miles away from my nominal branch.)
- Sort codes changing, especially when banks merge or split.
- Joint bank accounts
- Business bank accounts
Debit and credit cards
You almost certainly shouldn’t be storing card details…! [www.theukcardsassociation.org.uk]
Other problems

- Accidental mis-typing
- Deliberate fraud – typically, the name and address might be real, but the mobile and email will be the fraudster’s.
- System-testing: internal (dev or UAT environment) vs. external (penetration testing); manual/automated; regular (e.g. employees) / irregular (e.g. competitors testing capabilities; hackers!)
- Details not existing: some people don’t have home telephone numbers (so put their mobile number in that field instead), whereas other people don’t have mobiles (so they put their home number instead).
- People just messing around, possibly not-very-maliciously.
- Older people using younger family members’ email addresses and/or mobile numbers.
- People who work overseas and have a non-UK mobile number and address; they could be a valid customer, as per your policies, but with only non-UK contact details. Do your systems accept a phone number that doesn’t start +44?
- Driving License / Passport : most existing systems only validate the format of the identifying numbers, which makes them a target for fraudsters. Newer systems can validate images of the documents.
- Device IDs are great for fraud detection, but can present a problem when matching people; families often share devices, and what about public computers in libraries and internet cafes?
- Electoral Roll: Being on the full electoral roll at an address is no guarantee that the person is living there, and the converse is also true.
Third-party services exist to validate/verify almost all the information above, singularly and together. However, none of the services are perfect, so matching person-level data comes down to cost (third party data and development time), and your tolerance for mistakes – how embarrassing might it be if you get it wrong?
If you have any examples of when matching personal details has proved trickier than you thought it was going to be, please let me know in the comments below!