Archive for October, 2016
If you know what GUIDs are, please click here to skip to the part where I talk about using them in databases.
GUIDs (Globally Unique Identifiers, aka UUIDs) are simply a string of 32 random hexadecimal numbers (that is: characters 0-9 and A-F), separated into five groups by dashes.
Here’s a GUID:
All modern languages are able to generate them; here’s how I generated it in SQL:
SELECT NEWID() GO ------------------------------------ C2DFE25B-6C1B-46B3-9497-DA45EF76D994 (1 row(s) affected)
A GUID is simply a big random number, presented in a human-readable form. How big is ‘big’? With 32 hex digits, it means a GUID can take any of 16^32 = 2^128 values. (2^128 is approximately 3.4 x 10^38)
GUIDs are big. They’re so big, that you could label every atom in the universe using just 3 GUIDs. In fact, it’d be massive overkill: 3 GUIDs have a potential 2^384 values between them, which is equal to 3.940201 x 10^115; the number of atoms in the universe is estimated at 10^82, many orders of magnitude less.
Because GUIDs can take such an enormous range of values, the chances of generating a duplicate are minuscule. Quote:
“In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.” (wikipedia)
The ‘U’ in GUID basically means ‘unique for all practical purposes you’re likely to ever be involved with’ (unless you work for CERN, in which case I take it back).
So, that’s great: we have this construct that for all intents and purposes is unique (and I’ll treat it as such from here on), and we can generate one any time we want one. But how are they used?
The most common usage of GUIDs is as keys for referring to other pieces of information, especially a block of structured information. For example, when I request a customer’s credit file, there’s a GUID, right near the top of the file. If I need to refer to that credit file again (whether inside my organisation, or with the issuing bureau), I can refer to it by the GUID, and we all know exactly which file I mean — not just the customer/address it refers to, but the data as it stood at that point.
Now, database tables need a primary key to identify each row – and, by definition, the value of the key has to be unique. So it would seem a natural thing to want to have a GUID as a primary key. Even better: not only will we ensure that every row in our table will be unique, but every row in every table can be uniquely identified, in every database in the world! And you don’t even need to request a GUID from your database server when you create the data for a row, you can pre-generate primary keys in your C# code, and use them before they ever need to be stored on the server!
Sounds too good to be true, so what’s the catch?
First off, most developers, analysts (and even DBAs) talk about ‘primary keys’ when they mean clustering keys – often, they’re the same piece of information, but they absolutely don’t have to be. The primary key is the piece of data that uniquely identifies a row in a table. The clustering key is the piece of data that determines the order of the data when it’s stored (on disk). More often than not, a straightforward incrementing integer (1,2,3…) can do the job of both, but it’s an informed choice that the database developer should be making.
When the clustering key is an incrementing integer, organising the data on disk is easy: the data goes in the next available slot. But when it’s (effectively) a random number, where does it go? The database has to make guesses about how much data there’s likely to be; guesses that it’ll have to re-assess every time a new row needs to be INSERTed into the database – worst case, it’s re-organising the data on disk every few INSERTs. This is really inefficient, and causes unnecessary stress on your server.
Internally in SQL Server, GUIDs take up 16 bytes of space, compared to the 4 bytes of an INT, or 8 bytes of a BIGINT. That’s not a major issue, unless you have lots of indexes on your table: indexes on tables automatically contain the clustering key, so with a GUID clustering key, every single index defined on that table will also contain the GUID. Potentially lots of valuable space used up, if you’re not careful.
Ok, let’s list some bad points about GUIDs as clustering keys:
- They cause inefficiencies under the hood: the server can’t make it’s usual good guesses about where to store data. NB: There is such a thing as a SEQUENTIAL GUID (Info here at MSDN), which lessens the impact – personally, I still wouldn’t bother.
- They take up four times more space than traditional INTs, which could be a problem if you have lots of indexes.
- Table JOINs are slower; SQL Server is optimised for joining tables together via simple integers.
There’s another (very important) reason not to use them that people tend to overlook: it makes debugging and tracking down errors incredibly painful! Incrementing numbers are intuitive, easy to memorise (if they’re small enough), easy to compare (“x+z” happened after “x”)… but GUIDs are just a ‘blob’ of data, there’s nothing intuitive about them.
How to use GUIDs, pain-free
It’s simple: add a GUID as a normal column and index it!
ALTER TABLE dbo.Person ADD COLUMN PersonUID UNIQUEIDENTIFIER NULL GO -- ...UPDATE the table to fill PersonUID here ... CREATE NONCLUSTERED INDEX IX_Person_PersonUID ON dbo.Person(PersonUID) GO
That’s as complex as it needs to be.
Sad note: In the past, I’ve had developers add GUIDs that looked like this:
00000000-0000-0000-0000-000000000001 00000000-0000-0000-0000-000000000002 00000000-0000-0000-0000-000000000003 ...
, thus completely missing the point of using a GUID in the first place. If you’re not auto-generating the GUIDs in-database, make sure to check devs aren’t putting daft data in. (Obviously something you should do anyway!)
Finally, we have our highly unique piece of information, stored in the best way possible. So now we can use that in a URL on a public-facing website to reference people’s data securely, no? Sorry, that’s not quite right — a little issue of this thing called Security Through Obscurity, which I’ll write about next time.