Posts Tagged clustering keys
It’s very common in the world of website development to have the database built by application developers, rather than a dedicated SQL developer. Indeed, that’s how I started: building portals for traditional media companies, plus the usual B2C and B2B sites. As website developers, we were expected to create our own databases.
One piece of creaky SQL advice that’s often spread around as gospel amongst developers, is that tables require a primary key that’s a single piece of information: usually an incrementing integer, or (shudder) a GUID. Every row has to be able to be uniquely identified by the content of a single column! … But it’s not true. Admittedly, it’s rarely the worst thing wrong with a table, but it is wasteful and quite annoying.
Let’s take a simple enough example, say an HR database that keeps a record of people’s previous job history. We need the Person record (“John Smith”), a Role (“marketing manager”), and an Employer (“IBM”). The bare-bones tables will look something like this:
CREATE TABLE dbo.Person ( PersonID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_Person PRIMARY KEY CLUSTERED ,FirstName VARCHAR(50) NOT NULL ,LastName VARCHAR(50) NOT NULL ,DateOfBirth DATE NOT NULL ) CREATE TABLE dbo.[Role] -- << I've used square brackets because 'role' is a SQL Server keyword. ( RoleID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_Role PRIMARY KEY CLUSTERED ,RoleName VARCHAR(100) NOT NULL ) CREATE TABLE dbo.Employer ( EmployerID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_Employer PRIMARY KEY CLUSTERED ,EmployerName VARCHAR(100) NOT NULL )
(I’ve filled the tables with dummy data: 1 million Roles, 1 million Employers, 10 million Person rows)
We now need a way of linking this information together, with some start and end dates. In the back of our mind, we know the most common query is going to involve asking for the records by PersonID.
There are 3 logical options:
Option (1) is the standard developer way of doing it. The table has a clustered integer primary key, every row gets its own number:
CREATE TABLE dbo.PersonEmployment_1 ( PersonEmploymentID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_PersonEmployment_1 PRIMARY KEY CLUSTERED ,PersonID INT NOT NULL ,EmployerID INT NOT NULL ,RoleID INT NOT NULL ,StartDate DATE NOT NULL ,EndDate DATE NULL ,CONSTRAINT FK_PersonEmployment_1_PersonID FOREIGN KEY (PersonID) REFERENCES dbo.Person(PersonID) ,CONSTRAINT FK_PersonEmployment_1_EmployerID FOREIGN KEY (EmployerID) REFERENCES dbo.Employer(EmployerID) ,CONSTRAINT FK_PersonEmployment_1_RoleID FOREIGN KEY (RoleID) REFERENCES dbo.Role(RoleID) )
After a couple of weeks of use, we notice it’s running slowly, so some bright spark puts an index on:
CREATE NONCLUSTERED INDEX IX_PersonEmployment_1_PersonID_StartDate ON dbo.PersonEmployment_1(PersonID, StartDate)
But what is that PersonEmploymentID column actually telling us? It’s not meaningful at an application level – you’d never display it to anyone, it’s really only there so the developer is happy that they can find the correct row for UPDATE-ing or DELETE-ing.
Option (2) does away with PersonEmploymentID. For the purposes of our example, a person only has one job at a time, so a better clustering primary key is one formed of the two columns that uniquely identify that slice of employment history: the PersonID and the StartDate.
CREATE TABLE dbo.PersonEmployment_2 ( PersonID INT NOT NULL ,EmployerID INT NOT NULL ,RoleID INT NOT NULL ,StartDate DATE NOT NULL ,EndDate DATE NULL ,CONSTRAINT PK_PersonEmployment_2 PRIMARY KEY CLUSTERED (PersonID, StartDate) ... (foreign keys, same as before) )
If we want to UPDATE or DELETE a record, we have to use two pieces of information (PersonID and StartDate) to get the row we need; but that’s ok, it’s a few extra characters of typing in the application.
Option (3) is a hybrid of the two: we genuinely can’t live without our incrementing integer primary key (option 1), but we’ll give in, partially, and make the clustering key from the composite information, same as option (2).
CREATE TABLE dbo.PersonEmployment_3 ( PersonEmploymentID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_PersonEmployment_3 PRIMARY KEY NONCLUSTERED ,PersonID INT NOT NULL ,EmployerID INT NOT NULL ,RoleID INT NOT NULL ,StartDate DATE NOT NULL ,EndDate DATE NULL ... (foreign keys, same as before) ) CREATE UNIQUE CLUSTERED INDEX IXC_PersonEmployment_3 ON dbo.PersonEmployment_3(PersonID, StartDate)
Let’s fill all three tables with the same data: 10 million rows, one job for each Person.
How much disk-space does each option consume?
|Option||Data size (MB)||Index size (MB)||Total size (MB)|
Options 1 and 3 use basically the same amount; they take up 77% more space than option 2, due to the extra data they use.
Now let’s query the data (with STATISTICS turned on):
SELECT * FROM dbo.PersonEmployment_1 WHERE PersonID = 1234567 SELECT * FROM dbo.PersonEmployment_2 WHERE PersonID = 1234567 SELECT * FROM dbo.PersonEmployment_3 WHERE PersonID = 1234567
Option 1 needs 6 reads, and has to employ a bookmark lookup – that is, the query engine finds the PersonEmploymentID from the index, but has to go back to the main PersonEmployment table to get the rest of the data we asked for. Options 2 and 3 only need 3 reads, and of course no bookmark lookup – because the data is already in the order we need to be able to fulfill our query.
It should be clear by now that the best option for us here is Option 2. It takes up less space, and is quickest for our most common query. In summary, say no to useless primary keys!