Archive for August, 2015
Here’s a common-enough scenario: I have a fairly complex query, over multiple tables, where one or more of the component tables are large (tens of millions of rows) and with multiple indexes that are almost certainly not optimal for what’s needed. My query relates to a specific and well-defined date-range, but even though there are explicit indexes over the date columns, it still runs too slowly.
Given that tuning queries isn’t really my day job – I just need to get the data out so I can get on with analysing it – I tend to go for the quick win. In most cases, putting ranges on the integer primary keys can dramatically speed up the query. To this end, I keep several ‘milestone’ tables in the database, that look something like this:
StartDate       ID
----------  ----------
2015-01-01    20017996
2015-02-01    20843583
2015-03-01    21529164
2015-04-01    22231934
2015-05-01    22981850
2015-06-01    23723066
...              ...
(The primary key on these milestone tables will be a date – for me, it’s always the first of a month, but there’s no reason you couldn’t have weekly or daily tables.)
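A minimal sketch of the DDL for such a milestone table (the table and constraint names are just illustrative):

```sql
-- Sketch only: a monthly milestone table, keyed on the first of the month.
CREATE TABLE dbo.MyMilestoneTable
(
	StartDate DATE NOT NULL
	,ID INT NOT NULL
	,CONSTRAINT PK_MyMilestoneTable PRIMARY KEY CLUSTERED (StartDate)
)
```

If the base table's IDs are BIGINTs, the ID column here should be a BIGINT too.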
So now, either I look in the milestone table first and manually adjust the range on the primary key(s) for my query, or I’ll JOIN directly to the milestone table.
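For the JOIN version, a sketch might look like this (the table and column names are assumptions, not from a real schema):

```sql
-- Sketch: use the milestone IDs to put a tight range on the clustered
-- primary key, so only March 2015's rows are scanned.
SELECT t.MyTableID, t.DateCreated
FROM dbo.MyTable t
JOIN dbo.MyMilestoneTable m1 ON m1.StartDate = '2015-03-01'
JOIN dbo.MyMilestoneTable m2 ON m2.StartDate = '2015-04-01'
WHERE t.MyTableID > m1.ID
AND t.MyTableID <= m2.ID
```

The optimiser can then seek on the primary key, rather than relying on a date index that may not be selective enough.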
To keep the milestone tables updated, there’s a SQL Server job that runs just after midnight on the first of the month to INSERT the latest row. It looks something like this:
INSERT dbo.MyMilestoneTable
(
	StartDate
	,ID
)
SELECT
	StartDate = CAST(GETDATE() AS DATE)
	,ID = MAX(t.MyTableID)
FROM dbo.MyTable t
JOIN dbo.MyMilestoneTable m
	ON m.StartDate = DATEADD(month, -1, CAST(GETDATE() AS DATE))
WHERE t.MyTableID > m.ID
AND t.DateCreated < CAST(GETDATE() AS DATE)
GO
And that’s it, that’s all there is to them. There’s also a nice side-effect: it’s pretty trivial to work out how fast data is growing, month-on-month, which can help with resource planning.
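That growth calculation can be sketched with a windowed LAG (available from SQL Server 2012 onwards); the table name is as above:

```sql
-- Sketch: rows added per month, derived from consecutive milestone IDs.
SELECT StartDate
	,ID
	,RowsAdded = ID - LAG(ID) OVER (ORDER BY StartDate)
FROM dbo.MyMilestoneTable
ORDER BY StartDate
```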
I’m acutely aware of the fact that I’m ignoring the root cause of the queries being so slow, but when you need the data now, this approach works.
PS I should ‘fess up here: I’d never put a name to them before writing this post, but “milestone tables” seems to fit.
An extended property is an arbitrary name/value string pair that you can add to any SQL Server object: a database, schema, table, column, constraint, trigger or index. Most resources suggest they are useful either for documenting your database (“here’s what this table is used for”), or for adding display-specific data to a column (e.g. a regex mask for a phone number field).
Let’s say I’ve a table called Car, defined as:
CREATE TABLE dbo.Car
(
	MakeID INT NOT NULL
	,ModelID INT NOT NULL
	,[Description] VARCHAR(255) NOT NULL
	,CONSTRAINT PK_Car PRIMARY KEY CLUSTERED (MakeID, ModelID)
	,CONSTRAINT FK_Car_MakeID FOREIGN KEY (MakeID) REFERENCES dbo.Make(MakeID)
	,CONSTRAINT FK_Car_ModelID FOREIGN KEY (ModelID) REFERENCES dbo.Model(ModelID)
)
I can add a piece of data (I’ve named it “Usage”) that explains where/how the Description field is to be used:
EXEC sp_addextendedproperty
	@name = N'Usage'
	,@value = N'Holds the paragraph of text that is displayed in the listings page, and on the mobile site.'
	,@level0type = N'SCHEMA', @level0name = 'dbo'
	,@level1type = N'TABLE', @level1name = 'Car'
	,@level2type = N'COLUMN', @level2name = 'Description'
GO
Click here for the full MSDN definition of sp_addextendedproperty.
I can get the values back out by using fn_listextendedproperty:
SELECT objtype, objname, name, value
FROM sys.fn_listextendedproperty
	(NULL, 'SCHEMA', 'dbo', 'TABLE', 'Car', 'COLUMN', 'Description')
GO
objtype  objname      name   value
-------  -----------  -----  --------------------------------
COLUMN   Description  Usage  Holds the paragraph of text ...
As mentioned previously, I can even add properties to a constraint. So why does my PRIMARY KEY look like it does..?
EXEC sp_addextendedproperty
	@name = N'Explanation'
	,@value = N'We used a composite primary key because, e.g., both Bentley and Lincoln have models called the Continental.'
	,@level0type = N'SCHEMA', @level0name = 'dbo'
	,@level1type = N'TABLE', @level1name = 'Car'
	,@level2type = N'CONSTRAINT', @level2name = 'PK_Car'
GO

SELECT objtype, objname, name, value
FROM sys.fn_listextendedproperty
	(NULL, 'SCHEMA', 'dbo', 'TABLE', 'Car', 'CONSTRAINT', default)
GO
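If the wording needs revising later, there are matching system procedures, sp_updateextendedproperty and sp_dropextendedproperty, that take the same level parameters; for example (the new value text is just illustrative):

```sql
-- Sketch: amend the existing 'Explanation' property on the primary key.
EXEC sp_updateextendedproperty
	@name = N'Explanation'
	,@value = N'Composite key retained; see the data architecture notes.'
	,@level0type = N'SCHEMA', @level0name = 'dbo'
	,@level1type = N'TABLE', @level1name = 'Car'
	,@level2type = N'CONSTRAINT', @level2name = 'PK_Car'
GO
```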
To see all the extended properties for the current database:
SELECT *
FROM sys.extended_properties
GO
class  class_desc        major_id    minor_id  name         value
-----  ----------------  ----------  --------  -----------  ------------------------------
1      OBJECT_OR_COLUMN  1269579561  3         Usage        Holds the paragraph of text...
1      OBJECT_OR_COLUMN  1285579618  0         Explanation  We used a composite primary...
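The major_id and minor_id columns aren’t very readable on their own; here’s a sketch of one way to resolve them back to object and column names:

```sql
-- Sketch: translate major_id/minor_id into object and column names
-- (class 1 covers objects and columns).
SELECT ObjectName = OBJECT_NAME(ep.major_id)
	,ColumnName = c.name
	,PropertyName = ep.name
	,PropertyValue = ep.value
FROM sys.extended_properties ep
LEFT JOIN sys.columns c
	ON c.[object_id] = ep.major_id
	AND c.column_id = ep.minor_id
WHERE ep.class = 1
```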
What other uses do you know of for extended properties? I’d love to hear from you in the comments below.
[Bonus tip while I’m thinking about documentation: Create a schema called ‘aaa’ (hence it’ll appear at the top of the list of tables in SSMS); add a table called aaa.README, and fill it with everything a fellow developer or analyst might want to know. Furthermore, it always helps to have a properly maintained changelog table, e.g. aaa.CHANGELOG, in addition to your usual change/control documentation.]
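By way of illustration, a minimal changelog table might look like this (the structure is only a suggestion):

```sql
-- Sketch: a bare-bones changelog table under the 'aaa' documentation schema.
CREATE TABLE aaa.CHANGELOG
(
	ChangeID INT IDENTITY(1,1) NOT NULL
	,ChangedOn DATETIME NOT NULL CONSTRAINT DF_CHANGELOG_ChangedOn DEFAULT(GETDATE())
	,ChangedBy NVARCHAR(128) NOT NULL CONSTRAINT DF_CHANGELOG_ChangedBy DEFAULT(SUSER_SNAME())
	,[Description] NVARCHAR(MAX) NOT NULL
	,CONSTRAINT PK_CHANGELOG PRIMARY KEY CLUSTERED (ChangeID)
)
```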
In my first proper post, I want to share with you the number one tip that (a) I wish I’d been told 20 years ago, and (b) I endeavour to tell as many people about as will care to listen. It’s very simply this: put a default create date on every row. In every table. In every database. No exceptions (ok, some exceptions, I’ll come to that).
Why should we do this? Because it’s one of the most fundamental questions you can ask of your data: When was it created? When did that customer sign up? When was that email scheduled? When did we start selling product ABC? But it also gives you information about the opposite: when did something not happen? E.g. the website went down, and no orders came through – how long did the outage last? With a default create date, you can answer that. (Also, it’s insanely useful for debugging, I don’t know how you’d cope without it.)
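For instance, an outage shows up as a gap between consecutive create dates. A sketch (the table and column names are assumed; LAG needs SQL Server 2012 or later):

```sql
-- Sketch: find gaps of more than 30 minutes between consecutive orders.
SELECT x.PrevCreatedOn, x.CreatedOn
FROM (
	SELECT CreatedOn
		,PrevCreatedOn = LAG(CreatedOn) OVER (ORDER BY CreatedOn)
	FROM dbo.[Order]
) x
WHERE DATEDIFF(minute, x.PrevCreatedOn, x.CreatedOn) > 30
```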
To add a default create date to an existing table called MyTable, you would run the following command:
ALTER TABLE dbo.MyTable
	ADD CreatedOn DATETIME NOT NULL
	CONSTRAINT DF_MyTable_CreatedOn DEFAULT(GETDATE())
If you’re creating a table from scratch using DDL (i.e. typing it into a query window, not using a wizard), you would include the line:
CreatedOn DATETIME NOT NULL CONSTRAINT DF_MyTable_CreatedOn DEFAULT(GETDATE())
Let’s pull that apart:
CreatedOn : Call it what you want: DateCreated, DateInserted, anything that you like (as long as it makes sense).
DATETIME NOT NULL : It’s of type DATETIME, and it can’t be blank (NULL).
CONSTRAINT DF_MyTable_CreatedOn : In SQL Server, defaults are declared as constraints; you don’t have to name your constraints, but if you want people to like you, you should. (I always start my default constraints with DF_.)
DEFAULT(GETDATE()) : GETDATE() is the built-in SQL Server function for returning the current date/time. DEFAULT() means “unless specified otherwise, this is what will be put in the column when you insert a row”.
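To see the default in action (MyTable and SomeColumn are just placeholders):

```sql
-- Sketch: CreatedOn isn't mentioned in the INSERT,
-- so the default supplies the current date/time.
INSERT dbo.MyTable (SomeColumn)
VALUES ('example')

SELECT SomeColumn, CreatedOn
FROM dbo.MyTable
```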
If you’re ALTER-ing an existing table to add a new column with a default date defined on it, then you have two choices:
- Define the column as NOT NULL-able. Every existing row is backfilled with the default as evaluated at that moment, i.e. the date/time at which you issued the command. Each new row will get the current date at the point it’s inserted, as you’d expect.
- Define the column as NULL-able. Every existing row will have a NULL, but each new row will get the current date. However, if you insert a row with CreatedOn explicitly set to NULL, you won’t get the default date, you’ll get a NULL.
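If you want the best of both, one approach (a sketch; the placeholder date is your call) is to add the column NULL-able, backfill the existing rows, then tighten the column up:

```sql
-- Sketch: add NULL-able, backfill existing rows with an agreed placeholder,
-- then make the column NOT NULL.
ALTER TABLE dbo.MyTable
	ADD CreatedOn DATETIME NULL CONSTRAINT DF_MyTable_CreatedOn DEFAULT(GETDATE())
GO
UPDATE dbo.MyTable
SET CreatedOn = '2000-01-01'  -- placeholder meaning 'pre-dates the column'
WHERE CreatedOn IS NULL
GO
ALTER TABLE dbo.MyTable
	ALTER COLUMN CreatedOn DATETIME NOT NULL
GO
```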
Exceptions: Nearly every time I thought I didn’t need a default create date, I was wrong. Lookup / type tables? Someone will always come up with another category; just because you can’t think of it now, doesn’t mean someone else won’t in the future. And then you’ll want to know when that new category was added. The only real exceptions I can think of are (a) tables that are fully under your control (e.g. I regularly extract XML into hierarchies of tables; I only care about the create date on the table at the top of the tree), or (b) situations where diskspace is extremely limited – a DATETIME takes up 8 bytes of space, multiply that by a few billion rows, and you may not have room… but in which case, you’ve bigger problems anyway!
What next? Well, I don’t only have default create dates; if it’s appropriate, I’ll also have a default create user. That is, what was the SQL username of the user responsible for inserting a row into the database?
CreatedBy NVARCHAR(128) NOT NULL CONSTRAINT DF_MyTable_CreatedBy DEFAULT(SUSER_SNAME())
, where SUSER_SNAME() is a built-in function that “returns the login identification name of the [current] user”.
Also, I like to keep track of when/who last updated a row:
UpdatedOn DATETIME NULL
,UpdatedBy NVARCHAR(128) NULL
, where these are maintained using a trigger:
CREATE TRIGGER dbo.tr_MyTable_Update
ON dbo.MyTable
AFTER UPDATE
AS
BEGIN
	SET NOCOUNT ON

	UPDATE mt
	SET mt.UpdatedOn = GETDATE()
		,mt.UpdatedBy = SUSER_SNAME()
	FROM dbo.MyTable mt
	JOIN Inserted i
		ON i.MyTableID = mt.MyTableID
END
Of course, this simple trigger isn’t a replacement for real auditing, it’s just the headlines: who last updated the row, and when? It doesn’t tell you what data they updated.
One final point: sometimes I see that the developer has put a CreatedOn column in the table, but left it without a DEFAULT. They plan to set the date from their own application code, and think this is sufficient. There are two problems with this:
- Any rows inserted outside of their code won’t get a date.
- The clock on the webserver (say) may not be synced to the clock on the database server. Potentially, the CreatedOn dates will be offset depending on which webserver is talking to the database. (Note: if you’ve got the capacity, I’m completely supportive of having a second DATETIME column that records the webserver’s timestamp…)
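If you do record both clocks, the column definitions might be sketched as follows (the names are illustrative):

```sql
-- Sketch: database-clock default plus an application-supplied timestamp.
CreatedOn DATETIME NOT NULL CONSTRAINT DF_MyTable_CreatedOn DEFAULT(GETDATE())
,AppCreatedOn DATETIME NULL  -- set by the webserver/application code, if provided
```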
Any thoughts or comments? Please let me know.
Hello, my name is Pete and I’ve been building and maintaining Microsoft SQL Server databases for nearly 20 years. At no point has my job title ever mentioned SQL or databases; I started off building websites that used SQL Server as a back-end, and have somewhat drifted over the years to become a statistical modeller and analyst, but I still spend a large portion of my day typing SQL commands into Management Studio. I’m no expert in any aspect of SQL Server (and certainly no DBA), but I’ve built up enough experience to feel comfortable blogging about how I use it: what works for me, and what doesn’t.
Hence, I don’t claim that anything I write will be definitive, authoritative, or the ‘best way’ of doing it. I’m always learning, and SQL Server itself is an evolving product.
I think I should also mention that I generally work on databases up to a few hundred gigabytes in size. These days, such sizes are not considered “a lot of data”. I’m well aware that methods that work for smaller databases (e.g. databases measured in gigabytes) will not always perform as well for large databases (multi-terabyte and above). Caveat lector, “let the reader beware”!