Posts Tagged gaps and islands

Inexact date grouping using islands

A few years ago at SQLBits, I was fortunate enough to attend a fascinating lecture given by SQL Server MVP Itzik Ben-Gan on Gaps and islands:

Gaps and islands problems involve finding a range of missing values (gaps) or a range of consecutive values (islands) in a sequence of numbers or dates.

Recently, I had to produce a dataset which showed how our interest rates varied over time, by product; for example, products A to E started at 10% through to 50%, but have been adjusted periodically to where they are today, a few points different. Practically, most of the changes have been made at or around the same time — but not exactly. For technical reasons, the specified rates aren’t stored in the database, so there’s no InterestRate table, or a parent table called InterestRateSet that links them together. However, the final result of an application is stored, so we know that a product has been sold, and what the corresponding interest rate was on that day.

The challenge was to work out how many sets of interest rates we’ve had since day 1; but because not every product is purchased every day, if we group by product/day, then it looks like our rates change more often than they do. This is where the ‘gaps and islands’ concept comes in, and luckily I remembered the lecture from a few years before. I found and tweaked some of Itzik’s code from a 2012 article Solving Gaps and Islands with Enhanced Window Functions (SQL Server Pro website) to accept a UDT (in this case, a list of dates). [See previous post, Passing structured data to stored procedures.]

Here it is:

-- Drop/Create our UDT:
IF EXISTS (SELECT * FROM sys.types WHERE is_table_type = 1 AND name = 'DateListType')
	DROP TYPE DateListType
GO
CREATE TYPE DateListType AS TABLE ([Date] DATE)
GO

-- Drop/Create our function:
IF OBJECT_ID('dbo.fn_GetIslandsFromDateList') IS NOT NULL
	DROP FUNCTION dbo.fn_GetIslandsFromDateList
GO
CREATE FUNCTION dbo.fn_GetIslandsFromDateList 
(
	@dateListType DateListType READONLY
	,@GapSize INT
)
RETURNS TABLE 
AS
RETURN 
(
WITH cte_Distinct AS
(
    SELECT
	  [Date]
    FROM @dateListType
    GROUP BY [Date]

), cte_Part1 AS
(
    SELECT
      [Date]
      ,CASE
        WHEN DATEDIFF(day, LAG([Date]) OVER(ORDER BY [Date]), [Date]) <= @GapSize
        THEN 0 ELSE 1 END AS IsStart
      ,CASE
        WHEN DATEDIFF(day, [Date], LEAD([Date]) OVER(ORDER BY [Date])) <= @GapSize
        THEN 0 ELSE 1 END AS IsEnd
    FROM cte_Distinct
)
, cte_Part2 AS
(
    SELECT
      [Date] AS RangeStart
      ,CASE
        WHEN IsEnd = 1
        THEN [Date] ELSE LEAD([Date], 1) OVER(ORDER BY [Date]) END AS RangeEnd
      ,IsStart
    FROM cte_Part1
    WHERE IsStart = 1 OR IsEnd = 1
) 
    SELECT
      ROW_NUMBER() OVER (ORDER BY RangeStart) AS ID
      ,RangeStart
      ,RangeEnd
    FROM cte_Part2
    WHERE IsStart = 1
)
GO

Some things to note:

  • dbo.fn_GetIslandsFromDateList is an inline function, which you can almost think of as ‘a view that takes parameters’ (a parameterised view).
  • You can use CTEs (Common Table Expressions) in inline functions. I love using CTEs, they can make the code very readable. Often, the parser turns them into standard sub-queries, so there’s no performance hit.
  • The @GapSize parameter controls how far apart our islands can be — see the examples below.
  • If you follow the code through, and break it down in to its component parts, you can see how it works — like all the best code, it’s very neat and compact.
  • To re-emphasise, this isn’t my algorithm, it’s Itzik Ben-Gan’s; I’ve done little more than re-format it for my own use.

Let’s feed some dates into our function:

DECLARE @myList DateListType
INSERT @myList([Date])
    VALUES('2017Feb01'),('2017Feb02'),('2017Feb03')
      ,('2017Feb06'),('2017Mar01'),('2017Mar02')

SELECT * FROM dbo.fn_GetIslandsFromDateList(@myList, 2)
GO

ID   RangeStart RangeEnd
---- ---------- ----------
1    2017-02-01 2017-02-03
2    2017-02-06 2017-02-06
3    2017-03-01 2017-03-02

With a @GapSize of 2, we get 3 ranges (islands). With a @GapSize of 3:

SELECT * FROM dbo.fn_GetIslandsFromDateList(@myList, 2)
GO
ID   RangeStart RangeEnd
---- ---------- ----------
1    2017-02-01 2017-02-06
2    2017-03-01 2017-03-02

, we get 2 ranges, because the difference in days between 2017-02-06 and 2017-02-03 is less than or equal to 3.

So this code did the trick, and allowed us to work out exactly how many different sets of rates we’d actually had live. Yes, we could’ve worked it out by hand; but now we’ve got some reproducible code that can drive various different reports, that’ll show us exactly how our changes have affected the business.

A final thought: Quite often, solving a problem comes down to just knowing the right phrase to google!

Advertisements

, , ,

Leave a comment