Two subtitles at the same time

This isn’t one of my typical posts about data, maths or statistics, it’s related to one of my recent diversions – learning the Welsh language. I’m aware this post might be a little too technical and/or fiddly for many non-IT people, but hopefully it’ll be useful for someone somewhere trying to learn a new language.

In November 2021, I started learning Welsh on Duolingo; partly because everyone else seemed to have started learning a language in lockdown and I didn’t want to be left out(!), and partly because I thought it’s something my late Gran (from Pontardawe, a few miles north-east of Swansea) would’ve appreciated. Thanks to her, I’ve been using the phrase ‘Ych a fi!’ (meaning ‘yuck!’) since I was little!

After about a year, I’d completed all the lessons on Duolingo*, so I started watching Welsh-language shows on S4C, the Welsh-language TV channel here in the UK. S4C programmes are also available via its Clic app and website, and also via BBC iPlayer. Currently, iPlayer only has English subtitles available for Welsh-language programmes; while Clic has English subtitles for some programmes, and Welsh for a smaller number.

(*) Welsh is a shorter course than many others on Duolingo

In order to help with my learning, what I wanted was a way of displaying both English and Welsh subtitles at the same time while watching a programme. Below I set out how I achieved that. I’m going to demonstrate using a new show “The Story of the Language” (in Welsh, “Stori’r Iaith”), as that has Welsh subtitles available via S4C Clic.


From hereon, I’m using the Windows operating system, but everything should work in an almost identical way on a Mac or Linux machine.

Downloading the programme

The software I’m going to use to display both sets of subtitles relies on me having a copy of the show on my computer. I’m a long-time user of the marvellous get_iplayer app, which you can download here: www.squarepenguin.co.uk/downloads. It’s a command-line app, but very simple to use. You might want to look at the download guide here, which gives an overview of how to use the app.

NB: I occasionally see questions about whether it’s illegal to use such software, but I’ve never seen any other response than “Yes, it’s completely legal”; I think you’re supposed to delete the files from your computer 30 days after downloading them, and get_iplayer will automatically delete video files it believes are past any such licensing date.

The programme we’re interested in, Stori’r Iaith, is here; the description is “Famous faces go on a personal journey to explore the history of the Welsh language.” The first episode is here, and features presenter Sean Fletcher.

Stori'r Iaith: Sean Fletcher

Quick aside: I first learned about Stori’r Iaith in my Welsh learner’s magazine, Lingo Newydd, highly recommended!

In order to use get_iplayer, we need the PID, the programme identifier. Fortunately, it’s contained in the URL for the episode, and in this case it’s p0dylrx4. After installing get_iplayer, I can paste the following into a command prompt (sometimes also known as the DOS prompt):

get_iplayer --subtitles --tvmode=hlshd,hvfhd,dvfhd,hvfxsd,hvfhigh --pid=p0dylrx4

The argument --subtitles says that we’d like to download any available subtitles. The --tvmode argument is rather too complicated to go into here, but this is basically asking for the best possible version of the video, in terms of resolution and bit-rate.

On my very slow broadband, it takes about 10 minutes to download the episode, and we end up with two files:

Storir_Iaith_-_Storir_Iaith_Sean_Fletcher_p0dylrx4_original.mp4
Storir_Iaith_-_Storir_Iaith_Sean_Fletcher_p0dylrx4_original.srt

The .mp4 file is 592 MB in size, and can probably be played in any video player software you have on your computer. The tiny .srt file is the subtitle file; it’s a text format, you can open it in any text editor and see the subtitles.


Playing the video

I play all my video files using an app called Media Player Classic Home Cinema (MPC-HC), which you can find here. As the website says, it hasn’t been developed or supported for a number of years, but it’s always worked well for me. An alternative player I often find useful is VLC media player, videolan.org.

So, we have the video (mp4) file, and the English subtitles, now to get the Welsh subtitles.


Welsh subtitles

It’s more fiddly to get the Welsh-language subtitles, but it’s easy when you’re used to it. We need to go to the episode on the S4C Clic site, which is here: www.s4c.cymru/clic/programme/808323693.

Now whilst on that page, in your browser (I use Chrome), open the Developer Tools (probably Ctrl-Shift-I) and go to the Network tab. Hit play on the video. In the Network tab, you’ll see files load into your browser – you’re looking for a file named C_{something}.vtt. You can filter the files: in the ‘Filter’ box, type ‘vtt’, and you should see something like the below:

The .vtt file beginning with ‘E’ is the English subtitles, ‘C’ is the Welsh subtitles (‘C’ for ‘Cymraeg’, meaning ‘Welsh’). Right-click on the ‘C’ file, and select ‘Open in new tab’. A ‘Save As…’ dialogue box will pop-up, you can then save the file somewhere for use in the next stage.

Converting the subtitles

Unfortunately, the subtitles from S4C Clic aren’t in a format we’re going to be able to use immediately. Luckily, there’s a website out there that lets us convert the file into a format we can use, very easily: happyscribe.com/subtitle-tools/subtitle-converter

On that webpage, simply click the ‘Select File’ button and choose the .vtt file you saved in the previous step, select ‘.srt’ as the output format, then click the ‘Convert subtitles’ button.

You’ll then be prompted to save the new .srt format file, which we’ll use in the next step.


The plugin

In order to load and play both subtitle files at the same time, Media Player uses a plugin (filter) called VSFilter, which you can find here. The version I downloaded was VSFilter_2.41.322_x64.exe (857 Kb).

When the filter has been installed, you can go ahead and open the mp4 video file in Media Player. The English subtitles will be picked up automatically, as the subtitle file is in the same folder, and has the same name (only the extension is different). To add in the Welsh subtitles, right-click anywhere on the video window, select Filters / VSFilter (auto-loading version) / Properties, and then on the default ‘Main’ tab, click the ‘Open’ button, navigate to the C_A860022577.srt file downloaded from happyscribe, click ‘OK’, then you can play the video and both sets of subtitles will appear, like so:

And that’s it!

I deliberately chose this scene, as it features Stephen Owen Rule, known as Doctor Cymraeg on twitter, who very kindly helps out the Welsh learners there by answering questions and giving tips. His website is doctorcymraeg.wales, and he has written several books for Welsh learners that are available at e.g. Amazon.


A final note: if you’re learning French and have a VPN, you may be able to download video files from france.tv, along with French subtitles, using yt-dlp, a fuller-featured fork of the popular youtube-dl. The corresponding English subtitles may be available on opensubtitles.org, then as above, you can use MPC and VSFilter to watch both sets at the same time. Caveat: As far as I’m aware, it’s legal to do this; if you know otherwise, please let me know.

Advertisement

, , , , , , , , , ,

1 Comment

Things That Could Have Been (Part Four): Balance viewer

This is part four, the final part talking about four mini-projects that nearly made it. Part One, “Process flow and DiagrammeR” is here. Part Two, “Better credit file summaries” is here, and Part Three, “The Daily Journal” is here.


Balance Viewer

Two of the loans companies I worked for were traditional start-ups, and in that spirit, they didn’t purchase third-party software to run their customer accounts — they ‘rolled their own’. (Having subsequently been exposed to some of these third-party products, I can quite understand why; I think the politest thing I can say is that I’ve been ‘underwhelmed’ by them.)

Within these account systems, you have transactions, mainly payments out (the initial loan), interest, payments in (card, direct debit, etc.), and occasionally charges for late payment. There are also reversals, refunds, write-offs, a whole menagerie of types, some quite esoteric.

Alongside the transactions, you need to keep a track of the outstanding balances; the Total (how much the customer owes), and (usually) three sub-balances relating to the principal (how much is left of the original loan), interest, and charges. The three sub-balances add up to the Total.

When a transaction is added, it might have an effect on any of the sub-balances: e.g., if a customer makes a payment of £100, and their charges sub-balance is £20, then that is paid off first, and there is only £80 left for the interest sub-balance, and then finally the principal.

At one company, we didn’t maintain historic values for the all the sub-balances; when we got support requests for these figures, they had to be calculated manually. Not a complicated piece of work, usually 10 minutes or so with Excel.

As the volume of these requests increased, we realised we needed a better solution. I wrote some code to mimic the Excel calculations, and an overnight process rebuilt the historical balances and sub-balances for all live loans. The next step was to make this data available to staff in the appropriate departments.

What it would’ve looked like

The output was (probably) going to be a an Excel sheet, like the following: (It was rather wide, so I’ve split it into four narrower pieces to display it here)

NB: The ‘ROLL’ column on the end was the rolling balance (i.e. simple adding and subtracting of transaction amounts, without reference to sub-balances). It was used as sense-check, the Total outstanding balance and the rolling balance should always be identical.

What was good about it?

In common with the three previous posts, it’s all about exposing and/or generating information in a helpful way. This viewer made it explicit how balances were calculated, should the business, a customer or auditor require the fine detail about an account.

For around 90% of customers, the transactions and balances were simple: a single ‘payment out’, some iterations of ‘interest applied’ followed by ‘payment in’, and that was it. For the other 10%, their accounts contained less common transaction types, e.g. chargebacks, refunds, write-offs. Now, it’s a fundamental precept with accounts: the transactions and the order they happen in are immutable: you can’t delete an incorrect or failed transaction, nor can you move it to a different date. This means that if, say, a card payment that had cleared interest and charges is subsequently charged back, the sub-balances may be incorrect. (The total outstanding will be correct, but going forward, interest may be missed.) We could easily see that here, and issue adjustments if necessary.

In reality, probably the main use for this viewer would’ve been spotting incorrect transactions; we’d occasionally see mis-typed transaction amounts and references, and it wasn’t uncommon to see manual refunds that could’ve been write-offs, etc. And anyone that’s dealt with automated transaction files from banks and similar large financial organisations knows how, er, grimy the data can be!


How close was it to being used? All the underlying data was available, in our ‘clean’ database (see previous posts), all it needed was for the output to be somehow presented in the front-end of the customer administration system. What form that presentation would’ve taken, I don’t know — we didn’t get that far. It could’ve just been the Excel output, or it could’ve been something fancier in HTML/javascript*.

* I remember now(!), I’d previously spec-ed out a small database-driven system for some simple reporting: basically, it would paste data into an Excel spreadsheet, run a custom macro (to tidy up layout and formatting), then deliver the output to the browser, or send it via email. We definitely could’ve used that for the Balance Viewer.


That concludes this mini-series of posts, I hope it was at least partially interesting to someone!

, , ,

Leave a comment

Things That Could Have Been (Part Three): The Daily Journal

This is part three of four mini-projects that nearly made it. Part One, “Process flow and DiagrammeR” is here. Part Two, “Better credit file summaries” is here.


The Daily Journal

In more than one finance company I’ve worked at, the Credit Risk and MI/BI departments are expected to be all-seeing and all-knowing, and they will have input into every aspect of the business*. Which means they need lots of data, and lots of reports, charts and figures.

* It’s not unusual for the Head of Credit Risk to be more of a hands-on, operational role, overseeing every part of the process in detail — authorising website changes, training collections teams, liaising with external auditors, etc.

The credit risk department is expected to create a monthly pack showing (and commenting on) all sorts of stats, from basic “number of applications”, “proportion of apps paid out”, “total money in / out”, down to the fine detail of how each variable in a scorecard is performing. Several years ago, while I was creating this monthly pack for the company I worked for, I realised that (a) it could be largely automated — after all, my scorecard build documents are automated (using the same tools); and (b) if it’s automated, you could have a version that included much more information.

Thus was born The Daily Journal: a very large document that contained hundreds and hundreds of graphs showing the performance of pretty much every credit risk metric — from basic stuff like application rates and maturity rates, to more niche measures like population stability index values for individual scorecard variables; shown over various time periods (e.g. ‘Last 30 days’, ‘Last 3 years’), divided into as many sub-populations as made sense (e.g. new customers vs old, homeowners vs renters, married vs single, by age range, etc.)

Basically, the Journal was to be a data bible — anything someone could reasonably want to know about the business’s operational performance, it’d be in there somewhere! So I started putting it together…

Examples

The last version of the document in development ran to over a hundred pages. Below are a few example graphs from that document:

Note that there were only 6 or so basic types of graph needed.

Benefits

At first thought, a huge document like this might seem like overkill. But there were several obvious benefits:

  1. Historics. Sometimes, it’s not easy to generate data ‘as it looked in the past’. For one, most dev teams don’t factor this into their database builds. Also, code changes all the time, and it can be nigh on impossible to work out how a system operated X years ago. But with the Journal, we could look at old ‘editions’ and see exactly how things were. (And of course, fixed documents can’t change, which you can’t say about live reports)
  2. It’s somewhere to keep important milestones/dates, and they could be automatically added to graphs. E.g. the release of new scorecard, acquiring a new marketing partner, launch of a new application system, etc. (That’s what the vertical dotted lines are in the examples.)
  3. Similarly, manual commentary can be included. Either by using ‘include’ files, or straight from a database.
  4. It could be an incredibly useful introduction for people new to the business, getting them talking about how things work, and even suggesting additions to the document.
  5. I wanted the document to be available to the wider business, so anyone who needed to could access it and answer their own questions. Not everyone’s happy with tools like Power BI, but anyone can open a PDF.

How it was built: Sweave

I built the document using Sweave (R Studio: Using Sweave and knitr), which allows you to build documents by embedding R code within \LaTeX (a venerable typesetting language, much used in academia — see here). Because so much of the code was automated (and we were using our nice ‘clean’ database, see previous posts), and the graphs were all very similar, it didn’t take much effort to generate – it was just a matter of coding “all the levels of variable A over time, split out by variable B (and maybe C)”.

Here’s some pseudo-code to help get the point across:

dataFrame = CreateDataFrame("SELECT CustomerType, AgeBand, [Day], COUNT(1) AS Total
	FROM app.Application WHERE PaidOut >= '2019Jan01'
	GROUP BY CustomerType, AgeBand, [Day]")

For(customerType in levels(dataFrame.CustomerType))
For(ageBand in levels(dataFrame.AgeBand))
	Print "\section{Customer: " + customerType + ", Age: " + ageBand + ")"
	DrawGraph subset(dataFrame, (CustomerType = customerType) & (AgeBand = ageBand))
Next
Next

The above code could generate dozens of graphs.

Arguments against

Not that I got to the point where people could criticise the need for it, but pre-empting some arguments anyway…

Couldn’t you achieve the same results using Power BI?
Sort of. I saw the Journal as a companion to the live reports — it represented the past, whereas most reports are largely about the present and/or recent past. (Also, as mentioned above, sometimes it’s just easier to open a PDF.)

Wasn’t navigating around the document a pain?
\LaTeX builds tables of contents automatically, plus setting up links to jump around PDFs is easy. And LaTeX is so flexible, you could colour-code pages/tabs. If the document truly got too unwieldy, it’d be simple to set up a separate build process that had a curated selection of the pages, maybe a document for each department.

Surely not every graph or cross-tab is interesting?
Of course, and it’s trivial to exclude uninteresting variations.

Wasn’t the file huge?
Not really: under 5 MB. It was all text and embedded PNGs. (I used to work at a company that treated 250 MB Excel spreadsheets as ‘the norm’, so…)

Didn’t this take ages to make?
No, I think it took a few days, most of which was faffing around with colours, line widths and getting legends in the correct place on the graphs! The underlying data already existed, correctly indexed, this was just churning out graphs (and simple tables) in loops. The document was really just an expanded version of the monthly credit risk report — nearly all the ‘grunt work’ had been done already. Plus, it also covered quite a few recurring ad-hoc requests for stats.


How far did I get? About 95% of the way to the initial release. More urgent work got in the way, and it fell by the wayside. The usual!

Future plans: Lots more credit risk-related graphs and tables. As mentioned above, including manual commentary on graphs, and possibly department specific versions if required.

, , , , , , ,

Leave a comment

Things That Could Have Been (Part Two): Better credit file summaries

This is part two of four mini-projects that nearly made it. Part One, “Process flow and DiagrammeR” is here.


Better credit file summaries

At the time we were using their product, a certain credit bureau (that shall not be named) charged extra for the additional ‘chunk’ (let’s call it that from hereon) of data that was the summarised version of a credit file. It had a few hundred data points, compared to the thousands within the credit file. For example, the credit file might list the 5 unsecured loans an applicant had, 3 live, 2 settled; in the chunk were variables related to this, e.g. NUMBER_TOTAL_LOANS = 5, NUMBER_LIVE_LOANS = 3, NUMBER_LOANS_IN_DEFAULT = 1, etc. This chunk was used by the bureau themselves to build their credit score(s). (In theory, given enough credit files, you could re-engineer their score…) Also, as is common, I used the chunk data to build our applicant accept/reject scorecards.

For our company, there were two problems with the chunk:

  • It cost money! The amount of credit files we saw, it added up to many thousands of pounds a month.
  • The data in the chunk wasn’t updated at the same time as the main credit file data — there was a lag.

Particularly in sub-prime lending, when changes to an applicant’s credit file can be frequent, this meant the data was occasionally out of sync. For the majority of applications, this wasn’t a problem – the credit file was an obvious pass or fail. But for cases ‘on the cusp’, it caused back-and-forth between the underwriting and credit risk departments. “Hey Pete, your scorecard has approved this applicant, but they’ve got too many delinquent loans, how has this happened?” This situation was inevitable — the applicants were being scored (by my scorecard) on data that was potentially different to that seen on our internal credit file viewer, as used by the underwriters.

So, I worked on a side-project to generate our own version of the chunk:

Phase 1: Recreate (and understand) the existing variables

I back-engineered (most of*) the existing variables from the bureau. A number of issues were immediately obvious:

  1. The documentation of the variables supplied by the bureau was often incomplete and/or vague.
  2. The raw data wasn’t always consistent; for example, a loan was marked as settled, but didn’t have an end date. Some chunk variables used loan status to determine whether a loan was settled, some variables used the end date.
  3. A non-zero number of the variables were just wrong; one variable was supposed to be “Number of X within 6 months” but it was actually measuring 5 months!

* It was impossible for a few variables, the bureau didn’t expose all the underlying data.

Phase 2: Create and add our own new variables

Given my experience of building scorecards, I had ideas about what aspects of the data were likely to yield predictive summary variables. For instance, I knew that variables relating to the composition of the applicant’s current and previous households had been a decent predictor in other scorecards.


How far did we get? About 85% of Phase 1.

I’d started talking to the developers about how to get this into the production pipeline; my code was all SQL, which would’ve worked, but it felt like the wrong tool for the job; I wanted a tight, single-purpose piece of code that could generate our version of the chunk in milliseconds. (I could then apply it to all prior credit files we had in the system.) This was perfectly achievable; but then for phase 2, I wanted a sub-system flexible enough so that I could add new variables without having to modify existing C# code and re-deploy. Oh, and it also needed to be tightly versioned.

Although we talked about it, we didn’t manage to come up with a solution that everyone was happy with. I knew it’d be possible using XSLT (XML transforms), but due to the complexity of credit files, the transform XML would have been ugly to read and hard to maintain. (Actually, as I write this, I realise that maybe two XSLTs in series could’ve done the job elegantly… Hmm!)

, , , ,

Leave a comment

Things That Could Have Been (Part One): Process flow and DiagrammeR

You’ll be unsurprised to read that not everything I’ve worked on has made it into a live production environment. It happens — projects are abandoned for reasons of resource, time or requirement, and every seasoned developer is used to it. In this post, and the next three, I’m going to present four mini-projects that were the closest to making it ‘into live’. One day, I hope I’ll get the chance to resurrect them in some form.


Process flow and DiagrammeR

Nearly every company I’ve worked at in the past 20 years had, or was in the process of building, (*drum roll, deep resonant voice*) The Funnel. It’s a completely obvious concept: at the top of the funnel, start with all your potential customers (people you market to/contact, or applications, say). Then for each step of the process, show how many customers are left at each stage.

For a loans company, we’d have something like:

Clearly, this is only a very high-level overview, and by design leaves out fine details.

Here’s a slightly more interesting example on the same theme:

This shows us the process flow (aka workflow) for the initial part of the system, from application, via credit check, to the Affordability stage. This sort of flow diagram isn’t a replacement for the funnel, it’s an expanded, more detailed, view.

I made the above example using a wonderful R package called DiagrammeR, which is very straightforward to code for. More visual examples can be found here.

Real-life workflows

Below are two screenshots of actual workflows, for new and existing customers, taken from a real-life lender. The first stage is application, the last is payout. As you can see, the flows are quite complex; when I showed them these pictures, management were rather surprised at how complex the workflow had gotten.

Both diagrams were generated automatically using the DiagrammeR package. The total amount of code required was surprisingly small: a couple of dozen lines of SQL, pulling from our ‘clean’ database*, and a few dozen lines of R. The intent was to generate these diagrams (and various department-specific versions) on a schedule, and distribute them around the business, in order to:

  1. Get a better understanding of how applications moved through the business
  2. Help explain to non-technical staff what was going on ‘behind the scenes’
  3. Look for opportunities for improvement to the application process
  4. Keep an eye out for potential issues

In short, it made the invisible visible, which could only be a good thing.

* We had a database that was a cleaned-up and nicely re-organised version of the live data. It wasn’t a warehouse, as such, but it made common data requests much simpler.

 

, , , ,

Leave a comment

SQL Server Agent broke (and how caching helped)

SQL Server Agent broke!

NB: The first part of this post doesn’t apply to Azure SQL Server, as it doesn’t have SQL Server Agent. Instead, it has Elastic Job Agent, which (at time of writing) only has a limited amount of functionality.

One of the database servers I work on is old (as is the system it runs) but it’s pretty busy – tens of thousands of new records every minute, all day, every day. Because it’s old, all the reporting is served from the same database server; far from ideal, but anyone who does what I do is used to that by now.

Now, due to $AWFUL_REASONS, the system clock on the server was put forward in time by several weeks, then a few minutes later, put back to the correct time. As you can imagine, this caused a BIG problem, and the invention of multiple new swearwords.

For us, the main two issues it caused were:

  1. For the few minutes where the clock was incorrect, data now had wildly incorrect timestamps. Hence, all the reports that used that data were now wrong, to some degree. And that was A LOT of reports.
  2. All the SQL Server Agent jobs stopped! The jobs mainly fall into two categories: (a) email alerts (e.g. “Too few records in the last hour from Company A!”) and system checks (e.g. “Data drive X is nearly full”), and (b) aggregating data down for reports.

#1 was a pain, but correctable (and more on that in the second part of this post). #2 took me a lot of googling, then a couple of boring hours to fix — here’s what I did:

Fixing the agent jobs

First, note that SQL Server stores all the data relating to Agent Jobs in the msdb database; see here for an overview: SQL Server Agent Tables (Transact-SQL). The table we’re most interested in is sysjobschedules; here’s the first few rows from the table on a different server:

SELECT * FROM msdb..sysjobschedules

The next_run_date and next_run_time store the next date/time the job is scheduled to run (in a slightly funky format, i.e. not native SQL datatypes like DATE or DATETIME). The ones in the example above are fine, but ours were all several weeks in the future.

How to reset the job schedules, so the next_run_date and next_run_time were correct? In a nutshell, none of the suggestions I found via Google worked; they were:

Clearly, these solutions did work for some people, but not for us. In fact, I found only one thing worked: Going into every single job, deleting the existing schedule, and recreating it from scratch. Very boring! I don’t like to publish blog posts where I admit defeat, but that’s life, and sometimes you have to do the annoying thing.

One final note on this: I suspect that using SQL to DELETE and re-INSERT the records in sysjobschedules might have worked, but I haven’t tested that. One useful tip: if you’re recreating the schedule using the GUI, copy the Summary / Description from the bottom of the dialogue box and paste it elsewhere; when you re-enter the data, you can paste the same text out again and compare – they should be identical.


Aggregated data and caching

As I mentioned above, some of the scheduled jobs (for older parts of the system) were responsible for aggregating data down for reports; e.g., at 5 minutes past every hour, go and aggregate all the records from the previous hour. The tables that stored this aggregated data all had to be run/updated/massaged manually.

However, in part of the system that I’d recently built, the aggregated data is cached ‘on-the-fly’, such that (a) it only aggregates the data it needs, and (b) if data is deleted from the cache, it’s automatically recreated the next time it’s needed. So after I had fixed the data in the live tables, all I had to do for this newer part of the system was delete the cached records, and the code picked up the corrected data next time it was needed. To be honest, I hadn’t built the system with this purpose in mind; I’d built it purely to stop unnecessary data reads from heavily-used live tables. Albeit unintended, this was a nice side effect, that probably saved me an extra few hours of work.

Below is a diagram of how the caching system works:

To re-iterate: the whole caching system runs on the (very busy) production server, so it has to be efficient enough not to cause extra load.

The system also handles the data for alerting, as well as traditional reporting. Because the data might not be contiguous (e.g. “Compare the last hour with this time last week”), the system works on lists of date/time ranges, down to a resolution of 15 minutes (as that’s all we need). These are stored in user-defined table types, and passed around via table-valued parameters. (Oh, and the system makes lots of use of milestone tables to restrict primary key ranges to just the records we’re interested in.)

Hopefully in a future post I’ll get into some more detail about the alerting system that uses this cached data.

, , , ,

Leave a comment

‘You Can’t Always Get What You Want’ – old tech to the rescue!

On a recent project, I had to compare my interpretation of some data, to an existing report on the intranet — however (for various boring reasons), I didn’t have any access to the code behind the report; all I could do was view the report for given input parameters. I knew that, in some cases, the report gave different results to my code, but I didn’t know why. What I needed was many, many instances of the report’s output, stored as data I could query, so I could do a proper evaluation of the differences.

The way the report was coded meant that I could get at the important parts (i.e. none of the site furniture or navigation) via a straightforward URL. This gave me the HTML that made up the report – in other words, the numbers I wanted, plus some markup that needed to be stripped out. So all I had to do, was issue lots of GET requests, parse the numbers I wanted out of the response HTML, then store them in the database.

Regular readers of this blog will know that I like to do as much of the work as possible in SQL Server, and fortunately, there’s a way of issuing POST/GET requests in SQL, using the sproc sp_OACreate (documentation here), which lets you create an instance of an OLE object, and the component MSXML2.ServerXMLHTTP (documentation here) for creating HTTP connections. I have to say, this is old tech — I think I first used this sproc and component (or one like it) at least 10-15 years ago. I’m fairly certain there’s no direct equivalent in Azure SQL Server, although of course there are other new-fangled ways of achieving the same goal!

I’m not going to reproduce my exact code here; most of it was related to parsing the numbers out of HTML with dozens of REPLACE() function calls. (I tried converting the HTML to XML first, but that caused more problems than it solved!) But to demonstrate the MSXML2.ServerXMLHTTP component, I’ve come up with a use case that people might be familiar with: RSS feeds. (Again: old tech, but still very much in use.)

What I’m going to present is code that (a) gets the latest RSS feed data from a couple of different sites, then (b) stores the results.


Set-up

None of this is going to work if you can’t instantiate (create) the MSXML2.ServerXMLHTTP component on your instance of SQL Server. There’s a very good chance that your server has ‘Ole Automation Procedures’ disabled, for security reasons — which means you can’t run the sproc sp_OACreate. Depending on your environment-slash-DBA, you may not be able to get this enabled — but I pretty much always have a local installation of SQL Server, precisely for situations like this. (Perhaps your DBA would be amenable to allowing it on a dev server?)

To see if ‘Ole Automation Procedures’ are enabled, you can run the following code:

DECLARE @object INT
EXEC sp_OACreate 'MSXML2.ServerXMLHTTP', @object OUT

If you get a message like the following:

Msg 15281, Level 16, State 1, Procedure sp_OACreate, Line 1 [Batch Start Line 0]
SQL Server blocked access to procedure 'sys.sp_OACreate' of component 'Ole Automation
Procedures' because this component is turned off as part of the security configuration
for this server. A system administrator can enable the use of 'Ole Automation Procedures'
by using sp_configure. For more information about enabling 'Ole Automation Procedures',
search for 'Ole Automation Procedures' in SQL Server Books Online.

, then it’s not enabled. To enable ‘Ole Automation Procedures’, I did the following:

EXEC sp_configure 'show advanced options', 1
GO
RECONFIGURE
GO
EXEC sp_configure 'Ole Automation Procedures' -- to view the current settings
GO
EXEC sp_configure 'Ole Automation Procedures', 1 -- 1 = enable, 0 = disable
GO
RECONFIGURE
GO

You can read about sp_configure here.

If you run EXEC sp_OACreate… again, and everything’s ok, you’ll get the message Commands completed successfully.


RSS feeds

Below is the DDL for the tables for my little demo RSS application. Note that RSS feeds return XML.

I’ve configured three feeds: two BBC and one Yahoo. You can go here to read about BBC RSS feeds and find some example URLs, and similarly here for Yahoo.

CREATE SCHEMA rss
GO

CREATE TABLE rss.[Source]
(
	SourceID INT NOT NULL CONSTRAINT PK_rss_Source PRIMARY KEY CLUSTERED
	,SourceName VARCHAR(100) NOT NULL
	,SourceURL VARCHAR(MAX) NOT NULL 
	,CreatedOn DATETIME NOT NULL CONSTRAINT DF_rss_Source_CreatedOn DEFAULT(GETDATE())
)
GO

INSERT rss.[Source](SourceID,SourceName,SourceURL)
VALUES
 (1, 'BBC Top Stories', 'http://feeds.bbci.co.uk/news/rss.xml')
,(2, 'BBC Technology', 'http://feeds.bbci.co.uk/news/technology/rss.xml')
,(3, 'Yahoo Finance', 'https://finance.yahoo.com/news/rssindex')
GO

CREATE TABLE rss.Response
(
	ResponseID INT NOT NULL IDENTITY(1,1) CONSTRAINT PK_rss_Response PRIMARY KEY CLUSTERED
	,SourceID INT NOT NULL
	,ResponseData XML NOT NULL
	,CreatedOn DATETIME NOT NULL CONSTRAINT DF_rss_Response_CreatedOn DEFAULT(GETDATE())
)
GO

So with these tables created, all we need to do now is loop over the records in rss.[Source], use the MSXML2.ServerXMLHTTP object to call the URLs, and store the results.

DECLARE @status INT
DECLARE @Response TABLE(ResponseText NVARCHAR(MAX))
DECLARE @HTTPObject INT
DECLARE @SourceURL NVARCHAR(1000)

DECLARE @SourceID INT = 1, @MaxSourceID INT
SELECT @MaxSourceID = MAX(SourceID) FROM rss.[Source]

EXEC sp_OACreate 'MSXML2.ServerXMLHTTP', @HTTPObject OUT

WHILE (@SourceID <= @MaxSourceID)
BEGIN

	DELETE FROM @Response

	SET @SourceURL = NULL

	SELECT
			@SourceURL = SourceURL
		FROM rss.[Source]
		WHERE SourceID = @SourceID

	EXEC sp_OAMethod @HTTPObject, 'open', NULL, 'GET', @SourceURL, 'false'
	EXEC sp_OAMethod @HTTPObject, 'send'
	EXEC sp_OAGetProperty @HTTPObject, 'status', @status OUT

	INSERT @Response(ResponseText)
		EXEC sp_OAGetProperty @HTTPObject, 'responseText'
	
	INSERT rss.Response(SourceID, ResponseData)
	SELECT
		@SourceID
		,ResponseData = CAST(
			REPLACE(ResponseText, 'encoding="UTF-8"', 'encoding="UTF-16"')
		 AS XML)
	FROM @Response

	WAITFOR DELAY '00:00:00.5'

	SET @SourceID = @SourceID + 1

END

EXEC sp_OADestroy @HTTPObject
GO

Things to note in the code above:

  • It’s not production-ready code! If it was, it’d have error-handling!
  • The raw response text is INSERTed into a table variable, @Response.
  • This article at ryadel.com explains why we use a table variable and don’t just put the response into a string. SQL Server has some weird quirks like that; e.g., it complains if you try and INSERT more than a 1000 records directly using VALUES(…), but if the VALUES() statements are combined using a SELECT sub-query first, it’s quite happy!
  • The response text is converted (CAST) to XML for storage. SQL Server complains about the XML if it’s encoded as UTF-8, so we have to change that to UTF-16 first.
  • I’ve put a delay (WAIT FOR DELAY) of half a second in between calls, just so I don’t overload the server. It doesn’t really matter here, I’m running it on my own server, but often when I’m running loops like this, it’s on a production server, and I’ve found it good practice to stick a delay in so the server isn’t hammered!
  • We destroy the object at the end with a call to the sproc sp_OADestroy.

Once the code has run, what do we have?

SELECT * FROM rss.Response
GO

Result of the query

We can parse the XML using SQL to give us, for example, the first 3 news items from each RSS feed:

;WITH cte_Parse AS
(
	SELECT
		r.SourceID
		,SortOrder	= ROW_NUMBER() OVER (PARTITION BY r.SourceID ORDER BY x.i)
		,title		= x.i.value('(title/text())[1]','VARCHAR(MAX)')
		,pubdate	= x.i.value('(pubDate/text())[1]','VARCHAR(255)')
	FROM rss.Response r
	CROSS APPLY r.ResponseData.nodes('//item') x(i)
)
SELECT *
	FROM cte_Parse
	WHERE SortOrder <= 3
GO

which gives us:

Result of the query

From this output, it looks like there isn’t a set standard for dates in RSS feeds; the Yahoo dates could be directly CAST to a DATETIME, but you’ll have to work a bit harder for the dates from the BBC (especially if you want to maintain the timezone).

And finally

Being able to pull data into SQL Server like this is really very handy, it’s a great addition to the toolbox. I wouldn’t necessarily rely on code like this in a production environment, but I’ve been using similar in local, dev and BI environments for years without any problems.

As ever, I’d love to hear from you, please do let me know if you’ve used sp_OACreate and/or any of the HTTP OLE objects for anything interesting or off-beat!

, , , , ,

Leave a comment

Bad numbers days

Here’s a scenario that most BI people will be familiar with: It’s 10:30am, and your boss comes up to you: “Usually we’ve had X applications* by this time in the morning, but today we’ve only had Y! What’s going on?”, where of course Y is a number quite a bit smaller than X. What to do? Why are the figures are as low as they’re being reported?

* For ‘applications’, read ‘events’, whatever’s applicable to you

Okay, some part of `the process’ could be broken — but you’ve probably already got monitoring in place to catch things like services falling over, or external providers going down. There are a couple of statistical possibilities — for one, maybe it’s just a matter of Regression toward the mean. In other words, you’ll see the number of applications (events) you were expecting, it’s just that randomness has distributed fewer in the morning, and more in the afternoon. (This has happened to me too often – I spend my lunch break looking for the problem, only to have the numbers gradually return to normal by early afternoon, and over half of the day has been wasted unnecessarily.)

The other possibility: it’s a ‘bad numbers day’, statistically speaking. But how often do they actually occur?

The next bad day

We’re really interested in answering the question, “When can we expect our next bad day?”, but obviously you can never say with any certainty when it’s going to occur. However, we can look at some simulated numbers, for large numbers of hypothetical histories, and see where those bad days occur, on average.

Say the number of applications (‘events’ from hereon) we see in a day is normally distributed with a mean of 100 and standard deviation of 10 (rounded to the nearest integer). Here’s a histogram of 10,000 random samples from N(\mu=100,\sigma=10). The distribution looks pretty reasonable:

All the samples are independent — that is, the number of events on a particular day are not correlated with the number of events on the next day, or any other day.

(If I was a proper statistician, I’d probably be using a Poisson distribution, or even a non-homogenous Poisson as the events don’t occur with the same probability throughout the day. For illustrative purposes, a rounded Normal is fine.)

Let m_k be the kth rolling minimum (‘instance’), where m_0 is our starting minimum, and let v_i represent the number of events on the ith day. Then, for example, if our initial minimum m_0 = 100 (our mean value) and our first 10 values of v_i are {90, 91, 85, 87, 89, 90, 95, 84, 84,86} then we have k=3 minima {90, 85, 84}, such that m_1 = 1, m_2 = 3, m_3 = 8. In other words, day 1 is our worst ever day; the next worst ever day occurs on day 3; and the one after that occurs on day 8.

We’re going repeat this process 10,000 times, and look at a few different schemes, not just ‘worst ever’ (i.e. ‘worst so far’). Starting with m_0 = 100, we’ll calculate the days on which the number of events is:

  • A : less than the current minimum (m_k = i if v_i < m_{k-1})
  • B : less than or equal to the current minimum (m_k = i if v_i \leq m_{k-1})
  • C : less than the current minimum + 1% (m_k = i if v_i < m_{k-1} \times 1.01)
  • D : less than the current minimum + 10% (m_k = i if v_i < m_{k-1} \times 1.10)
  • E : less than the current minimum + 50% (m_k = i if v_i < m_{k-1} \times 1.50)

We’ll restrict ourselves to the first 8 instances (days on which these minimums occur); it’ll save on processing, plus that’s all we need to show the trend.

I’ve uploaded my R code here, but unfortunately it’s a .docx file, as WordPress won’t allow you to upload .R or .txt files, for ‘security’ reasons.


Results

NB: Because the distributions are asymmetrical and have long upper tails, we’ll report the median, rather than the mean.

Scheme A

The results for our first scheme (A) are below:

Scheme 1st day 2nd day 3rd day 4th day 5th day 6th day 7th day 8th day
A 2 6 17 51 159 470 1309 3148

This says, that for scheme A, the 1st ‘worst day ever’ happened on day 2 for 50% of the 10,000 simulations; subsequently, the next (2nd) ‘worst day ever’ was on day 6, the 3rd was on day 17, etc. Early on, you encounter lots of ‘worst days ever’, but then they stretch out – there’s nearly a year between the 5th and 6th ‘worst ever’ days, and then over two years until the next one. I’d say this feels about right. (NB: Except for the last one, the ratio between one ‘worst day’ and the next is approximately 3.) Below are graphs showing distributions for each ‘kth worst day’:

NB: For n > 1, to make the shape clearer, the graph only shows the bottom 80% of the distribution.

Schemes B, C and D

As you would expect, with the `looser’ schemes, the median bad days are happening sooner:

Scheme 1st day 2nd day 3rd day 4th day 5th day 6th day 7th day 8th day
B 1 5 13 34 85 200 458 971
C 1 3 7 12 20 32 50 79
D 1 2 4 6 9 13 18 23

Clearly, the ratios between one instance and the previous get smaller, as the criteria are loosened.

Scheme B
Scheme C
Scheme D

Scheme E

The results for scheme E are:

Scheme 1st day 2nd day 3rd day 4th day 5th day 6th day 7th day 8th day
E 1 2 3 4 5 6 7 8

These figures are really just showing that the range in which the next sample can be considered a minimum is so large, that it’s unsurprising that we’re flagging results as bad. For the first instance (with my particular simulated results), 100% are on day 1; for the 8th instance, 94.2% are on day 8.


End notes

This blog post is the result of an idle thought, after experiencing a `bad numbers day’ at work; my gut feeling was that these days were probably quite frequent, but it looks like they’re more frequent than common-sense might lead you to believe.

(If I had more time to spend on this, I’d look at what the long-term ratios (m_k / m_{k-1}) are for these schemes and others — as mentioned above, for Scheme A it’s about 3, and it’s clearly lower for schemes B to D.)

Of course, the numbers above are purely hypothetical — the real numbers for your situation will likely fluctuate far more on a day-to-day, week-to-week, and month-to-month basis (not to mention intra-day).

In a nutshell: if your numbers are looking low compared to yesterday, don’t panic and assume something’s broken — it might just be probability doing its thing.

, ,

Leave a comment

DATETIMEOFFSET in T-SQL

In my experience, most developers will shudder when you mention DATETIMEOFFSETs to them. I’ve just started working on an application where (most of) the date/time data is stored as DATETIMEOFFSET, and it’s been so long since I’ve had to deal with them, I’ve forgotten most of what I knew! Hence, this blog post to remind myself of the key points, and hopefully it’ll be of use to someone else. (This isn’t supposed to be an exhaustive look at everything to do with DATETIMEOFFSET, just some pointers to the issues I’ve encountered.)

The basics: T-SQL (the variant of SQL used by Microsoft SQL Server) has an extra date/time data type, DATETIMEOFFSET, that stores an offset relating to a timezone. For example, you could represent the time 07:30 (using the 24-hour clock) in Algeria on the 26th July as:

2021-07-26 07:30:00.000 +01:00

And 12:00 (12pm midday) in India as:

2021-07-26 12:00:00.000 +05:30

The datatype stores an offset in minutes; in the string representations shown above, that’s the +01:00 for Algeria, and +05:30 for India.

Coordinated Universal Time, known as UTC, is the ‘base’ timezone, for which the offset is zero minutes, +00:00. In some senses, it’s a successor to Greenwich Mean Time, aka GMT. For our purposes, they’re identical, and interchangeable.

We can see what a DATETIMEOFFSET looks like converted to UTC, using the AT TIME ZONE syntax:

DECLARE @dto_Algeria DATETIMEOFFSET = '2021-07-26 07:30:00.000 +01:00'
DECLARE @dto_India DATETIMEOFFSET   = '2021-07-26 12:00:00.000 +05:30'

SELECT @dto_Algeria AT TIME ZONE 'UTC'
SELECT @dto_India AT TIME ZONE 'UTC'

2021-07-26 06:30:00.0000000 +00:00
2021-07-26 06:30:00.0000000 +00:00

(For all the possible timezones, see e.g. Get a List of Supported Time Zones in SQL Server (T-SQL).)

The times are the same in UTC — if something happens at 7:30am local time in Algeria, and 12pm local time in India, it’s happening simultaneously. We can use the standard DATENAME and DATEPART functions to get the offsets, both the string representation and the value in minutes:

SELECT DATENAME(tzoffset, @dto_Algeria), DATEPART(tzoffset, @dto_Algeria)
SELECT DATENAME(tzoffset, @dto_India), DATEPART(tzoffset, @dto_India)

+01:00    60
+05:30   330

All the usual date/time functions (DATENAME, DATEPART, DATEADD, DATEDIFF) work with DATETIMEOFFSET in the same way they do with the DATETIME and DATETIME2 datatypes, although you might need to be careful when doing a DATEDIFF between DATETIMEOFFSETs with different offsets. So far, so straightforward — why would anyone have issues working with DATETIMEOFFSETs?

The problems

1. Nomenclature

People mix up offset and timezone, all the time. However, they aren’t synonyms.

SQL Server uses the timezone — e.g. ‘UTC’, ‘Central European Standard Time’, ‘India Standard Time’ — to determine the offset: +00:00, +01:00, +05:30. You can’t get back to the timezone by just knowing the offset. The DATETIMEOFFSET datatype doesn’t store the timezone, just the offset.

Annoyingly, to get the offset using DATENAME/DATEPART, SQL Server lets you abbreviate tzoffset to tz, which leads people to refer to timezone and offset interchangeably.

How do we get the current timezone that SQL Server is using? In modern versions, it’s simply:

SELECT CURRENT_TIMEZONE()

but in older versions, you have to retrieve it from the registry:

DECLARE @TimeZone VARCHAR(50)
EXEC MASTER.dbo.xp_regread
	'HKEY_LOCAL_MACHINE'
	,'SYSTEM\CurrentControlSet\Control\TimeZoneInformation'
	,'TimeZoneKeyName',@TimeZone OUT
SELECT @TimeZone
GMT Standard Time

(taken from here: Get the Current Time Zone of the Server in SQL Server (T-SQL))

2. DST – Daylight Saving Time

The timezone isn’t the only thing that affects the offset — it’s whether DST, Daylight Saving Time, is being applied as well. I chose India and Algeria in the examples above because neither country uses DST. Here in the UK, we ‘gain’ an hour in March and ‘lose’ an hour in October, but not on fixed dates. For example, you can look here When do the clocks change? to see when the next relevant date is. (This period in the UK summer is called British Summer Time, BST). Whether the DATETIMEOFFSET has been determined using DST is not stored in the datatype. If this is relevant to you, you’ll have to keep track of this yourself.

NB: As far as I can tell, all countries in Europe observe DST.

3. SQL Server names the timezones incorrectly

For me, this is the biggest irritant: in SQL Server, the timezone ‘GMT Standard Time’ is not GMT, it is the timezone of the British Isles, i.e. it includes BST. We can show this quite easily:

;WITH cte_Numbers AS
(
	-- Alternative to using a 'Numbers' table:
	SELECT
		TOP 10
		N = ROW_NUMBER() OVER (ORDER BY v1.number)
	FROM master..spt_values v1
	CROSS JOIN master..spt_values v2

), cte_Date AS
(
	SELECT
		N
		,dt = DATEADD(second, N-1, '2021-03-28 00:59:55')
	FROM cte_Numbers

), cte_AtTimeZone AS
(
	SELECT
		*
		,dto = dt AT TIME ZONE 'GMT Standard Time'
	FROM cte_Date
)
SELECT
	N
	,dt
	,dto
	,[tzoffset] = DATEPART(tzoffset, dto)
FROM cte_AtTimeZone
GO
N    dt                      dto                                tzoffset
---- ----------------------- ---------------------------------- -----------
1    2021-03-28 00:59:55.000 2021-03-28 00:59:55.000 +00:00     0
2    2021-03-28 00:59:56.000 2021-03-28 00:59:56.000 +00:00     0
3    2021-03-28 00:59:57.000 2021-03-28 00:59:57.000 +00:00     0
4    2021-03-28 00:59:58.000 2021-03-28 00:59:58.000 +00:00     0
5    2021-03-28 00:59:59.000 2021-03-28 00:59:59.000 +00:00     0
6    2021-03-28 01:00:00.000 2021-03-28 02:00:00.000 +01:00     60
7    2021-03-28 01:00:01.000 2021-03-28 02:00:01.000 +01:00     60
8    2021-03-28 01:00:02.000 2021-03-28 02:00:02.000 +01:00     60
9    2021-03-28 01:00:03.000 2021-03-28 02:00:03.000 +01:00     60
10   2021-03-28 01:00:04.000 2021-03-28 02:00:04.000 +01:00     60

(March 28th was when we entered BST this year — it’s always a Sunday at 1am.) As you can clearly see, GMT in SQL Server terms isn’t proper GMT, because it has taken BST into account. If you want real GMT, you have to convert to the timezone ‘UTC’.

Note that you can convert a DATETIME to a DATETIMEOFFSET using ‘AT TIME ZONE’, but the implicit conversion to DATETIMEOFFSET gives you the version with millisecond resolution, i.e. 3 decimal places; but you can have up to 7 decimal places (see the documentation here datetimeoffset (Transact-SQL) for more information).

It’s not just ‘GMT Standard Time’, other timezones are affected: e.g. “Central European Time (CET) is a standard time which is 1 hour ahead of Coordinated Universal Time (UTC)” (as detailed here), but in fact, SQL Server incorporates ‘Central European Summer Time’ into this timezone to put it an hour ahead in the summer. I wouldn’t be surprised if other timezones are treated in the same way — please check before you use them!

NB: I can see that someone has raised a similar issue here: W. Europe Standard time appears to be incorrect by 1 hour

4. Comparing with DATETIME / DATETIME2

Here’s the biggest issue I’ve come across: comparing DATETIME and DATETIME2 with DATETIMEOFFSET. You have to be very careful, you cannot just mix OFFSET data with non-OFFSET and hope everything’ll be alright – it won’t. This catches people out:

SELECT GETDATE()
SELECT SYSDATETIMEOFFSET()
SELECT CAST(GETDATE() AS DATETIMEOFFSET)
SELECT GETDATE() AT TIME ZONE 'GMT Standard Time'
2021-07-31 18:27:24.187
2021-07-31 18:27:24.1871686 +01:00
2021-07-31 18:27:24.1866667 +00:00
2021-07-31 18:27:24.187 +01:00

Look at the third result: we’ve CAST the current date/time to a DATETIMEOFFSET, but the offset it’s returning us is ‘+00:00’, not the ‘+01:00’ we were expecting (seeing as the server is set to ‘GMT Standard Time’, which is currently an hour ahead of UTC). If we want the correct offset, we either use the syntax from the fourth SELECT statement, ‘AT TIME ZONE…’ or we can add the offset ourselves manually:

SELECT TODATETIMEOFFSET(CAST(GETDATE() AS DATETIMEOFFSET), 60)
2021-07-31 18:27:24.1866667 +01:00

If you see SQL code comparing a DATETIMEOFFSET with a function of GETDATE() or SYSDATETIME(), then it needs thoroughly checking – there’s a good chance it’s wrong!

Summary

Using offsets and timezones takes some getting used to, but I’ve found that as long as I remember not to mix OFFSET and non-OFFSET data and types, then everything’s generally ok. There’s a blog post that could be written about the wisdom of using the DATETIMEOFFSET datatype in the first place — in the few places I’ve seen it used, it was definitely a case of ‘premature globalisation’, assuming the same database was going to be used if/when the company expanded overseas. In fact, for the application I’m working on, the UK and US databases have completely separate databases and code, so it’s a non-issue.

, , , , , , , ,

Leave a comment

Speeding up a FULL OUTER JOIN

Note: I am fully expecting someone to come along and comment that the reason why this works is obvious. That’s fine, at least I’ll learn something!

Usually in my blog posts, I present something and have a reasonable understanding of why that something is happening. Not so here – I’m going to show how I (massively) sped up a query, but I really don’t know why it works.

Some background: I was given a fairly lengthy query to improve, and was quietly confident I could make it more efficient and much faster. Of course, ‘pride comes before a fall’*, and for all my tweaking**, it didn’t make a great deal of difference. Eventually I realised that there was a FULL OUTER JOIN near the bottom of the code, and that was taking up the majority of the processing.

* Turns out that’s from Proverbs 16:18 — it’s always the Bible or Shakespeare, isn’t it.

** A lot of my ‘tweaks’ involve splitting out tables involved in JOINs into temp tables first, which I can then index. Quite often, the performance gains are impressive. Occasionally, they aren’t; c’est la vie!

By the time I finished with the rest of the query, the problematic piece of the code looked like this:

SELECT 
	{...columns ...}
FROM #IndexedTable1 t1
FULL OUTER JOIN #IndexedTable2 t2
	ON t2.IndexedKey = t1.IndexedKey

with about 12,000 records in one table, and 4,000 in the other. Here’s the important bit: the IndexedKey on each table was NULL-able, and a good proportion of the values were NULL. But what if they weren’t NULL, would that speed things up? Turns out the answer is yes…

Worked example

First, we need a couple of tables to join. I’ve manufactured them so that there is some pre-defined overlap by a UNIQUEIDENTIFIER (GUID/UID) key, named KeyID.

DROP TABLE IF EXISTS #MatchingKeys
GO

-- This table contains the UIDs that will be common to both tables
CREATE TABLE #MatchingKeys
(
	MatchingKeyID UNIQUEIDENTIFIER NOT NULL
)
GO

INSERT #MatchingKeys(MatchingKeyID)
	SELECT 
			MatchingKeyID = NEWID()
		FROM dbo.Numbers -- A standard Numbers table
		WHERE Number <= 3000
GO

-- Table One:
DROP TABLE IF EXISTS #TableXOne
GO

CREATE TABLE #TableXOne
(
	TableXOneID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyInt INT NOT NULL
)
GO

INSERT #TableXOne(TableXOneID,KeyID,MyInt)
SELECT
		TableXOneID = ROW_NUMBER() OVER (ORDER BY NEWID())
		,KeyID
		,MyInt = CAST(CAST(NEWID() AS VARBINARY(16)) AS INT)
	FROM
	(
		SELECT
				KeyID = MatchingKeyID
			FROM #MatchingKeys
		UNION ALL
		SELECT
				KeyID = CASE WHEN Number <= 3000 THEN NEWID() ELSE NULL END
			FROM dbo.NumbersTest 
			WHERE Number <= 57000
	) x
GO

-- Table Two:
DROP TABLE IF EXISTS #TableXTwo
GO

CREATE TABLE #TableXTwo
(
	TableXTwoID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyDate DATETIME NOT NULL
)
GO

INSERT #TableXTwo(TableXTwoID,KeyID,MyDate)
SELECT
		TableXTwoID = ROW_NUMBER() OVER (ORDER BY NEWID())
		,KeyID
		,MyDate = DATEADD(ms, CAST(CAST(NEWID() AS VARBINARY(16)) AS INT), GETDATE())
	FROM
	(
		SELECT
				KeyID = MatchingKeyID
			FROM #MatchingKeys
		UNION ALL
		SELECT
				KeyID = CASE WHEN Number <= 3000 THEN NEWID() ELSE NULL END
			FROM dbo.NumbersTest 
			WHERE Number <= 57000
	) x
GO

A quick look at our data:

SELECT COUNT(1) AS N, SUM(IIF(KeyID IS NOT NULL, 1, 0)) AS NumKeys FROM #TableXOne
SELECT COUNT(1) AS N, SUM(IIF(KeyID IS NOT NULL, 1, 0)) AS NumKeys FROM #TableXTwo
GO

N           NumKeys
----------- -----------
60000       6000

N           NumKeys
----------- -----------
60000       6000
SELECT TOP 100 * FROM #TableXOne ORDER BY TableXOneID
SELECT TOP 100 * FROM #TableXTwo ORDER BY TableXTwoID
GO
TableXOneID KeyID                                MyInt
----------- ------------------------------------ -----------
1           NULL                                 1865912662
2           NULL                                 1357143369
3           NULL                                 1464599252
4           NULL                                 1142728281
5           NULL                                 -629295448
6           NULL                                 779829015
7           BC02CED1-59E5-4F53-B3F0-5BAD4C92E9AE -291591389
8           NULL                                 -864066744
9           NULL                                 1026552142
10          NULL                                 1896807348
...
TableXTwoID KeyID                                MyDate
----------- ------------------------------------ -----------------------
1           NULL                                 2021-06-21 07:45:30.263
2           NULL                                 2021-06-14 18:55:52.193
3           NULL                                 2021-06-17 06:02:06.110
4           NULL                                 2021-06-30 06:46:41.193
5           NULL                                 2021-07-08 21:06:37.897
6           NULL                                 2021-06-30 22:44:31.940
7           NULL                                 2021-07-04 19:09:34.523
8           NULL                                 2021-06-16 10:58:35.430
9           DD67103D-A849-47BE-B04F-F96D3852E02F 2021-06-17 15:46:03.147
10          NULL                                 2021-06-24 18:13:44.010
...

(MyInt and MyDate are irrelevant, they’re just there to take up space.) Let’s look at some counts:


SELECT
	[# same] = (
		SELECT COUNT(1)
			FROM #TableXOne t1
			JOIN #TableXTwo t2
				ON t2.KeyID = t1.KeyID
	
	)
	,
	[# FOJ] = (
		SELECT COUNT(1)
			FROM #TableXOne t1
			FULL OUTER JOIN #TableXTwo t2
				ON t2.KeyID = t1.KeyID
	)
GO

# same      # FOJ
----------- -----------
3000        117000

As per our design, there are 3,000 records where the same KeyID (UID) exists in both tables, and 117,000 results if we did a FULL OUTER JOIN (that is, 60000 + 60000 – 3000 = 117000).

Ok, so now we’ll look at some performance stats:

DROP TABLE IF EXISTS #ResultsA
GO

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	INTO #ResultsA
	FROM #TableXOne t1
	FULL OUTER JOIN #TableXTwo t2
		ON t2.KeyID = t1.KeyID
GO
Table '#TableXTwo___...'. Scan count 1, logical reads 277, physical reads 0 [...]
Table '#TableXOne___...'. Scan count 1, logical reads 246, physical reads 0 [...]

SQL Server Execution Times:
  CPU time = 60390 ms,  elapsed time = 65435 ms.

Ouch, over a minute. (This is on a PC with zero load.) Let’s try an alternative formulation of the query that should give us identical results:

DROP TABLE IF EXISTS #ResultsA2
GO

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	INTO #ResultsA2
	FROM #TableXOne t1
	JOIN #TableXTwo t2
		ON t2.KeyID = t1.KeyID

UNION ALL

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	FROM #TableXOne t1
	LEFT JOIN #TableXTwo t2
		ON t2.KeyID = t1.KeyID
	WHERE t2.KeyID IS NULL

UNION ALL

SELECT
		t1.TableXOneID
		,t1.KeyID AS KeyID1
		,t1.MyInt
		,t2.TableXTwoID
		,t2.KeyID AS KeyID2
		,t2.MyDate
	FROM #TableXTwo t2
	LEFT JOIN #TableXOne t1
		ON t1.KeyID = t2.KeyID
	WHERE t1.KeyID IS NULL
GO

Table '#TableXTwo___...'. Scan count 3, logical reads 831, physical reads 0 [...]
Table '#TableXOne___...'. Scan count 3, logical reads 738, physical reads 0 [...]

SQL Server Execution Times:
  CPU time = 62047 ms,  elapsed time = 68365 ms.

Effectively the same result, just the expected increase in scans/logical reads. We’ll quickly check the results are identical:

SELECT * FROM #ResultsA EXCEPT SELECT * FROM #ResultsA2
GO
SELECT * FROM #ResultsA2 EXCEPT SELECT * FROM #ResultsA
GO
-- Empty recordsets, the tables are the same.

This is a very slow query — how can we speed it up?

The hack

Firstly, we’ll take copies of our two tables (naming them Y instead of X); and we’ll (a) replace NULL entries in KeyID with a ‘fake’ GUID, and (b) keep track of this with a new flag, IsFakeID.

DROP TABLE IF EXISTS #TableYOne
GO

CREATE TABLE #TableYOne
(
	TableYOneID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyInt INT NOT NULL
	,IsFakeID BIT NOT NULL
)
GO

DROP TABLE IF EXISTS #TableYTwo
GO

CREATE TABLE #TableYTwo
(
	TableYTwoID INT NOT NULL PRIMARY KEY CLUSTERED
	,KeyID UNIQUEIDENTIFIER NULL
	,MyDate DATETIME NOT NULL
	,IsFakeID BIT NOT NULL
)
GO

INSERT #TableYOne(TableYOneID, KeyID, MyInt, IsFakeID)
SELECT
		TableXOneID
		,KeyID = ISNULL(x.KeyID, NEWID())
		,x.MyInt
		,IsFakeID = CASE WHEN x.KeyID IS NOT NULL THEN 0 ELSE 1 END
	FROM #TableXOne x
GO

INSERT #TableYTwo(TableYTwoID, KeyID, MyDate, IsFakeID)
SELECT
		TableXTwoID
		,KeyID = ISNULL(x.KeyID, NEWID())
		,x.MyDate
		,IsFakeID = CASE WHEN x.KeyID IS NOT NULL THEN 0 ELSE 1 END
	FROM #TableXTwo x
GO

Now let’s do the FULL OUTER JOIN on these tables, and save the result. Note that in the SELECT, if our KeyID is ‘fake’, we put the NULL back as it was originally.

DROP TABLE IF EXISTS #ResultsB
GO

SELECT
		t1.TableYOneID
		,IIF(t1.[IsFakeID] = 1, NULL, t1.KeyID) AS KeyID1
		,t1.MyInt
		,t2.TableYTwoID
		,IIF(t2.[IsFakeID] = 1, NULL, t2.KeyID) AS KeyID2
		,t2.MyDate
	INTO #ResultsB
	FROM #TableYOne t1
	FULL OUTER JOIN #TableYTwo t2
		ON t2.KeyID = t1.KeyID
GO
Table '#TableYTwo___...'. Scan count 1, logical reads 142, physical reads 0 [...]
Table '#TableYOne___...'. Scan count 1, logical reads 128, physical reads 0 [...]

SQL Server Execution Times:
 CPU time = 47 ms,  elapsed time = 58 ms.

Quite the speed up: from over a minute to 58 milliseconds! Let’s check the results are the same:

SELECT * FROM #ResultsA EXCEPT SELECT * FROM #ResultsB
GO
SELECT * FROM #ResultsB EXCEPT SELECT * FROM #ResultsA
GO
-- Empty recordsets, the tables are the same.

Unfortunately, my knowledge of Execution Plans isn’t good enough to determine why the huge difference exists; as far as I can tell, the plans are identical (save an extra ‘Compute Scalar’ step in the ‘hack’ version, due to putting the ‘real’ NULLs back). The estimated number of rows for the Hash Match in the ‘hack’ version are about half the final number, which may be a clue.

Even more curiously..?

In my original work, the common keys were (already) indexed; in the example above, they’re not. If I add indexes on the key to all the tables, then the ‘non-hack’ version doesn’t use them anyway — maybe that’s not surprising, it’s sparse data. If I force the query to use the indexes, using WITH(INDEX(…)), then the query takes longer than I have patience for (it didn’t finish within 5 minutes). If I force indexes on the ‘hack’ version, then it takes about a second, but there are a lot more reads.

Summary

It’s pretty simple: this ‘hack’ works, I can show it works (it gives identical results to the ‘non-hack’ version), and it sped up the query greatly. But I don’t know why it works — moreover, why couldn’t SQL Server be doing this under the hood? (There must be a reason!) When I find out, I’ll put an update here.

, , ,

Leave a comment