sqlpete


Homepage: https://sqlpete.wordpress.com

Determining SQL datatypes from wide CSV data

Thirty second summary

When importing a CSV file, SQL Server’s Import/Export wizard isn’t great at working out the correct column datatypes for the new table — which is especially painful when the CSV is wide (has many columns) — so I wrote some SQL code to get round the issue.

 

Cartoon of a man looking up at a huge spreadsheet

Everyone working in BI / data science has to move data around between apps all the time, and it’s a constant surprise to me that this is still something of a pain. In particular, getting CSV data into a SQL Server table with the datatypes you require is a very hit-and-miss affair. The GUI tools SSMS provides aren’t really up to the task, especially when the CSV files are very wide.

In this post, I’ll demonstrate how I tackled this problem.

 

Motivation

Warning: This section is quite waffle-y, please feel free to skip

In 2018, the consumer lender Home Credit ran a competition on the data science community site Kaggle, to see who could come up with the best predictive model for a given data set: “The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan.” All the details are given here.

The competition closed 3 months later with a whopping 7180 entries! The winning team, comprising 6 senior data scientists, won $35,000. In discussion about their winning code, they wrote: “… for this kind of competition and this kind of problem, feature engineering and feature selection are still the most important step.”

Feature engineering means manipulating the data you have to generate more predictive variables, and it’s hands down my favourite part of building models — it’s basically playing with data, using as much creativity as you can muster. Some examples from my own work:

  • Parsing business names and addresses to tease out company-level data that didn’t otherwise exist
  • Sub-models of ‘customer enquiry’ data to add a much richer ‘search score’ (how ‘urgent’ the customer’s requirement for the money is)
  • Combining socio-demographic statuses to paint clearer pictures of applicants
  • Augmenting residential statuses with given electoral roll data to better explain a customer’s living situation
  • Generating huge numbers of ratios concerned with debt, income, expenses, and all aspects of affordability

It’s then straightforward to weed out those variables that aren’t predictive (part of feature selection).

Having some time on my hands, I decided to play with this dataset, and see how close I could get to the performance of the winning teams. I’m not completely daft — I’m not expecting much; some (all?) of the winning team are PhD ‘Heads of Data Science’ at huge companies. Still, it seems like a fun thing to try. Stay tuned for a future post on my efforts!

Cartoon of a man cuddling a monitor with SQL displayed on it

SQL Is Best

Now, because I’ve used it, day in and day out, for so many years, I’m accustomed to getting to grips with my data, trying to really understand it, in SQL Server — not Python, or even R. So, I want to get the data into SQL Server first. And here’s the stumbling block: we have multiple CSV files, some of them with over a hundred columns.

Anyone who uses SSMS/SQL Server knows that the Import Wizard is… let’s be polite, ‘lacking in features’. Or less polite, bloody annoying. For me, a far more flexible and fruitful approach is to import all the data as text ([N]VARCHAR), and do all our transformations using pure SQL.

The first CSV I’m looking at (application_train.csv from the competition dataset) has 122 columns, and while I could work out the datatypes manually, it’s more fun to write code to do it for me!

 

My solution

Let’s get the data into SQL Server as text.

Getting the raw data into SQL Server: BULK INSERT

I copied/pivoted the column names from the first row of the CSV, and created a SQL table app_train_raw that consisted of one integer column (the unique customer ID), and 121 VARCHAR(50) columns. I then inserted the data using BULK INSERT:

BULK INSERT dbo.app_train_raw
FROM '...\application_train.csv'
WITH (
   FORMAT = 'CSV'
   ,FIRSTROW = 2 -- i.e. skip header row
   ,ROWTERMINATOR = '0x0a' -- character 10, Line Feed
)

It took a few seconds to insert the 307,511 rows, a total of 37,516,342 data points.

Warning: BULK INSERT doesn’t give good error messages. I kept getting an error about ‘conversion error (truncation)’, so spent ages messing with text encodings, trying to find rogue characters in the CSV — when in fact, I’d miscounted the number of columns!

The code

I’m not going to present the entirety of my code, because (a) this post would be too long, and (b) it’s just ‘glue’ code. I maintain a config table with one row per column from the original CSV, recording whether I’ve already determined the data type, or created a lookup. Also, I’m doing this all iteratively: identifying the types of columns as I go along, then re-running sections of code, ignoring those with known types.
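
For reference, the config table itself can be very simple; here’s a minimal sketch of the idea (the table and column names are invented for illustration, not my actual schema):

CREATE TABLE dbo.ColumnConfig
(
   ColumnName      SYSNAME NOT NULL PRIMARY KEY  -- column name from the CSV header
   ,OrdinalPos     INT NOT NULL                  -- position of the column in the CSV
   ,FinalDatatype  VARCHAR(50) NULL              -- e.g. 'DECIMAL(9,2)'; NULL = not determined yet
   ,IsLookup       BIT NOT NULL DEFAULT(0)       -- 1 = normalised out to a lookup table
)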

From here on, I’ll post only the most relevant bits.

Initial categorisation

In my post Aggregating all the categorical data in a table, there’s code I wrote to, well, aggregate all the categorical data from a table (in order to calculate the Population Stability Index for each variable in the models).

We can use the exact same code here, and run it over all 121 VARCHAR(50) columns. You might think it would take a long time to GROUP all these columns, but it took just 19 seconds on my old PC. We’re left with the global temp table ##Results, containing 388,218 rows: each row is a distinct variable/value pair; we’re not interested in duplicate values.

Analysing the categories

The first thing we can do is look for obvious candidates for columns that we can ‘normalise out’ as lookup tables:

SELECT [var], COUNT(1)
FROM ##Results
GROUP BY [var]
HAVING COUNT(1) < 30

This led me to 13 columns (mostly ending _STATUS, _TYPE) that I turned into lookup tables. This didn’t take long, and could easily be automated.
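
To give a flavour of the lookup step, here’s a minimal sketch for a single column; I’m using NAME_CONTRACT_TYPE (one of the _TYPE columns in application_train.csv) as the example, and the lkp_ naming convention is just something I’ve made up for illustration:

-- build the lookup table from the distinct values already sitting in ##Results
SELECT LookupID    = IDENTITY(INT, 1, 1)
      ,LookupValue = [value]
  INTO dbo.lkp_NAME_CONTRACT_TYPE
  FROM ##Results
  WHERE [var] = 'NAME_CONTRACT_TYPE'
    AND [value] IS NOT NULL

-- the final table then stores the small integer LookupID instead of the raw text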

Next, let’s gather some information about the remaining columns (I’d already removed the 13 lookup columns from ##Results):

SELECT 
 [var]
 ,HasNULL     = SIGN(SUM(IIF([value] IS NULL, 1, 0)))
 ,NumCats     = COUNT(1)
 ,NumNumeric  = SUM(ISNUMERIC([value]))
 ,MinVal      = MIN([value])
 ,MaxVal      = MAX([value])
 ,MinLen      = MIN(LEN([value]))
 ,MaxLen      = MAX(LEN([value]))
 ,NumWithDP   = SUM(CASE WHEN CHARINDEX('.',[value]) > 0 THEN 1 ELSE 0 END)
 ,NumWithSN   = SUM(CASE WHEN PATINDEX('%[0-9]e[+-][0-9]%',[value]) > 0 THEN 1 ELSE 0 END)
 ,NumWithNeg  = SUM(CASE WHEN LEFT([value],1) = '-' THEN 1 ELSE 0 END)
FROM ##Results
GROUP BY [var]

The code is fairly self-explanatory: var is the column of interest, and we’re counting the number of distinct instances of values where the value is numeric, or contains a period (decimal point), or scientific notation (‘Xe-Y’), or a minus sign at the start. We also want to know the min/max alphabetic values (for identifying 0/1 or N/Y), and the min/max length of the values (for deciding between FLOAT and DECIMAL).

Using this result set, we can say:

  • If there are only 2 categories (or 3 including NULL), and the min/max values are 0/1 or N/Y, then we’ll treat this column as a BIT
  • If there are no non-numeric values, and any string contains scientific notation, or contains a decimal point and is longer than 18 characters, we’ll treat this as a FLOAT
  • If there are no non-numeric values, and no decimal points, then we’ve either got a BIGINT, INT, SMALLINT or TINYINT (not forgetting that TINYINTs can’t be negative, so we also check that the first character isn’t a minus sign).

This takes care of over half the columns.
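
To make that concrete, here’s a rough sketch of how the rules above might be expressed over the summary result set; I’m assuming it has been saved into a temp table I’ll call #ColSummary (a made-up name), and the length cut-offs for the integer types are deliberately conservative:

SELECT
   s.[var]
   ,SuggestedType =
      CASE
         -- two real categories (ignoring NULL), with 0/1 or N/Y values
         WHEN s.NumCats - s.HasNULL = 2
              AND ((s.MinVal = '0' AND s.MaxVal = '1') OR (s.MinVal = 'N' AND s.MaxVal = 'Y'))
            THEN 'BIT'
         -- every value is numeric, and we see scientific notation or long decimals
         -- (an approximation of the second rule: strictly, the same string should
         --  contain the point and exceed 18 characters)
         WHEN s.NumNumeric = s.NumCats - s.HasNULL
              AND (s.NumWithSN > 0 OR (s.NumWithDP > 0 AND s.MaxLen > 18))
            THEN 'FLOAT'
         -- every value is numeric and there are no decimal points: an integer type,
         -- chosen here purely on string length (erring towards the larger type)
         WHEN s.NumNumeric = s.NumCats - s.HasNULL AND s.NumWithDP = 0
            THEN CASE
                    WHEN s.MaxLen <= 2 AND s.NumWithNeg = 0 THEN 'TINYINT'
                    WHEN s.MaxLen <= 4 THEN 'SMALLINT'
                    WHEN s.MaxLen <= 9 THEN 'INT'
                    ELSE 'BIGINT'
                 END
         ELSE NULL -- needs further investigation (see the next section)
      END
FROM #ColSummary s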

Numeric types

We’ve already determined some of the numeric types, but we need to investigate further. Mainly, we’re concerned with identifying the correct precision/scale parameters for DECIMAL types. Note that the CTE below only considers the numeric columns whose correct types we haven’t yet identified.

;WITH cte_UnknownOnly AS
(
SELECT
   r.[var]
   ,floatval = CAST(r.[value] AS FLOAT)
   ,lp       = LEFT(r.[value], CHARINDEX('.', r.[value])-1)
   ,rp       = SUBSTRING(r.[value], 1+CHARINDEX('.', r.[value]), 255)
 FROM ##Results r
 WHERE r.var IN (... {unknown numeric columns only}... )
)
SELECT
  r.[var]
  ,MinAsFLOAT = MIN(r.floatval)
  ,MaxAsFLOAT = MAX(r.floatval)
  ,MinLenLP   = MIN(LEN(r.lp))
  ,MaxLenLP   = MAX(LEN(r.lp))
  ,MinLenRP   = MIN(LEN(r.rp))
  ,MaxLenRP   = MAX(LEN(r.rp))
  ,NumDistRP  = COUNT(DISTINCT r.rp)
  ,MinINTLP   = MIN(CAST(r.lp AS INT))
  ,MaxINTLP   = MAX(CAST(r.lp AS INT))
  ,MinINTRP   = MIN(CAST(r.rp AS INT))
  ,MaxINTRP   = MAX(CAST(r.rp AS INT))
FROM cte_UnknownOnly r
GROUP BY r.[var]

where LP refers to the piece on the left of the decimal point (the integer part), and RP is the piece on the right (the fractional part).

From this result set, it’s easy to work out the correct parameters for DECIMALs using MaxLenLP and MaxLenRP. NumDistRP is useful for deciding whether to simplify some types, e.g. salaries are usually given as whole numbers, but a few may contain fractional parts due to calculations, and you may wish to round the figures for simplicity.
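
For example, if a column’s values never have more than 7 characters before the decimal point and 2 after it, DECIMAL(9,2) is sufficient. A quick way to generate the suggestion (a sketch, assuming the result set above has been saved into a temp table I’ve called #NumSummary):

SELECT
   n.[var]
   -- precision = digits either side of the point; MaxLenLP slightly over-counts when
   -- negative values are present (the minus sign adds to the length), which errs on
   -- the safe side
   ,SuggestedType = CONCAT('DECIMAL(', n.MaxLenLP + n.MaxLenRP, ',', n.MaxLenRP, ')')
FROM #NumSummary n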

This step took care of the rest of the columns — we now know the correct datatypes for the whole dataset.


Wrap-up

I hear you say: “This looks like a lot of work, I can eyeball the column types faster and just use the wizard.” Good luck to you, I say — I’ve spent far too many hours of my life looking at unhelpful error messages from the Import wizard. And what if you’ve got multiple CSVs with over a thousand columns in each? It’d take days to get all the types correct. This code gives me all the control over the datatypes I could want. (Consider also that I generate the INSERT/SELECT statements from the same config table; I’ve omitted the details here to save space.)

Could I ‘product-ise’ this to make it a single piece of code for anyone to run? Sort of, but not really. There are so many edge cases and ‘quirks’ to cater for. As an example here, the columns prefixed with ‘AMT_REQ_CREDIT_BUREAU’ are presented in the data as having one decimal place, but in reality, they’re all integers (as they’re counts of enquiries).

Note that if the CSV file is too big, you could restrict the categorisation to the first 1000 rows (say). This would probably be ok for well-behaved datasets… but when was the last time you got a well-behaved dataset to work on?


Finally, an anecdote: I once was involved in a project to continuously transfer data from one internal system (Microsoft Dynamics) to another (our SQL Server instance) — the ‘official’ Microsoft connector was unreliable, and flaked out several times a week. The company chose a third-party provider that had a product they claimed could transfer data between hundreds of different systems, and did exactly what we needed — or so we thought. Turns out, after we built our import layer, the product only transferred the data stripped of all formatting, i.e. we received only strings into the target system, whether the original data had been numbers, dates or text. Much swearing ensued! We couldn’t believe this was correct, and the provider was surprised that we were annoyed (“that’s how everyone does it” — is that true? Let me know if you think so…), and we then had to write a ton more code to convert the strings back to the datatypes we needed [which was where I first realised you could use SQL to work out the types]. The lesson I take away is: always check even your most utterly basic assumptions!


Diary Entry #1

A cartoon of me writing in my diary

If there’s one thing I hate, it’s not being busy. I can’t really switch off, I’m always thinking about tech I could be playing with. In that vein, here are a few of the diversions that have been occupying my time since my last post on here.

Warning: rambling ahead!

Poking My Nose Into Other People’s Business

A cartoon of an elephant operating a computer

As a boy, I drove my parents nuts by forever taking stuff apart. There wasn’t an electrical or electronic appliance in the house that I hadn’t had in pieces at some point, and I wasn’t the best at putting them back together. My bedroom floor was strewn with circuit boards, motors, lights, speakers, batteries and general metal junk. Because I’m a grown-up and can’t do that any more (bah), I like to pick apart other people’s code and algorithms, and see how they work.

Lending Calculators

A while ago, one of my projects at work was to figure out how the lending calculators for various car finance firms worked – or in a couple of cases, didn’t work.

Anyway, I found this a great deal of fun, so I considered starting a side blog detailing all my efforts in this area. However, I ran the idea past a friendly compliance person (Hi Debs!) who advised against it: big companies don’t like to be shown to be wrong, and have expensive lawyers to set on anyone who implies otherwise. (Yes, I could still do it if I anonymised the companies involved — I might still do that, but it’s extra work with screenshots, etc.)

[Aside: There was one particular online calculator, not affiliated to a particular company, but it was flexible enough that it covered the ways in which car finance is different to usual consumer lending. I played with it for days, but could not replicate its results. In the end, the CFO wrote to the author of the calculator, and paid good money for a copy of the code. Not only was the code bad, it was quite wrong, provably so. No wonder I couldn’t figure the numbers out!]

A few weeks ago, I idly started playing with the payment calculators on various lending sites. One in particular caught my eye, because it had a unique feature to it (which I’m not going to expand on, or it gives the company away). I realised that, behind-the-scenes, the website called a publicly-accessible API, so I coded up loops in SQL to automatically query the API with random values, and then analysed the results:

  • The code on the website that called the API had at least two minor bugs, which meant changes to the UI elements weren’t reflected in the calls to the API.
  • The JSON returned by the API call had some… interesting variable naming. I expect the code was imported from the US.
  • The figures were being rounded, intra-calculation, in a way I’d not seen before.
  • I could only replicate the results about 95% of the time — the other 5% only made sense if the interest rate was different to what was stated.

I’ve no idea how my investigation would be received if I emailed the company about my findings. I’ll keep it under my hat for now.

LoanSims

A cartoon of an older lady as a robot

It’s been at the back of my mind for years to create a lending company simulator: that is, starting with a budget of X, lend out Y new loans per day (according to some distribution of terms/amounts), collect payments from pre-existing loans (subject to parameters relating to propensity to pay back), and track the performance of the company day-by-day. As part of my long-term goal to make my ‘go-to’ coding language Python, I wrote LoanSims, my first pass at this app (named for legendary ‘Carry On’ actress Joan Sims).

I was flush with enthusiasm for it, but then spoke to a friend who has much more experience in the industry — he said he’d tried in the past to sell something similar to lenders, but weirdly there was little-to-no interest. Which dampened my excitement — there’s no point spending ages on code no-one will ever want, even if it is stimulating, academically, to build it. Onto the back-burner it goes!

Sus(s) Udio?

Udio is the latest AI music generator, easily stealing the crown from Suno. From their About Us page:

Udio builds AI tools to enable the next generation of music creators. We believe AI has the potential to expand musical horizons and enable anyone to create extraordinary music.

udio.com allows users to create music from simple text prompts by specifying topics, genres, and other descriptors which are then transformed into professional quality tracks.

At time of writing, I have generated 315 pieces of music using Udio — it’s currently in beta, and free (there are daily limits, but I’ve not hit them yet). You start off by giving a text prompt — e.g. “A song about beating the devil at poker, then drinking his whiskey and stealing his horse. Nineties shoegaze” — which returns two 33 second pieces of music. You can then extend either piece of music (choosing between intro / before / after / outro), or ‘remix’ it — then keep repeating the process.

A cartoon of a workman playing a keyboard and singing

The results are stunning. Eight times out of ten, the music is indistinguishable from a genuine track. It’s not perfect, and the auto-generated lyrics are cringey — but the vocals are so strong, it doesn’t seem to matter so much. (And you can provide your own lyrics!)

Because it’s one of my favourite periods, I’ve been generating music that sounds like 1970s art-rock, and while I can’t get anything that sounds close to David Bowie (you can’t use real artist names in the prompts), it’s pretty easy to get vocals that resemble Russell Mael (from Sparks), Jon Anderson (from Yes), or even Peter Gabriel or Phil Collins! (Although, so far, not Ian Anderson from Jethro Tull.)

I’ve had trouble with it not adhering to the prompt, e.g. trying to get a Eurovision-sounding song from the 1970s, it gives me 80s metal, or 2000s pop. But because it only takes a few seconds to re-generate a new piece of music, you can just try again.

It really is quite addictive. I know some people out there will be horrified at the idea of it: “music is art, it’s not the domain of algorithms”. Really, I have no answer; I have a houseful of keyboards and guitars, and I don’t feel it’s taking away from my enjoyment of making my own music. But then again, I don’t earn my living from making music.

Other ‘stuff’

  • Applying for jobs: It seems it’s not a good time to be applying for tech jobs: it’s very difficult to get recruiters to call or email back, and I’ve not had a single acknowledgement for any role I’ve applied for, whether directly, through a recruitment site, or via LinkedIn. Trying times!
  • AI not coming for us — yet: A few weeks ago, a story did the rounds about an AI developer, ‘Devin’, that looked like it could replace a whole software engineer; luckily, the truth was rather less sensational. This video is well worth 25 minutes of anyone’s time: Debunking Devin: “First AI Software Engineer” Upwork lie exposed!.
  • Pete Tonkin, Hardware Guy: I am not a hardware guy. Yes, I build my own computers, but inside they are a MESS. However, I had to replace the screen on my son’s laptop recently, and it went smoother than these things usually do! I’ll take the win.
  • General file admin: I have 25 years of hard drives, CD-Rs and DVD-Rs with umpteen projects in various states of completion, so I’ve tasked myself with collating all the files, with a view to creating ONE single repository of all my work. It’s a nice goal to have, but realistically the chances of me completing the task are slim-to-none, because something more interesting is always just around the corner…

I might do more of these ‘diary entries’, if only to remind myself what it is I’ve been doing, and what tech I’ve been playing around with!


Converting R code to Python: A Test

Background

I’ve been using R as my statistics programming environment for decades. In fact, I started out using its predecessor, an earlier (non-open source) program called S. In the last 10 or so years, the role of data scientist has appeared, and in the main, data scientists use the Python programming language, which comes with many statistical libraries of its own. In terms of statistics, there’s a huge overlap in the capabilities of R and Python — pretty much anything you can do in R, you can do in Python, and vice-versa. Of course, Python is also used everywhere these days in places you wouldn’t consider using R, e.g. building apps or websites.

In 2020, I completed a few Python modules at DataCamp, but haven’t needed to use it much since. However, Python is pretty much the #1 requirement when it comes to applying for data science positions, so I need to get comfortable with it.

Since I have my own (fairly extensive) library of R functions for various modelling, credit risk and maths purposes, it’d be nice to port all that code to Python — and even nicer if it could be done with as little effort as possible on my part! 😀 So let’s see how we get on with a small-ish function.

The function in R

In the 90s, my MSc dissertation was concerned with exact tests in contingency tables: see the wiki entry Fisher’s exact test for more details. In the PhD thesis of Dr Danius Michaelides, Exact Tests via Complete Enumeration: A Distributed Computing Approach, he describes an algorithm that enumerates integer partitions, subject to constraints (see Section 2.3.2 and Figure 2.2). This algorithm is used in the enumeration of contingency tables with fixed margins.

I’ve recreated this algorithm from his description. The function partitions_with_constraints takes two parameters: v, an array of maximum values, and N, the integer to be partitioned; and returns a matrix of all the possible partitions.

partitions_with_constraints <- function(v, N)
{
  vl = length(v);
  stopifnot( vl >= 2 );
  stopifnot( all(v > 0) );
  stopifnot( N <= sum(v) );

  cells = rep(NA,vl);

  res = c();

  i = 1;

  while (i > 0)
  {
    allocated = if(i==1) 0 else sum(cells[1:(i-1)]);
    remaining = N - allocated;
    
    if (i == vl) {

      # We're at the last cell,
      # which can only be the amount remaining.
      # Add the cells array to our results.

      cells[i] = remaining;
      res = rbind(res, cells);
      i = i - 1

    } else {

      if (is.na(cells[i])) {
        # Set the cell to the minimum possible
        cells[i] = max(0, remaining - sum(v[(i+1):vl]));
      } else {
        # Otherwise, increment
        cells[i] = cells[i] + 1
      }
      cell.max = min(remaining, v[i]);

      if (cells[i] > cell.max) {
        # We've hit the maximum; blank the cell, and move back one
        cells[i] = NA;
        i = i - 1;
      } else {
        i = i + 1;
      }
    }
  }

  # Restore the indices on the rows:
  res = matrix(res, nrow=dim(res)[1], byrow=F);
  return(res);
}

If we run it using the example from the thesis, we get:

partitions_with_constraints(c(3,1,3),4);
     [,1] [,2] [,3]
[1,]    0    1    3
[2,]    1    0    3
[3,]    1    1    2
[4,]    2    0    2
[5,]    2    1    1
[6,]    3    0    1
[7,]    3    1    0

To explain the output in words: there are 7 ways in which we can partition the number 4 into a sum of three numbers, where the first number is between 0 and 3, the second number is between 0 and 1, and the third number is between 0 and 3.

Converting the code with Copilot

For the conversion, I used Microsoft Copilot. (There is a ‘pro’ version, but I’m using the free website version.) If you google Copilot, it seems to be referred to mainly as a ‘chatbot’, but that term feels rather inadequate — so I asked Copilot how to describe itself! It said “Microsoft Copilot is an AI companion designed to assist users by providing information, answering questions, and engaging in conversation.” An ‘AI companion’ is a better description, I think.

Into the ‘Ask me anything’ box, I wrote “Please convert this R function into Python:” and pasted in the code from above. Within a few seconds, it returned the Python code you see below.

The function in Python

Here’s the code that Copilot returned:

import numpy as np

def partitions_with_constraints(v, N):
    vl = len(v)
    assert vl >= 2
    assert all(i > 0 for i in v)
    assert N <= sum(v)

    cells = [None]*vl
    res = []

    i = 1

    while i > 0:
        allocated = 0 if i == 1 else sum(cells[0:i-1])
        remaining = N - allocated

        if i == vl:
            cells[i-1] = remaining
            res.append(cells[:])
            i -= 1
        else:
            if cells[i-1] is None:
                cells[i-1] = max(0, remaining - sum(v[i:vl]))
            else:
                cells[i-1] += 1
            cell_max = min(remaining, v[i-1])

            if cells[i-1] > cell_max:
                cells[i-1] = None
                i -= 1
            else:
                i += 1

    return np.array(res)

It also output some additional text, describing how the primitive function calls are translated: “This Python function should behave similarly to your R function. Please note that Python uses 0-based indexing, unlike R which uses 1-based indexing. This means that some adjustments were needed in the translation. Also, Python does not have a direct equivalent to R’s NA, so None is used instead. The rbind function in R is replaced with list append and conversion to a numpy array in Python. The stopifnot function in R is replaced with assert in Python. The all and sum functions in Python behave similarly to their counterparts in R. The length function in R is replaced with len in Python. The rep function in R is replaced with list multiplication in Python. The max and min functions in Python behave similarly to their counterparts in R.”

For me, the most impressive thing is the conversion of arrays from 1-based indexing (i.e. arrays are indexed starting from 1 in R) to 0-based (as in Python, like e.g. C/C++); that wasn’t specified anywhere in the instructions: it just ‘knew’ to do it.

I often use an online Python runner when the code isn’t too big; this time, I used Online Python. I pasted the code in, plus the call to the function with the same parameters as in the R version:

print(partitions_with_constraints([3,1,3],4))
[[0 1 3]
 [1 0 3]
 [1 1 2]
 [2 0 2]
 [2 1 1]
 [3 0 1]
 [3 1 0]]

which you can check is the same data as the R version, only the output formatting is different.

Full disclosure: the code didn’t run first time — I’d made a tiny mistake in one of my assertions, that Copilot just translated without picking up on it. And when it did run, there was no output, because I’d forgotten the print(…) around the function call. But these were very minor issues, fixed in seconds.


Copilot generating Python from a description

I want to present another instance of Copilot helping with coding; in this case, not a conversion, but coding an algorithm from a description of what the algorithm should do.

As I’ve mentioned before (in the post Practical queries using binary operations), I occasionally toy around with a Sudoku solver I’ve written in SQL. There’s one piece of cell-checking code I haven’t written yet, so could Copilot write it for me?

My prompt: “I am writing a Sudoku solver. For each remaining cell, I have a list of possible values for that cell. For example, I have a row where four cells have been filled in: 1,4,6 and 8. The remaining possible cell values are (3,7),(2,3,5),(2,3,9),(5,9),(5,7,9), where parentheses delimit cells. What algorithm can I use to find all possible combinations of cell values? E.g. for my example here, 3,2,9,5,7 is a possibility.”

Copilot wrote: You can use a recursive backtracking algorithm to find all possible combinations of cell values. Here’s a Python function that does this:

def solve_sudoku(possible_values, index=0, solution=[]):
    if index == len(possible_values):
        print(solution)
        return

    for value in possible_values[index]:
        if value not in solution:
            solve_sudoku(possible_values, index + 1, solution + [value])

possible_values = [(3,7),(2,3,5),(2,3,9),(5,9),(5,7,9)]
solve_sudoku(possible_values)

Looks promising… here’s the output when I run the code at Online Python:

[3, 2, 9, 5, 7]
[3, 5, 2, 9, 7]
[7, 2, 3, 5, 9]
[7, 2, 3, 9, 5]
[7, 3, 2, 5, 9]
[7, 3, 2, 9, 5]

It worked, first time.

Then, I asked if it could convert the code to SQL:

SQL is not designed for this type of recursive computation... That being said, you can use recursive Common Table Expressions (CTEs) in SQL for some types of recursive problems. However, the problem you’re trying to solve involves backtracking, which is not straightforward to implement in SQL.

Oh well, you can’t win them all! [I’m not entirely sure how I’m going to convert it to SQL, but the Python code has given me some ideas, so that’s better than nothing.]


Final thoughts

This all worked much better than I thought it would, and the odds are very good that I can convert all my R code to Python without too much work. Sure, the algorithms above aren’t that complicated, but none of my R code is that complicated; most of the functions are wrappers around calls to R libraries. If direct conversion doesn’t work, I can try asking Copilot to recreate the code from a text description of the input/output.


Getting to grips with missing pennies

A sad penny

Over the past few years, I’ve created an awful lot of amortization schedules for loans, by hand and via code. As I read around the subject, I kept seeing references to ‘missing pennies’, but never really understood what was being written about. The way I do my calculations, I don’t get any ‘missing pennies’!

At last, with the help of Microsoft Copilot, I finally understand. I asked Copilot to look at the webpage Business Math: 13.3 AMORTIZATION SCHEDULE, and explain the ‘missing pennies’ section, and it did just that, first time.

As a result, I’ve written a document explaining the process:

Click here for the PDF document ‘Explaining Missing Pennies in Amortization Schedules’

So why hadn’t I understood it before now? Because I think about loan schedules as a programmer: ‘missing pennies’ are only a thing if you leave all the rounding to a single pass at the end, and it wouldn’t cross my mind to do that.


Looking for new opportunities

A man looking for information in a sea of chaotic data

I’m currently looking for new and interesting work opportunities — if you like my blog and think my many years of experience could benefit your company, please drop me a line! My email address is pete.tonkin@gmail.com.


Applying an XSLT to XML in SQL Server

XML, XSD and XSLT

I’ve mentioned a few times how much I like working with XML inside SQL Server. Using XML, you can store complicated formats that would be a real pain to replicate as standard normalised tables. However, there are a couple of limitations:

  • You can’t validate XML ‘on the fly’ using XSDs; an XSD (XML Schema Definition) is a way to “describe and validate the structure and content of an XML document”. You can create XML schema collections in SQL Server, but they’re created and modified like other static objects (e.g. tables, functions, sprocs); it’d be nice to have something more dynamic. I’ve covered XSDs before: XSDs FTW.
  • You can’t apply XSLTs (eXtensible Stylesheet Language Transformations) which would allow you to transform XML from one structure to another. You can achieve some limited transformation by manipulating individual XML nodes, but it’s messy and long-winded. XSLTs are nicer and built for the job.

Both XSDs and XSLTs are very standard tech, nearly 30 years old, and supported nearly everywhere. (Well, not natively in SQL Server!)

So why do I want to be able to use XSLTs and XSDs?

Background

In the dim and distant past, I was one of a small team of data scientists for a finance company that owned and managed several lenders. We had to deal with the output of several existing scorecards and rulesets (as well as building new and better ones), and, frankly, our databases got messy very quickly, and started to be a barrier to getting things done.

I had an idea: to consolidate all the scorecards (and input/output), using an XML framework within our SQL Server instance. XML manipulation would be needed, for which we could elegantly use XSLTs. I presented a working model of this idea to the CTO… and he didn’t like it. 😦

Luckily, I’d been contacted by the CTO of another lending company who wanted me to go and work for them; they had been assessing potential customers solely via rulesets (very common for start-up lenders at the time), and they needed ‘proper’ credit risk scorecards, which I’ve been building for years. But, crucially, they also needed a mechanism to process the application data through the scorecards — and I knew I could use my XML/XSLTs set-up for this.

The Scorecard Engine

My scorecard sub-system, powered entirely by SQL Server, I referred to as ‘The Scorecard Engine’ (so original 🙄), but it could process anything that could be coded around XML and XSLTs. There were several transform/validation steps from data input (credit files + customer-supplied data + what we knew about the customer already) to returning a final ‘yes/no’ approval; I kept track of all the inputs, intermediate data, and outputs (using the Request / Response pattern), hence debugging at any stage was simple (and I created a suite of scripts purely for that purpose).

The XSLT function

I’m just going to focus on the XSLT functionality here; the original assembly also had other functions to validate XSDs (i.e. was it a valid Schema Definition format?), and to validate XML against XSDs.

As you’ve probably guessed, the functions were built using a SQLCLR assembly; see my recent post, Inverting a matrix in SQL Server for the fun I had creating the assembly in Visual Studio — all those comments apply here(!)

Here’s the code for ApplyXSLT, which takes two XML datatypes (the source XML, and the XSLT), and returns transformed XML.

using System;
using System.Data.SqlTypes;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Xsl;
using System.IO;
using System.Text;

public partial class UserDefinedFunctions
{
  [Microsoft.SqlServer.Server.SqlFunction]
  public static SqlString ApplyXSLT(SqlXml inputXML, SqlXml inputXSLT)
  {
    XslCompiledTransform proc = new XslCompiledTransform();

    using (var txr = inputXSLT.CreateReader())
    {
      proc.Load(txr);
    }
    
    StringWriter stringWriter = new StringWriter();
    XmlTextWriter xmlWriter = new XmlTextWriter(stringWriter);

    using (var ixr = inputXML.CreateReader())
    {
      proc.Transform(ixr, xmlWriter);
    }

    return new System.Data.SqlTypes.SqlString(stringWriter.ToString());
  }

}

Full disclosure:

  • Like all good developers, I am liberally ‘inspired’ by stackoverflow and similar places; I might have borrowed some lines from e.g. this stackoverflow post.
  • This isn’t the ‘production’ version of the function. The final version that was published to SQL Server had a thorough vetting by experienced C# devs who know far more about these things than I do. The code above works perfectly on my local PC, though.
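
For completeness, registering the compiled assembly and exposing the method as a T-SQL function looks something like the sketch below. The assembly name and DLL path are made up, and on newer versions of SQL Server you’ll also need to deal with ‘CLR strict security’ (signing the assembly, or marking it as trusted):

-- enable CLR integration (instance-level setting; talk to your DBA first!)
EXEC sp_configure 'clr enabled', 1
RECONFIGURE
GO

-- register the compiled DLL (name and path are invented)
CREATE ASSEMBLY XmlHelpers
FROM 'C:\assemblies\XmlHelpers.dll'
WITH PERMISSION_SET = SAFE
GO

-- expose the C# method as a function; the class is UserDefinedFunctions, as per the code above
CREATE FUNCTION dbo.ApplyXSLT (@inputXML XML, @inputXSLT XML)
RETURNS NVARCHAR(MAX)
AS EXTERNAL NAME XmlHelpers.UserDefinedFunctions.ApplyXSLT
GO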

Using the function in SQL Server

So how do we use the CLR function in SQL Server? Below is a toy example that takes a very simple piece of XML, and transforms it using an XSLT:

DECLARE @inputXML XML
DECLARE @inputXSLT XML

SET @inputXML = '
<application>
  <person>
    <firstName>Pete</firstName>
    <lastName>Tonkin</lastName>
  </person>
</application>'

SET @inputXSLT = '<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <Customer>
      <FullName><xsl:value-of select="concat(//person/firstName,'' '',//person/lastName)"/></FullName>
    </Customer>
  </xsl:template>
</xsl:stylesheet>'

DECLARE @resultXML XML
SET @resultXML = dbo.ApplyXSLT(@inputXML, @inputXSLT)

SELECT @resultXML AS resultXML
GO

which returns:

<Customer>
  <FullName>Pete Tonkin</FullName>
</Customer>

The production XML files were up to several MB in size, and the scorecard XSLTs were several hundreds of lines long; but end-to-end, the whole process only took 150-200 milliseconds per application (which involved several rounds of applying XSLTs to XML). Not lightning fast, but perfectly adequate.


And they all lived happily ever after

I rarely blow my own trumpet, but within six weeks of me joining the new company, they had two new fully-tested scorecards (one for each loan product), along with the Scorecard Engine deployed into the main application pipeline. It worked, flawlessly — not a single bug or unexpected result in all the years it was in operation. As usual, this was down to extensive pre-release testing: I literally spent 2-3 weeks running old data through it before it was put live, looking for errors and surprises in the data.

As I recall, the only problem we ever had was when the credit bureau we used (one of the top 3) updated the layout of the credit files they sent us, and miscommunicated the change, which meant an outage for an hour or two while we hurriedly patched our configuration files. (I won’t be caught out like that again!)

At one point, I had plans to turn the Scorecard Engine into more of a self-contained product (rather than just something that lived inside SQL Server), but it never transpired. There’s a massive amount of work that has to be done to turn a few thousand lines of GUI-less code into a shiny standalone application.


Good practice: MD5 everything

If you’re tech-savvy, you more than likely already know what MD5s are and what they’re used for, and hopefully, why they’re important. If that doesn’t apply, please read on…

In the previous post, I mentioned that I’d received a hard drive full of database backup files. Given that the drive was sent to us by a multi-billion dollar tech provider, I was disappointed to notice there was no md5 file. An md5 file is a simple text file containing 16-byte hashes of files; inside, it looks like this:

efe55da6f1f14dc194a100319a2738da *Egyptian fraction decomposition.R
583d89e10ed6cef093ba2989c95e3087 *std_functions_predict.R
04c7d6c9ac565236fbec6b3129e53342 *Term Calculation 2020-10-28.7z
160efdf20d1a677edfa07f6cb28ac498 *Term Calculation 2020-11-29.7z
510316459eadc3dfa404ea7de69db8f2 *Term Calculation/2020-11-30/2020-12-01 2042.RData
6dbc680a15f8f2f92405f93658155c27 *2019-07-27/hexadile/Franel/franel.pdf

Before we go any further, some terminology to keep things clear:

  • MD5: a hashing algorithm
  • md5 file: a file containing MD5 hashes, the Windows extension is .md5
  • source file(s): A file or files that we run the MD5 algorithm over to generate an md5 file
  • hash check: re-calculating the hash for a source file to check that it agrees with e.g. the hash in an accompanying md5 file

A hash is a representation of a source file’s contents, and it turns out you only need a small number of bytes for the hash to be incredibly useful. MD5 is just one algorithm; there are others, e.g. the Secure Hash Algorithms SHA-1, SHA-2 and SHA-3. Most of what I’m saying about MD5 here applies to those other hashes; I’m concentrating on MD5 because it’s so commonly used.

Why do we use MD5s?

Several reasons, but the main application I’m concerned with here is file integrity. I want to know definitively that the files I receive or download haven’t been corrupted in any way by errors due to transmission or copying.

To take my original example, the tech provider should’ve MD5’ed the source files on their own machine, before copying them to the hard drive they sent us. I could then check for myself, using the md5 file, that the source files hadn’t been altered or corrupted.

When to use MD5s

I use them every single day! Every time I move any files between drives, I generate an md5 file first, and verify the destination files before I delete the original. The only times I don’t are (a) when moving a file around on the same drive; and (b) when the file is trivial to recreate (e.g. I’ve pasted the information in from a web page).

Note that some file formats have hashing ‘built-in’: for example, most (all?) compressed ‘zip’ formats; also the SQL Server .bak files I mentioned above.

For the even more careful person, the elderly but much-loved WinRAR can add a ‘recovery record’ to the .rar files it creates. This makes the .rar file slightly larger, but it can self-correct a certain number of errors, should the .rar file become corrupted. (See also PAR2 files.)

Also, torrent files contain hashes (SHA, not MD5) for pieces of the files being downloaded — however, there’s no single ‘overall’ hash for each file referenced by the torrent. Recently, I learned about pyrocore: “pyrocore is part of the PyroScope family of projects, and offers a collection of tools for the BitTorrent Protocol and especially the rTorrent client”, including hashcheck which lets you validate a download using the .torrent file, outside of a torrenting app.

Can’t I trust Windows to move or copy a file?

Frankly, no. I know this isn’t an official Microsoft statement, but still, from Does Windows perform CRC when copying files between drives?: “Windows does not perform error detecting when copying files. That would introduce extra overhead that would slow a computer’s performance.” (A CRC (cyclic redundancy check) is similar to a hash check; people often use the terms interchangeably.)

It’s patently clear to me that Windows doesn’t do any checking when copying or moving files. I’ve been using Windows since version 3.1, and I would say that prior to Windows 7, when copying files between drives and around a network, I’d experience failed copies (i.e. the destination file having a different hash to the source) multiple times per week. The situation is much better now, but it’s not perfect — I’d say I see 1-2 file transfers failing every 6 to 8 weeks.

Two current examples:

  • I downloaded a large zip file (~500 MB) from Companies House last week (see my blog post SQL Server Full-Text Search for context), and there was an all-too-common network interruption; my Chrome browser let me resume the download. When it finished downloading, the zip file was corrupt (the ‘Test’ functionality failed within 7-Zip). If it hadn’t been a zip file, I wouldn’t have had a way of checking the file integrity, as the website didn’t provide a hash for me to check.
  • I have a USB hub plugged into my PC to make connecting/disconnecting USB devices much easier. If I use two external HDDs at the same time, I get corrupted reads/writes around 50% of the time. If I use one drive at a time, zero problems. But Windows doesn’t ever alert me to these errors! The only reason I found out it was happening at all is that I use md5 files as a force of habit!

How do I generate a .md5 file?

My standard ‘go-to’s are (a) MD5summer for Windows (it’s about the first thing I add to any new Windows installation); and (b) md5sum in Cygwin — as I’ve mentioned before, I use bash scripts in Cygwin all the time, for manipulating and transforming files. I have a few custom scripts that I run to generate .md5 files for files that don’t already have them, and to verify .md5 files after copying.

Additionally, PowerShell in Windows has the Get-FileHash function, e.g.:

Get-FileHash -Algorithm MD5 folder\*.bak

which is fine for a single hash here and there, but as far as I can tell, it’s not straightforward to generate a standard .md5 file. (If I have to use PowerShell for this, it only takes a few seconds to convert the output into standard .md5 format using Find/Replace in Notepad++.)

Other uses for MD5

Above, I’m only talking about the use of md5 files when moving or copying files around, but there are other uses, for example:

  • De-duplicating files: if I’ve thousands of files in a folder, some of which might be dupes that I want to remove, then I can generate an md5 file for the folder, open that file in Excel, then sort the hashes to find the dupes.
  • I use MD5 hashing extensively in databases when I’m storing blocks of XML. I typically have two indexed columns with MD5 hashes in: one hash for the compressed XML, another for the original uncompressed XML. Not only does it help with data integrity, but if I’m inserting or updating the XML blocks, I can use hashes to check whether the data already exists (there’s a sketch of this below).
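
A minimal sketch of that second pattern (the table and column names here are invented; also note that COMPRESS needs SQL Server 2016 or later, and prior to 2016 HASHBYTES was limited to 8,000 bytes of input):

CREATE TABLE dbo.XmlStore
(
   XmlID            INT IDENTITY(1,1) PRIMARY KEY
   ,UncompressedMD5 BINARY(16) NOT NULL    -- MD5 of the original XML string
   ,CompressedMD5   BINARY(16) NOT NULL    -- MD5 of the compressed bytes
   ,CompressedXml   VARBINARY(MAX) NOT NULL
)
CREATE INDEX IX_XmlStore_UncompressedMD5 ON dbo.XmlStore(UncompressedMD5)

DECLARE @xml NVARCHAR(MAX) = N'<application><person>Pete</person></application>'
       ,@md5 BINARY(16)
       ,@zip VARBINARY(MAX)

SELECT @md5 = HASHBYTES('MD5', @xml)
      ,@zip = COMPRESS(@xml)

-- only insert if we haven't already stored this exact piece of XML
IF NOT EXISTS (SELECT 1 FROM dbo.XmlStore WHERE UncompressedMD5 = @md5)
BEGIN
   INSERT dbo.XmlStore(UncompressedMD5, CompressedMD5, CompressedXml)
      VALUES(@md5, HASHBYTES('MD5', @zip), @zip)
END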

Wrap-up

Some people might read this post and think I’m being fussy and over-the-top — sure, maybe. But I see it as good data and time management: the time spent incorporating MD5s into my everyday workflow is far less than the time I would spend having to recreate, re-download, or replace corrupt files.


Notes about using RESTORE VERIFYONLY

Recently, I decommissioned the entire data estate for a business*, the end result being a disk full of database .BAK files that needed restoring elsewhere. Prior to restoration, I wanted to test the files, and check that the files we’d been sent were correct.

* One of the most painfully protracted projects I’ve ever been involved in, but that’s for another day…

Now, as I’ve mentioned many times before, I am not a DBA — but like most devs, I have to dip my toe in the water every once in a while, so it fell to me to verify the dozens of .BAK files. I’m not going to reproduce all my code here, but I’ll go over the main points.

RESTORE VERIFYONLY

SQL Server has the command RESTORE VERIFYONLY which “verifies the backup but does not restore it, and checks to see that the backup set is complete and the entire backup is readable”. The (very simple) syntax is:

RESTORE VERIFYONLY FROM DISK = 'C:\folder\backupfile.bak'

There’s some more detail here at the SQL Server Backup Academy site; also check out SQL interview questions on database backups, restores and recovery – Part III from SQLShack.

My first problem: each database backup had two .BAK files associated with it. According to the company that generated the backups, this was due to them using ‘streams’, and splitting the backup into two files “reduces the time it takes to generate the backup”, they said. (I know the backups contain the data file (MDF) and the log file (LDF), but this isn’t the reason for having two files.) Luckily, having multiple .BAK files isn’t an issue, we can just add extra DISK parameters:

RESTORE VERIFYONLY FROM DISK = 'C:\folder\backupfile1.bak'
, DISK = 'C:\folder\backupfile2.bak'

I run this and get:

Attempting to restore this backup may encounter storage space problems.
Subsequent messages will provide details.
The path specified by "H:\DATA\myDatabase.mdf" is not in a valid directory.
Directory lookup for the file "I:\LOGS\myDatabase_log.ldf" failed with the
operating system error 2(The system cannot find the file specified.).
The backup set on file 1 is valid.

So it looks like, even though we’re not actually restoring the database, it doesn’t like that the original database paths don’t exist. To get round this, we need to know the logical names of the files within the backup, and we can then add MOVE parameters to RESTORE VERIFYONLY. To see the logical names, we use a different command, RESTORE FILELISTONLY:

RESTORE FILELISTONLY FROM DISK = 'C:\folder\backupfile1.bak'
, DISK = 'C:\folder\backupfile2.bak'

which returns a recordset with 2 rows and 22 columns; the first column is LogicalName, which is what we need; let’s say the logical names are MyDatabase and MyDatabase_log. Our VERIFYONLY command becomes:

RESTORE VERIFYONLY FROM DISK = 'C:\folder\backupfile1.bak'
, DISK = 'C:\folder\backupfile2.bak'
WITH FILE = 1,
MOVE N'MyDatabase' TO N'C:\tmp\myDatabase.mdf', 
MOVE N'MyDatabase_log' TO N'C:\tmp\myDatabase_log.ldf'

where we’ve specified we’re MOVE-ing our files to some dummy files in our temp directory, C:\tmp\. (During all the testing, I didn’t see any files appear in this temp directory, not even briefly, so I don’t know why this step is necessary really.) Running the above, we get:

The backup set on file 1 is valid.

Unsurprisingly, the time it takes to verify a backup is proportional to its size; for a database with two 30 GB .BAK files, it took 11 minutes on my very old PC.

NB: FILE refers to the ‘backup set file number’, which for all our .BAK files, was 1 — if I changed it to 2, I got the error message “Cannot find file ID 2 on device”.

Looping over all databases

Of course, rather than run these commands dozens of times, I’d like to automate as much as I can — so I wrote code that did the following:

  • Got the full list of .BAK files using xp_dirtree
  • Turned the list of files into a ‘nice’ temp table that I could loop over, one row per database.
  • For each row in the temp table (i.e. database):
    1. Call RESTORE FILELISTONLY to get the logical names
    2. Using these, call RESTORE VERIFYONLY

Calling the RESTORE commands

I built my RESTORE FILELISTONLY code as a SQL string, EXEC-ed that, and stored the returned recordset in a table variable. See this very useful comment on stackoverflow for the table definition I used.
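
To show the shape of that step, here’s a sketch (not my exact code); the column list matches the FILELISTONLY result set for SQL Server 2016 and later, so older versions will need a shorter definition:

DECLARE @sql NVARCHAR(MAX) =
     N'RESTORE FILELISTONLY FROM DISK = ''C:\folder\backupfile1.bak'''
   + N', DISK = ''C:\folder\backupfile2.bak'''

DECLARE @filelist TABLE
(
   LogicalName nvarchar(128), PhysicalName nvarchar(260), [Type] char(1)
   ,FileGroupName nvarchar(128), Size numeric(20,0), MaxSize numeric(20,0)
   ,FileID bigint, CreateLSN numeric(25,0), DropLSN numeric(25,0)
   ,UniqueID uniqueidentifier, ReadOnlyLSN numeric(25,0), ReadWriteLSN numeric(25,0)
   ,BackupSizeInBytes bigint, SourceBlockSize int, FileGroupID int
   ,LogGroupGUID uniqueidentifier, DifferentialBaseLSN numeric(25,0)
   ,DifferentialBaseGUID uniqueidentifier, IsReadOnly bit, IsPresent bit
   ,TDEThumbprint varbinary(32), SnapshotURL nvarchar(360)
)

INSERT @filelist
EXEC (@sql)

-- the logical names then feed straight into the VERIFYONLY command
SELECT LogicalName, [Type] FROM @filelist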

However, you can’t do exactly the same with RESTORE VERIFYONLY. You can EXEC the SQL string, as above, but if you try and store the results of the EXEC, you get:

Msg 3021, Level 16, State 0, Line 203
Cannot perform a backup or restore operation within a transaction.
Msg 3013, Level 16, State 1, Line 203
VERIFY DATABASE is terminating abnormally.

It looks like most people recommend putting the command in a TRY/CATCH block, something like the following:

BEGIN TRY
  RESTORE VERIFYONLY FROM DISK = @file1, DISK = @file2
    WITH FILE = 1
    , MOVE @Logical1 TO @tmp1
    , MOVE @Logical2 TO @tmp2
END TRY
BEGIN CATCH
   ... error handling here...
END CATCH

As you can see above, you can call RESTORE with @ variables; you don’t have to build a string and EXEC it. I just like to do that wherever possible, so that I can PRINT the strings first and know exactly what commands I’m going to be running.


SQL Server Full-Text Search

If you’re searching within text (string) data in SQL Server, the easiest method is to use the LIKE operator with the wildcard ‘%’ character, e.g.:

SELECT * FROM Customers WHERE FirstName LIKE 'Pete%'

will return every customer with the first name Pete or Peter, plus less common forenames like Peterjohn and Peterson. The search string (‘Pete%’) can contain multiple wildcards, and can be used to look for ranges or sets of characters. LIKE is quite powerful, although sadly not as powerful as a full-featured regular expression parser (something that Unix/Linux scripters will be very familiar with). More importantly, LIKE can’t always use column indexes that you think might be appropriate. In the above query, if FirstName was indexed, then that index would be used, but if the search string was e.g. ‘%illy’ (looking for ‘Milly’, ‘Gilly’, ‘Tilly’ or ‘Billy’), then it wouldn’t.

For a more high-powered approach to searching text within database tables, SQL Server has Full-Text Search. From that link:

Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

I last used Full-Text Search a very long time ago — maybe in the days of SQL Server 7.0! — so I thought it was time to refresh my knowledge.

The dataset

I needed a reasonably large dataset for testing, so I downloaded the Free Company Data Product, which is a “data snapshot containing basic company data of live companies on the register” at Companies House — basically, a CSV of every company currently incorporated here in the UK. The dataset I downloaded, dated 2024-02-07, was 452 MB zipped, 2.5 GB unzipped, and contains 5,499,110 rows and 55 columns.

After downloading and unzipping, I imported the CSV into SQL Server, then transferred it to a table with appropriate datatypes, also doing some tidying like removing NULL columns, and normalising out some categories to lookup tables. My main cleaned-up table is called cd.CompanyData (‘cd’ is the schema).

Not that it matters for what I’m doing here, but the data is pretty ropey in places; for example, the ‘previous name’ date fields unexpectedly switch from yyyy-mm-dd to dd/mm/yyyy. There are lots of duff postcodes, countries can have several slightly different names, and the address data is generally fairly low quality. For an important dataset from an official government source, I expected a lot better! If I was going to use this data in a professional setting, it’d take me a couple of days and a high-quality address cleaner to make it acceptable.

Creating the full-text index

I’m going to assume that you have the Full-Text Search component installed; it was literally a matter of ticking a checkbox on installation of SQL Server 2022 Developer edition. To check if it’s installed on your SQL Server instance, run:

SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled')

‘1’ means yes, ‘0’ means no.

Firstly, we have to create a Full-Text Catalog. Quote: “You have to create a full-text catalog before you can create a full-text index” — it’s merely a logical container for the full-text indexes we’ll create. In practice, it’s one line of SQL:

CREATE FULLTEXT CATALOG myFirstFullTextCatalog AS DEFAULT

Now we can create our first full-text index; we need three things: the table name (cd.CompanyData), the column we want to index (CompanyName) and the name of a unique key on that table (we can use our primary key, PK_cd_CompanyData):

CREATE FULLTEXT INDEX ON cd.CompanyData(CompanyName)
   KEY INDEX PK_cd_CompanyData
   WITH STOPLIST = SYSTEM

See CREATE FULLTEXT INDEX for more details. Note that we didn’t have to specify the catalog, as we’ve already declared a default. (Also, it doesn’t have to be a table we’re indexing, we can also add full-text indexes to indexed views.)

What does ‘STOPLIST = SYSTEM’ mean? The STOPLIST is a list of words that aren’t indexed, words like ‘and’, ‘but’, ‘in’, ‘is’. To see the SYSTEM list we’ve used here:

SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 1033

For me, it returns 154 rows: a list of words and some single characters. (NB: 1033 refers to English.) You can curate your own STOPLISTs: Configure and Manage Stopwords and Stoplists for Full-Text Search
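
As a quick illustration, here’s how you might create and use your own stoplist (the name is made up; for company-name data, stop-listing words like ‘limited’ that appear in almost every name could be worthwhile):

-- start from a copy of the system stoplist, then add our own words
CREATE FULLTEXT STOPLIST CompanyNameStoplist FROM SYSTEM STOPLIST
ALTER FULLTEXT STOPLIST CompanyNameStoplist ADD 'limited' LANGUAGE 1033
ALTER FULLTEXT STOPLIST CompanyNameStoplist ADD 'ltd' LANGUAGE 1033

-- point the full-text index at the new stoplist (this kicks off a repopulation)
ALTER FULLTEXT INDEX ON cd.CompanyData SET STOPLIST CompanyNameStoplist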

IMPORTANT: You can only add one full-text index to a table (or indexed view), but you can have multiple columns per index.

Using the index

You’ll notice that the CREATE FULLTEXT INDEX statement returns immediately; that doesn’t mean the index has been created that quickly, just that the indexing process has started. To keep an eye on the progress, I use the code from this stackoverflow answer. (For the index above, it took a couple of minutes on my old and severely under-powered Windows 10 PC; for multiple columns, it took a bit longer.)

There are a number of ways to query the index: see Query with Full-Text Search, and Full-Text Search and Semantic Search Functions. Here, we’ll use the table function FREETEXTTABLE.

Let’s query the data, looking for all companies with the word ‘Tonkin’ in the name, ordered by newest incorporation first:

;WITH cte_Search AS
(
	SELECT
			ft.[KEY]
			,ft.[RANK]
		FROM FREETEXTTABLE(cd.CompanyData, CompanyName, 'Tonkin') ft
)
SELECT
		c.CompanyNumber
		,c.CompanyName
		,c.IncorporationDate
		,ft.[RANK] 
	FROM cte_Search ft
	JOIN cd.CompanyData c
		ON c.CompanyNumber = ft.[KEY]
	ORDER BY c.IncorporationDate DESC

There are 30 results, first 16 shown here:

A recordset of companies with TONKIN in the name

The results are returned within a fraction of a second; the equivalent query using LIKE took nearly 10 seconds.

Aside: Why have I written the query using a CTE, instead of a simple JOIN? When researching this post, I came across this from Brent Ozar: Why Full Text’s CONTAINS Queries Are So Slow

Apart from the increase in speed, the full-text index includes related words, e.g. if I change the query to search for ‘machine’, I get 2153 results where the company names contain ‘MACHINE’, ‘MACHINES’ and ‘MACHINING’. If I search for ‘move’, I get 2497 results where the company names contain ‘MOVE’, ‘MOVING’, ‘MOVED’, ‘MOVES’ and ‘MOVERS’.

But what’s going on with the RANK column? Excellent question: why are they all the same number? It’s to do with the query string we’re passing to FREETEXTTABLE; if it were more complicated, e.g. ‘move OR clean’, then I’d get eight different values for RANK: 9, 13, 17, 24, 30, 40, 41, 43. This page, Limit search results with RANK, has some interesting tidbits.
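
For example, to see the spread of RANK values from the compound search, the only change needed is the search string (plus a GROUP BY):

SELECT
      ft.[RANK]
      ,NumCompanies = COUNT(1)
   FROM FREETEXTTABLE(cd.CompanyData, CompanyName, 'move OR clean') ft
   GROUP BY ft.[RANK]
   ORDER BY ft.[RANK] DESC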

Loads more to do…

I’ve only lightly scratched the surface here; I know from experience that Full-Text Search is something to seriously consider for a production system, but I’d like a few days of playing around with real data, and real use-cases, before committing. If you’ve an on-going need to locate pieces of text in a large table (e.g. customer search functionality), there are a few things that can be done without installing Full-Text Search:

  • Ensuring current indexes are adequate, and actively being used by the SQL engine
  • Indexing the first few characters of a string
  • Splitting out words or other parts of strings into a separate table and indexing that
  • Caching the most popular searches
  • Storing or replicating the data to a Memory-Optimized Table
  • Hashing words, pairs/triples of words, or full phrases, and indexing the hashes (see the sketch after this list)
  • Enforcing other criteria to cut-down the search space, e.g. live customer? Applied this year?

and I’m sure I can come up with a few more, given time!
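
To give a flavour of the hashing idea, here’s a rough sketch; all the object names are made up, and a real implementation would need to think about collisions and text normalisation:

-- add a hash of the (upper-cased) name as a persisted computed column, and index it
ALTER TABLE cd.CompanyData
	ADD CompanyNameHash AS CHECKSUM(UPPER(CompanyName)) PERSISTED
GO
CREATE INDEX IX_cd_CompanyData_NameHash ON cd.CompanyData(CompanyNameHash)
GO
-- exact-phrase lookup: seek on the narrow hash index, then confirm against the real text
DECLARE @search NVARCHAR(200) = N'TONKIN LIMITED'

SELECT CompanyNumber, CompanyName
	FROM cd.CompanyData
	WHERE CompanyNameHash = CHECKSUM(UPPER(@search))
	AND CompanyName = @search

The hash index is tiny compared to an index on the full name, and the final equality check weeds out any hash collisions.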


Inverting a matrix in SQL

As long ago as yesterday, I wrote:

I expect someone somewhere has converted pre-existing Fortran or C/C++ code into pure T-SQL. (The algorithm is fairly straightforward, so I doubt it’s that hard to do the conversion.)

Since writing those words, my brain has done nothing but nag at me, so guess what I did this morning?

NB: T-SQL is Microsoft’s own version of SQL — they’re close, but not exactly the same thing. I’ve left the title of the post as ‘SQL’ not ‘T-SQL’, because it should be straightforward to convert the guts of the code to generic SQL, if needed.

Picking an algorithm

No need to come up with my own matrix inversion algorithm, there are plenty to choose from; a google search led me to An Efficient and Simple Algorithm for Matrix Inversion by Farooq and Hamid, from 2010. Here’s their algorithm, written in C, from their paper:

C code to invert a matrix

It looks beautifully simple, and therefore should be pretty easy to convert to SQL!

Conversion to SQL

Firstly, it’d be helpful to have a ‘matrix’ SQL type:

CREATE TYPE MatrixType AS TABLE
(
	[row] INT NOT NULL
	,[col] INT NOT NULL
	,[value] FLOAT NOT NULL
	,PRIMARY KEY CLUSTERED ([row],[col])
)
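
So a matrix is just a set of (row, column, value) cells. For example, the 2×2 identity matrix is stored like this (every cell present, zeros included, which is how the test code further down populates its matrices):

DECLARE @identity2 AS MatrixType

INSERT @identity2([row],[col],[value])
	VALUES (1, 1, 1.0), (1, 2, 0.0)
		,(2, 1, 0.0), (2, 2, 1.0)

SELECT [row],[col],[value] FROM @identity2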

Using this type for my input parameter, here’s my conversion of Farooq and Hamid’s code (note that I’m using a function, not a stored procedure):

CREATE FUNCTION dbo.fn_InvertMatrix
(	
	@matrix MatrixType READONLY
)
RETURNS @result TABLE 
(
	[row] INT NOT NULL
	,[col] INT NOT NULL
	,[value] FLOAT NOT NULL
	,PRIMARY KEY CLUSTERED ([row],[col])
)
AS
BEGIN

  DECLARE @wkgMatrix MatrixType
  
  INSERT @wkgMatrix([row],[col],[value])
    SELECT [row],[col], [value]
    FROM @matrix

  -- assumes a square matrix: its size is the highest row number present
  DECLARE @size INT = (SELECT MAX([row]) FROM @wkgMatrix)
  
  DECLARE @det FLOAT, @pivot FLOAT
      ,@p INT, @i INT, @j INT
      ,@mat_pj FLOAT, @mat_ip FLOAT

  SET @det = 1.0
  SET @p = 1

  WHILE(@p <= @size)
  BEGIN

    -- the pivot for this step is the diagonal element (p, p)
    SELECT @pivot = [value]
      FROM @wkgMatrix
      WHERE [row] = @p AND [col] = @p

    -- keep a running determinant; if it collapses to (near) zero,
    -- treat the matrix as singular and return an empty result
    SET @det = @det * @pivot

    IF (ABS(@det) < 1e-5)
    BEGIN
      RETURN
    END

    -- negate column p and divide it by the pivot
    UPDATE @wkgMatrix SET [value] = -[value]/@pivot WHERE [col] = @p

    SET @i = 1
    WHILE (@i <= @size)
    BEGIN
      
      IF (@i <> @p)
      BEGIN

        -- sweep every element outside row p and column p:
        -- a(i,j) = a(i,j) + a(p,j) * a(i,p)
        SELECT @mat_ip = [value] FROM @wkgMatrix WHERE [row] = @i AND [col] = @p

        SET @j = 1
        WHILE (@j <= @size)
        BEGIN
          IF (@j <> @p)
          BEGIN
            SELECT @mat_pj = [value] FROM @wkgMatrix WHERE [row] = @p AND [col] = @j
            UPDATE @wkgMatrix SET [value] = [value] + (@mat_pj * @mat_ip)
                WHERE [row] = @i AND [col] = @j
          END
          SET @j = @j + 1
        END
      END

      SET @i = @i + 1

    END
    
    -- divide row p by the pivot
    UPDATE @wkgMatrix SET [value] = [value] / @pivot WHERE [row] = @p

    -- finally, replace the pivot element itself with its reciprocal
    UPDATE @wkgMatrix SET [value] = 1.0/@pivot WHERE [row] = @p AND [col] = @p

    SET @p = @p + 1
  END

  INSERT @result([row],[col],[value])
    SELECT [row],[col], [value]
      FROM @wkgMatrix

  RETURN
END

Notes about this conversion:

  • I’ve written the code as a function; it should be trivial to convert to a sproc if needed.
  • I’ve removed a couple of loops from the original version; I think someone with time on their hands could make the code even smaller, maybe using a recursive CTE?
  • There’s very little error-checking; e.g. I should at least check that the matrix is square. It’s not production-ready code: if it were, there’d be a proper header and much fuller commenting.

So, does the code actually work? How can we test it? Let’s generate a random square matrix (from 2×2 up to 9×9), invert it, then check that when multiplied together, we get the identity matrix — then let’s do this thousands of times.

We’ll need a helpful SQL function to multiply two matrices together:

CREATE FUNCTION dbo.fn_MatrixMultiply
(	
	@m1 MatrixType READONLY
	,@m2 MatrixType READONLY
)
RETURNS TABLE 
AS
RETURN 
(
  SELECT
      m1.[row]
      ,m2.[col]
      ,[value] = SUM(m1.[value] * m2.[value])
    FROM @m1 m1
    JOIN @m2 m2
      ON m2.[row] = m1.[col]
    GROUP BY m1.[row], m2.[col]
)
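
Before the bulk test, a small hand-checkable example (the 2×2 values are my own): the inverse of [[4, 7], [2, 6]] is [[0.6, -0.7], [-0.2, 0.4]], and multiplying the two together should give back the identity matrix.

DECLARE @m AS MatrixType
DECLARE @inv AS MatrixType

-- the matrix [[4, 7], [2, 6]]; its determinant is 10, so it's invertible
INSERT @m([row],[col],[value])
	VALUES (1, 1, 4.0), (1, 2, 7.0)
		,(2, 1, 2.0), (2, 2, 6.0)

INSERT @inv([row],[col],[value])
	SELECT [row],[col],[value] FROM dbo.fn_InvertMatrix(@m)

-- expect 0.6, -0.7, -0.2, 0.4
SELECT [row],[col],[value] FROM @inv ORDER BY [row],[col]

-- expect the 2x2 identity matrix (after rounding away floating-point noise)
SELECT [row],[col],[value] = ROUND([value], 7)
	FROM dbo.fn_MatrixMultiply(@m, @inv)
	ORDER BY [row],[col]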

And we’re ready — let’s do a few thousand iterations, here’s my testing code:

DROP TABLE IF EXISTS #results
GO

CREATE TABLE #results
(
  id INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
  ,size INT NOT NULL
  ,[diag] FLOAT NOT NULL
  ,[offdiag] FLOAT NOT NULL
)
GO

SET NOCOUNT ON 

DECLARE @Z INT = 1

DECLARE @m1 AS MatrixType
DECLARE @m2 AS MatrixType
DECLARE @mr AS MatrixType
DECLARE @size AS INT

WHILE(@Z <= 3000)
BEGIN

  SET @size =  2 + ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) % 8 -- random from 2 to 9

  DELETE FROM @m1
  INSERT @m1([row],[col],[value])
    SELECT x.r AS [row], y.c AS [col]
           ,[value] = CAST(CAST(CAST(NEWID() AS VARBINARY) AS INT) AS FLOAT)/1000000000.0
      FROM (
        VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9)
      ) x(r)
      CROSS JOIN (
        VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9)
      ) y(c)
      WHERE x.r <= @size
      AND y.c <= @size

  DELETE FROM @m2
  INSERT @m2([row],[col],[value])
    SELECT [row],[col],[value] FROM dbo.fn_InvertMatrix(@m1)

  DELETE FROM @mr
  INSERT @mr([row],[col],[value])
    SELECT [row],[col],[value] = ROUND([value],7)
      FROM dbo.fn_MatrixMultiply(@m1, @m2)

  INSERT #results(size,diag,offdiag)
  SELECT 
      size    = @size 
      ,diag    = (SELECT SUM(CASE WHEN [value] = 1.0 THEN 1 ELSE 0 END)
                    FROM @mr WHERE [row] = [col])
      ,offdiag  = (SELECT SUM([value]) FROM @mr WHERE [row] <> [col])

  SET @Z = @Z + 1

END
GO

If everything has worked as expected then, in every case, the number of diagonal elements equal to 1 will match the size of the matrix, and the sum of the off-diagonal elements will be zero:

SELECT
	N		= COUNT(1)
	,diag_ok	= SUM(CASE WHEN size = diag THEN 1 ELSE 0 END)
	,offdiag_ok	= SUM(CASE WHEN offdiag = 0 THEN 1 ELSE 0 END)
FROM #results
GO
N           diag_ok     offdiag_ok
----------- ----------- -----------
3000        3000        3000

All good, looks like the code has worked perfectly!


Finally

Again, I’ve done nothing clever here, just piggy-backed off someone else’s good work. But now I’ve got my own ‘native’ SQL function to play with, no more messing about with SQLCLRs — quite the relief!

A few final points:

  • Although I think the code works fine, I need to extensively compare the results with my SQLCLR function. For instance, my random matrices didn’t have any zeros in them, and I restricted the range of the cell values; what happens with large values? Or a mixture of very big and very small values? What about much larger matrices?
  • How fast is the code? It doesn’t matter for my use-case, but it might for other people’s. Maybe the SQLCLR function is much faster.
  • Not all matrices are invertible, why didn’t I get any errors? This is answered here: Probability question: what are the odds that a random nxn matrix is invertible?
  • Often, the determinant of the matrix is useful; I should work out a way to return that.
  • Before converting to SQL, I converted the code to R, and now I really want to make an ‘arbitrary precision’ version of it, using the gmp library…
  • It feels like I should start my own library of SQL functions to work with matrices — one day!
