
Determining SQL datatypes from wide CSV data

Thirty-second summary

When importing a CSV file, SQL Server’s Import/Export wizard isn’t great at working out the correct column datatypes for the new table — which is especially painful when the CSV is wide (has many columns) — so I wrote some SQL code to get round the issue.

 

Cartoon of a man looking up at a huge spreadsheet

Everyone working in BI / data science has to move data around between apps all the time, and it’s a constant surprise to me how this is still somewhat of a pain. In particular, getting CSV data into a SQL Server table with the datatypes you require is a very hit-and-miss affair. The GUI tools SSMS provides you with aren’t really up to the task, especially when the CSV files are very wide.

In this post, I’ll demonstrate how I tackled this problem.

 

Motivation

Warning: This section is quite waffle-y, so please feel free to skip it

In 2018, the consumer lender Home Credit ran a competition on the data science community site Kaggle, to see who could come up with the best predictive model for a given data set: “The objective of this competition is to use historical loan application data to predict whether or not an applicant will be able to repay a loan.” All the details are given here.

The competition closed 3 months later with a whopping 7,180 entries! The winning team, comprising 6 senior data scientists, won $35,000. In a discussion of their winning code, they wrote: “… for this kind of competition and this kind of problem, feature engineering and feature selection are still the most important step.”

Feature engineering means manipulating the data you have to generate more predictive variables, and it’s hands down my favourite part of building models — it’s basically playing with data, using as much creativity as you can muster. Some examples from my own work:

  • Parsing business names and addresses to tease out company-level data that didn’t otherwise exist
  • Sub-models of ‘customer enquiry’ data to add a much richer ‘search score’ (how ‘urgent’ the customer’s requirement for the money is)
  • Combining socio-demographic statuses to paint clearer pictures of applicants
  • Augmenting residential statuses with electoral roll data to better explain a customer’s living situation
  • Generating huge numbers of ratios concerned with debt, income, expenses, and all aspects of affordability

It’s then straightforward to weed out those variables that aren’t predictive (part of feature selection).

Having some time on my hands, I decided to play with this dataset, and see how close I could get to the performance of the winning teams. I’m not completely daft — I’m not expecting much: some (all?) of the winning team are PhD ‘Heads of Data Science’ at huge companies. Still, it seems like a fun thing to try. Stay tuned for a future post on my efforts!

Cartoon of a man cuddling a monitor with SQL displayed on it

SQL Is Best

Now, because I’ve used it, day in and day out, for so many years, I’m accustomed to getting to grips with my data, trying to really understand it, in SQL Server — not Python, or even R. So, I want to get the data into SQL Server first. And here’s the stumbling block: we have multiple CSV files, some of them with over a hundred columns.

Anyone who uses SSMS/SQL Server knows that the Import Wizard is… let’s be polite, ‘lacking in features’. Or, less politely, bloody annoying. For me, a far more flexible and fruitful approach is to import all the data as text ([N]VARCHAR), and do all our transformations using pure SQL.

The first CSV I’m looking at (application_train.csv from the competition dataset) has 122 columns, and while I could work out the datatypes manually, it’s more fun to write code to do it for me!

 

My solution

Let’s get the data into SQL Server as text.

Getting the raw data into SQL Server: BULK INSERT

I copied/pivoted the column names from the first row of the CSV, and created a SQL table app_train_raw that consisted of one integer column (the unique customer ID), and 121 VARCHAR(50) columns. I then inserted the data using BULK INSERT:

BULK INSERT dbo.app_train_raw
FROM '...\application_train.csv'
WITH (
   FORMAT = 'CSV'
   ,FIRSTROW = 2 -- i.e. skip header row
   ,ROWTERMINATOR = '0x0a' -- character 10, Line Feed
)

It took a few seconds to insert the 307,511 rows, a total of 37,516,342 data points.
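
For reference, the staging table itself is as dumb as it sounds. Here’s a sketch of its shape (only a handful of the 121 VARCHAR(50) columns are shown; they’re all declared the same way):

-- Sketch of the raw staging table: everything except the ID comes in as text.
CREATE TABLE dbo.app_train_raw
(
    SK_ID_CURR          INT NOT NULL PRIMARY KEY   -- the unique customer ID
    ,NAME_CONTRACT_TYPE VARCHAR(50) NULL
    ,FLAG_OWN_CAR       VARCHAR(50) NULL
    ,AMT_INCOME_TOTAL   VARCHAR(50) NULL
    -- ... and so on: one VARCHAR(50) column per remaining CSV column, 121 in total
);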

Warning: BULK INSERT doesn’t give good error messages. I kept getting an error about ‘conversion error (truncation)’, so I spent ages messing with text encodings, trying to find rogue characters in the CSV — when in fact, I’d miscounted the number of columns!

The code

I’m not going to present the entirety of my code, because (a) this post would be too long, and (b) it’s just ‘glue’ code. I maintain a config table with one row per column from the original CSV, recording whether I’ve already determined the data type or created a lookup. I’m also doing this iteratively: identifying the types of columns as I go along, then re-running sections of code, ignoring those with known types.
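
Simplified quite a lot, the config table looks something like this (the exact column names here are illustrative):

-- Simplified sketch of the config table: one row per CSV column, plus flags
-- recording what has been decided about it so far.
CREATE TABLE dbo.app_train_config
(
    ColumnID        INT IDENTITY(1,1) PRIMARY KEY
    ,ColumnName     SYSNAME NOT NULL         -- column name as it appears in the CSV header
    ,DataTypeKnown  BIT NOT NULL DEFAULT 0   -- have we settled on a type yet?
    ,DataType       VARCHAR(50) NULL         -- e.g. 'DECIMAL(12,2)', 'INT', 'BIT'
    ,IsLookup       BIT NOT NULL DEFAULT 0   -- has this column been normalised out to a lookup table?
    ,LookupTable    SYSNAME NULL             -- name of that lookup table, if so
);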

From here on, I’ll post only the most relevant bits.

Initial categorisation

In my post Aggregating all the categorical data in a table, there’s code I wrote to, well, aggregate all the categorical data from a table (in order to calculate the Population Stability Index for each variable in the models).

We can use the exact same code here, and run it over all 121 VARCHAR(50) columns. You might think it would take a long time to GROUP all these columns, but it took just 19 seconds on my old PC. We’re left with the global temp table ##Results, containing 388,218 rows: each row is a distinct variable/value pair; we’re not interested in duplicate values.
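
I won’t repeat that code here, but conceptually it boils down to unpivoting every column into variable/value pairs and keeping only the distinct ones. A sketch of the idea, hard-coded here for just two columns:

-- Conceptual sketch only: unpivot each text column into (var, value) pairs,
-- one row per distinct pair.
SELECT DISTINCT v.[var], v.[value]
INTO ##Results
FROM dbo.app_train_raw t
CROSS APPLY (VALUES
     ('NAME_CONTRACT_TYPE', t.NAME_CONTRACT_TYPE)
    ,('AMT_INCOME_TOTAL',   t.AMT_INCOME_TOTAL)
    -- ... one entry per VARCHAR column ...
) v([var], [value]);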

Analysing the categories

The first thing we can do is look for obvious candidates for columns that we can ‘normalise out’ as lookup tables:

SELECT [var], NumValues = COUNT(1)
FROM ##Results
GROUP BY [var]
HAVING COUNT(1) < 30 -- few enough distinct values to be worth a lookup table

This led me to 13 columns (mostly ending _STATUS, _TYPE) that I turned into lookup tables. This didn’t take long, and could easily be automated.
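
To show what I mean by ‘turned into lookup tables’, here’s a sketch for one of them (I’m using NAME_CONTRACT_TYPE as the example; the same pattern applies to the rest):

-- Sketch: build a lookup table from the distinct values of one categorical column,
-- with a surrogate key that the main table can later reference.
SELECT
    ID       = CAST(ROW_NUMBER() OVER (ORDER BY x.[value]) AS INT)
    ,[value] = x.[value]
INTO dbo.lkp_NAME_CONTRACT_TYPE
FROM (
    SELECT DISTINCT [value]
    FROM ##Results
    WHERE [var] = 'NAME_CONTRACT_TYPE'
      AND [value] IS NOT NULL
) x;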

Next, let’s gather some information about the remaining columns (I’d already removed the 13 lookup columns from ##Results):

SELECT 
 [var]
 ,HasNULL     = SIGN(SUM(IIF([value] IS NULL, 1, 0)))
 ,NumCats     = COUNT(1)
 ,NumNumeric  = SUM(ISNUMERIC([value]))
 ,MinVal      = MIN([value])
 ,MaxVal      = MAX([value])
 ,MinLen      = MIN(LEN([value]))
 ,MaxLen      = MAX(LEN([value]))
 ,NumWithDP   = SUM(CASE WHEN CHARINDEX('.',[value]) > 0 THEN 1 ELSE 0 END)
 ,NumWithSN   = SUM(CASE WHEN PATINDEX('%[0-9]e[+-][0-9]%',[value]) > 0 THEN 1 ELSE 0 END)
 ,NumWithNeg  = SUM(CASE WHEN LEFT([value],1) = '-' THEN 1 ELSE 0 END)
FROM ##Results
GROUP BY [var]

The code is fairly self-explanatory: var is the column of interest, and for each one we’re counting the distinct values that are numeric, contain a period (decimal point), contain scientific notation (‘Xe-Y’), or start with a minus sign. We also want to know the min/max values as text (for identifying 0/1 or N/Y columns), and the min/max length of the values (for deciding between FLOAT and DECIMAL).

Using this result set, we can say (see the code sketch after this list):

  • If there are only 2 categories (or 3 including NULL), and the min/max values are 0/1 or N/Y, then we’ll treat this column as a BIT
  • If there are no non-numeric values, and any string either contains scientific notation, or contains a decimal point and is more than 18 characters long, we’ll treat this as a FLOAT
  • If there are no non-numeric values, and no decimal points, then we’ve either got a BIGINT, INT, SMALLINT or TINYINT (not forgetting that TINYINTs can’t be negative, so we check that the first character isn’t a minus sign)
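
Expressed as a rough CASE expression over those stats (assuming the output of the query above has been saved into a temp table, which I’ll call ##VarStats here), the rules look something like this sketch:

-- Rough sketch of the rules above; not claiming it catches every edge case.
-- The integer boundaries are conservative, length-based checks; you could
-- CAST the values and compare against the exact type ranges instead.
SELECT
    s.[var]
    ,SuggestedType =
        CASE
            -- 2 categories (ignoring NULL) with min/max of 0/1 or N/Y -> BIT
            WHEN s.NumCats - s.HasNULL = 2
                 AND (   (s.MinVal = '0' AND s.MaxVal = '1')
                      OR (s.MinVal = 'N' AND s.MaxVal = 'Y'))
                THEN 'BIT'
            -- all values numeric, with scientific notation or long decimals -> FLOAT
            -- (an aggregate-level approximation of the per-value rule above)
            WHEN s.NumNumeric = s.NumCats - s.HasNULL
                 AND (s.NumWithSN > 0 OR (s.NumWithDP > 0 AND s.MaxLen > 18))
                THEN 'FLOAT'
            -- all values numeric, no decimal points -> some flavour of integer
            WHEN s.NumNumeric = s.NumCats - s.HasNULL
                 AND s.NumWithDP = 0 AND s.NumWithSN = 0
                THEN CASE
                        WHEN s.MaxLen <= 2 AND s.NumWithNeg = 0 THEN 'TINYINT'
                        WHEN s.MaxLen <= 4 THEN 'SMALLINT'
                        WHEN s.MaxLen <= 9 THEN 'INT'
                        ELSE 'BIGINT'
                     END
            ELSE 'unknown'   -- needs further investigation
        END
FROM ##VarStats s;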

This takes care of over half the columns.

Numeric types

We’ve already determined some of the numeric types, but we need to investigate further. Mainly, we’re concerned with identifying the correct precision/scale parameters for DECIMAL types. Note that the first CTE only considers numeric columns we haven’t identified the correct types for yet.

;WITH cte_UnknownOnly AS
(
SELECT
   r.[var]
   ,floatval = CAST(r.[value] AS FLOAT)
   ,lp       = LEFT(r.[value], CHARINDEX('.', r.[value] + '.') - 1)           -- integer part; appending '.' guards against values with no decimal point
   ,rp       = SUBSTRING(r.[value], 1 + CHARINDEX('.', r.[value] + '.'), 255) -- fractional part ('' if there's no decimal point)
 FROM ##Results r
 WHERE r.var IN (... {unknown numeric columns only}... )
)
SELECT
  r.[var]
  ,MinAsFLOAT = MIN(r.floatval)
  ,MaxAsFLOAT = MAX(r.floatval)
  ,MinLenLP   = MIN(LEN(r.lp))
  ,MaxLenLP   = MAX(LEN(r.lp))
  ,MinLenRP   = MIN(LEN(r.rp))
  ,MaxLenRP   = MAX(LEN(r.rp))
  ,NumDistRP  = COUNT(DISTINCT r.rp)
  ,MinINTLP   = MIN(CAST(r.lp AS INT))
  ,MaxINTLP   = MAX(CAST(r.lp AS INT))
  ,MinINTRP   = MIN(CAST(r.rp AS INT))
  ,MaxINTRP   = MAX(CAST(r.rp AS INT))
FROM cte_UnknownOnly r
GROUP BY r.[var]

where LP refers to the piece on the left of the decimal point (the integer part), and RP is the piece on the right (the fractional part).

From this result set, it’s easy to work out the correct parameters for DECIMALs using MaxLenLP and MaxLenRP. NumDistRP is useful for deciding whether to simplify some types, e.g. salaries are usually given as whole numbers, but a few may contain fractional parts due to calculations, and you may wish to round the figures for simplicity.
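
For example, if MaxLenLP is 7 and MaxLenRP is 2, the values fit comfortably in a DECIMAL(9,2): the precision is (at most) MaxLenLP + MaxLenRP, and the scale is MaxLenRP. A quick sketch of generating the type names, assuming the output of the query above has been saved into a temp table (##NumericStats is my name here, not something from the code above):

-- Sketch: derive a DECIMAL(precision, scale) string from the LP/RP lengths.
-- If any values are negative, the '-' sign inflates MaxLenLP by one character,
-- so this errs on the generous side.
SELECT
    n.[var]
    ,SuggestedType = CONCAT('DECIMAL(', n.MaxLenLP + n.MaxLenRP, ',', n.MaxLenRP, ')')
FROM ##NumericStats n;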

This step took care of the rest of the columns — we now know the correct datatypes for the whole dataset.
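
The final step is then mechanical: create the typed table, and CAST everything over from the raw table, joining to the lookup tables where relevant. Purely as an illustration, for a handful of columns it might look like this sketch (the column types shown are examples, not necessarily what the data ends up with):

-- Illustrative sketch only, covering a few columns.
CREATE TABLE dbo.app_train
(
    SK_ID_CURR              INT NOT NULL PRIMARY KEY
    ,NAME_CONTRACT_TYPE_ID  INT NULL            -- points at the lookup table
    ,FLAG_OWN_CAR           BIT NULL
    ,AMT_INCOME_TOTAL       DECIMAL(12,2) NULL
    -- ... etc ...
);

INSERT INTO dbo.app_train (SK_ID_CURR, NAME_CONTRACT_TYPE_ID, FLAG_OWN_CAR, AMT_INCOME_TOTAL)
SELECT
    r.SK_ID_CURR
    ,k.ID
    ,CAST(IIF(r.FLAG_OWN_CAR = 'Y', 1, 0) AS BIT)
    ,CAST(r.AMT_INCOME_TOTAL AS DECIMAL(12,2))
FROM dbo.app_train_raw r
LEFT JOIN dbo.lkp_NAME_CONTRACT_TYPE k
    ON k.[value] = r.NAME_CONTRACT_TYPE;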


Wrap-up

I hear you say: “This looks like a lot of work, I can eyeball the column types faster and just use the wizard.” Good luck to you, I say — I’ve spent far too many hours of my life looking at unhelpful error messages from the Import wizard. And what if you’ve got multiple CSVs with over a thousand columns in each? It’d take days to get all the types correct. This code gives me all the control over the datatypes I could want. (Consider also that I generate the INSERT/SELECT statements from the same config table; I’ve omitted the details here to save space.)

Could I ‘product-ise’ this to make it a single piece of code for anyone to run? Sort of, but not really. There are so many edge cases and ‘quirks’ to cater for. For example, the columns prefixed with ‘AMT_REQ_CREDIT_BUREAU’ are presented in the data as having one decimal place, but in reality, they’re all integers (as they’re counts of enquiries).

Note that if the CSV file is too big, you could restrict the categorisation to the first 1000 rows (say). This would probably be ok for well-behaved datasets… but when was the last time you got a well-behaved dataset to work on?
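
If you go down that route, it’s just a case of pointing the categorisation code at a sampled copy of the raw table, along these lines:

-- Sketch: categorise a sample rather than the full table.
-- (TOP without an ORDER BY returns an arbitrary 1000 rows, which is fine here.)
SELECT TOP (1000) *
INTO ##app_train_sample
FROM dbo.app_train_raw;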


Finally, an anecdote: I was once involved in a project to continuously transfer data from one internal system (Microsoft Dynamics) to another (our SQL Server instance) — the ‘official’ Microsoft connector was unreliable, and flaked out several times a week. The company chose a third-party provider whose product, they claimed, could transfer data between hundreds of different systems and did exactly what we needed — or so we thought. It turned out, after we’d built our import layer, that the product transferred the data stripped of all formatting, i.e. we received only strings into the target system, whether the original data had been numbers, dates or text. Much swearing ensued! We couldn’t believe this was correct, and the provider was surprised that we were annoyed (“that’s how everyone does it” — is that true? Let me know if you think so…), so we then had to write a ton more code to convert the strings back to the datatypes we needed [which was where I first realised you could use SQL to work out the types]. The lesson I take away is: always check even your most utterly basic assumptions!
