Archive for September, 2016

Using median rather than mean

Analysts in consumer finance are used to dealing with large sets of incoming customer data, and are often asked to provide summary statistics to show whether application quality is changing over time: Are credit scores going down? What’s the average monthly income? What’s the average age of the customers applying this quarter, compared to last?

Early on in the process, the data is likely to be messy, unsanitised, and potentially chock-full of errors. (Once the data has been processed to the point of considering credit-worthiness, then it’s all clean and perfect, yeah..?) Therefore, you have to be careful when you report on this data, as it’s easy to get nonsense figures that will send your marketing and risk people off in the wrong direction.

Two really common errors seen in numerical application data:

1. Putting gross yearly salary into the net monthly income field. It’s not always easy to spot which one was intended: if the field says ‘£50000’, then it’s very likely to be an annual figure; but what about ‘£7000’? If monthly, that’s a really good wage; but I can assure you, people earning that much are still applying for credit.

2. Dates of birth: if unknown, it’s common to see ‘1st January 1900’ and similar. So when you convert it to age, the customer is over 100.

Also, if you happen to get credit scores with your data, you have to watch out for magic numbers like 9999 – which to Callcredit means “no [credit] score could be generated”, not that the customer has the credit-worthiness of Bill Gates or George Soros.
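
One cheap defence when reporting is to neutralise known magic numbers up front. Here’s a minimal sketch (the table and column names are hypothetical), relying on the fact that AVG ignores NULLs:

-- Hypothetical table/column; 9999 is the bureau's 'no score' marker:
SELECT AvgScore = AVG(NULLIF(CreditScore, 9999))  -- NULLIF turns 9999 into NULL, which AVG skips
  FROM dbo.Application
GO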

Hence it’s fairly obvious that if you’re including these figures in mean averages, you’re going to give a misleading impression, and people can infer the wrong thing. For example, say you’ve 99 applications with an average monthly income of £2000, but there’s also an incorrect application with a figure of £50,000. If you report the mean, you’ll get an answer of £2480, instead of the correct £2010 (assuming that £50k salary translates to ~£3k take-home per month). However, if you report the median, you’ll get an answer of £2000, whether the incorrect data is in there or not.
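
To make that concrete, here’s a minimal sketch in T-SQL (assuming SQL Server 2012 or later, which provides PERCENTILE_CONT for the median):

DECLARE @income TABLE (MonthlyIncome DECIMAL(10,2))

-- 99 genuine applications at £2000 per month:
INSERT INTO @income (MonthlyIncome)
  SELECT TOP (99) 2000 FROM sys.all_objects

-- ...plus one yearly salary entered in the monthly field by mistake:
INSERT INTO @income (MonthlyIncome) VALUES (50000)

SELECT DISTINCT
    MeanIncome = AVG(MonthlyIncome) OVER ()  -- 2480: dragged up by the outlier
    ,MedianIncome = PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY MonthlyIncome) OVER ()  -- 2000: unaffected
  FROM @income
GO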

In statistical parlance, the median is “a robust measure of central tendency”, whereas the mean is not. The median isn’t affected by a few outliers (at either end).


End note: Credit scores can be (roughly) normally distributed; for a normal distribution, the median and mean (and mode) are the same. But data doesn’t have to be normally distributed: e.g. call-waiting times follow the exponential distribution, where the median (ln 2 / λ, about 0.69/λ) is noticeably smaller than the mean (1/λ).


XSDs FTW

I’m a big fan of passing data to and from stored procedures (sprocs) as XML, especially XML that represents a complete object, or a list or hierarchy of objects. For a start, XML is perfectly human-readable (if you’re doing it right), and nearly every system and language knows how to work with it, SQL Server / T-SQL included. What makes it even better is being able to validate the XML before you even begin to parse it, using an XSD (XML Schema Definition).

Here’s a complete example you can copy and run:

USE tempdb
GO

-- If the XSD already exists, drop it:

IF EXISTS (
  SELECT xsc.name 
    FROM sys.xml_schema_collections xsc
    WHERE xsc.name='TestSchema'
)
BEGIN
  DROP XML SCHEMA COLLECTION TestSchema
END
GO

-- Create the schema:

CREATE XML SCHEMA COLLECTION TestSchema AS '
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:simpleType name="ST_EmailAddress">
      <xsd:restriction base="xsd:string">
      <xsd:pattern value="[^@]*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9}"/>
      </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="ST_Usage">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="home"/>
      <xsd:enumeration value="work"/>
      <xsd:enumeration value="other"/>
    </xsd:restriction>  
  </xsd:simpleType>

  <xsd:complexType name="CT_EmailAndUsage">
    <xsd:simpleContent>
      <xsd:extension base="ST_EmailAddress">
        <xsd:attribute name="usage" use="required" type="ST_Usage" />
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:element name="emailList">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="email" type="CT_EmailAndUsage" minOccurs="1" maxOccurs="3" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>'
GO

-- Make some dummy data that conforms to the schema above:

DECLARE @testXML AS XML(TestSchema)

SET @testXML = '
<emailList>
  <email usage="home">pete@home.com</email>
  <email usage="work">pete@work.com</email>
  <email usage="other">pete@other.com</email>
</emailList>
'

-- Query it:

SELECT
    id = ROW_NUMBER() OVER (ORDER BY e.i)
    ,EmailAddress = e.i.value('(.)[1]','VARCHAR(255)')
    ,Usage = e.i.value('(@usage)[1]', 'VARCHAR(20)')
  FROM @testXML.nodes('//email') AS e(i)
GO

The result set is:


id   EmailAddress    Usage
---  --------------  ------
1    pete@home.com   home
2    pete@work.com   work
3    pete@other.com  other

(3 row(s) affected)

Now, try messing around with the contents of the @testXML variable, e.g.:

  1. Set usage to a string that’s not ‘home’, ‘work’ or ‘other’
  2. Add a fourth email address
  3. Take the ‘@’ symbol out of an email address
  4. Put in some extra nodes that don’t belong

Then re-run the code: each change makes the assignment fail, because the XML has to conform to the XSD we’ve defined as TestSchema. So, SQL Server automatically rejects any input that fails data validation (e.g. the format of an email address) or breaks business logic (‘no more than three emails’); if the XML was being passed to a sproc, the call would fail, and no code inside would ever run.
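
For example, adding a fourth email address (item 2 in the list) breaks the maxOccurs="3" constraint:

DECLARE @badXML AS XML(TestSchema)

-- Four email nodes, but the schema allows at most three, so this SET
-- fails with an XML Validation error (the exact message text varies,
-- but it points at the offending node), and @badXML is never populated:
SET @badXML = '
<emailList>
  <email usage="home">pete@home.com</email>
  <email usage="work">pete@work.com</email>
  <email usage="other">pete@other.com</email>
  <email usage="home">pete2@home.com</email>
</emailList>
'
GO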

Obviously, you may not want to reject ‘broken’ XML outright; you’ll probably want to record the failure somewhere. That’s fine – your code (sproc) can accept schema-less XML and attempt the cast itself; if the cast fails, you can respond however you like.
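
A minimal sketch of that pattern (the sproc name here is hypothetical):

CREATE PROCEDURE dbo.SaveEmailList
  @inputXML XML  -- untyped, so the call itself always succeeds
AS
BEGIN
  DECLARE @typedXML AS XML(TestSchema)

  BEGIN TRY
    SET @typedXML = @inputXML  -- the cast is where the XSD validation happens
  END TRY
  BEGIN CATCH
    -- Record the failure however you like, e.g.:
    SELECT ValidationError = ERROR_MESSAGE()
    RETURN
  END CATCH

  -- From here on, @typedXML is guaranteed to conform to TestSchema.
END
GO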

There’s certainly an overhead in learning the language of XSDs, but because it’s a generic technology, there are plenty of online resources, e.g. w3schools. When it comes to transferring complex objects around as data, I don’t know of a better way than using XML and XSD.

Note

Because Microsoft haven’t got round to coding it yet, you can’t query the text value of a node that’s been defined as a type in XSD. That is, I’d ordinarily like to be able to query the email address itself like this:

EmailAddress = e.i.value('(./text())[1]','VARCHAR(255)')

, because directly accessing the node text is faster (presumably because it doesn’t have to do any more parsing). But sadly, it’ll just fail with an error. However, this is unlikely to cause practical problems; it’s just a mild annoyance that’s vastly outweighed by the benefits that come from validated XML.


Converting JSON to XML : A Gateway to Cygwin

Of the files I have to deal with on a weekly basis, I’d put the breakdown at 50% Excel, 40% CSV, and 10% XML. This is fine; I can reliably transfer data from those files into SQL Server without too many tears. However, today I was presented with JSON-formatted versions of files that I’d normally get as XML, and I haven’t had to deal with JSON since I last wrote PHP/AJAX code, about five years ago.

Now, SQL Server 2016 can natively read/write JSON code (see, for example, this MSDN blog), but I use SQL Server 2014, which knows nothing about JSON.
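
For reference, the 2016+ syntax is pleasingly direct; here’s a minimal sketch with made-up JSON (note that OPENJSON also needs the database compatibility level at 130 or above):

DECLARE @json NVARCHAR(MAX) = N'[
  {"name":"Pete", "score":742},
  {"name":"Dave", "score":9999}
]'

-- OPENJSON shreds the array into rows; the WITH clause defines the columns:
SELECT name, score
  FROM OPENJSON(@json)
  WITH (
    name NVARCHAR(50)
    ,score INT
  )
GO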

Of course, I googled for JSON to XML converters. There are many, mostly in the form of libraries for other systems, and even a few online converters that would do the job ‘in-browser’. Unfortunately, the data I needed to convert was credit file data, and that data is sensitive. I can’t just go pasting it into unknown websites without completely understanding what’s going to happen to it – if there’s the slightest chance my data could get uploaded and saved elsewhere, I can’t use that site. I did find an online site that did the conversion purely in javascript (no POSTs back to the server), so I copied the code locally, pasted my JSON in… and it crashed the browser (Chrome). Turns out 80KB of JSON was too much for this javascript, and in fact, a couple of the standalone converters I tried also had trouble with this (small) amount of code.

There was even a pure T-SQL converter (written as a function) that I tried, but unfortunately, that didn’t work out either. Which is a shame, as a SQL-based solution appeals to me greatly!

To cut a dull story short, here’s how I did it: perl. Thanks to the third most popular answer to this stackoverflow question, the answer was to open up a cygwin window, and type:


cat MyFile.json | perl -MJSON -MXML::Simple -e 'print XMLout(decode_json(do{local$/;<>}),RootName=>"json")' > MyFile.xml

(Thank you very much, stackoverflow user azatoth!)

And that did the trick; I had to do some minor tidying up (due to @ symbols, and node names starting with a number), but in the main, it did the job for me, with a minimum of effort.

The point of this post is two-fold:

  1. When this requirement crops up again, I only have to look here to remind myself, and…
  2. To spread the word about cygwin.

Cygwin (www.cygwin.com) is a way to get hold of Unix/Linux-style functionality on Windows. I’ve used it for years now, and it’s an invaluable part of what I do; it’s literally one of the first things I install on any new machine.

If you do any significant amount of text file processing, there are many great command-line tools to be found within the cygwin environment; just a few I use on at least a weekly, if not daily, basis:

  • grep: for searching through text files, using regular expressions
  • sed: for basic text transformations
  • awk: a programming language for text processing
  • perl: a programming language widely used in the *nix world

The beauty of these tools is that they’re so widely used, it’s almost guaranteed that whatever you want to do, someone else has already put the correct syntax online (cf. my JSON to XML problem). Usually, some light googling (often incorporating the term ‘+stackoverflow’) will get you your answer. I wouldn’t claim for a second that I ‘know’ these tools (apart from maybe grep), but being able to work with them is enough.

If you’re a developer or analyst who has to routinely work with data files, I can’t recommend cygwin highly enough.


Floats may not look distinct

The temporary table #Data contains the following:


SELECT * FROM #Data
GO

value
-------
123.456
123.456
123.456

(3 row(s) affected)

Three copies of the same number, right? However:


SELECT DISTINCT value FROM #Data
GO

value
-------
123.456
123.456
123.456

(3 row(s) affected)

We have the exact same result set. How can this be?

It’s because what’s being displayed isn’t necessarily what’s stored internally. This should make it clearer:


SELECT remainder = (value - 123.456) FROM #Data
GO

remainder
----------------------
9.9475983006414E-14
1.4210854715202E-14
0

(3 row(s) affected)

The numbers aren’t all exactly 123.456; the data is in floating-point format, and two of the values were ever-so-slightly larger. The lesson is: be very careful when using DISTINCT, GROUP BY or aggregate functions on columns declared as type float.
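
If you want to reproduce this at home, here’s a minimal sketch; the exact nudges are assumptions, chosen to match the remainders shown above:

CREATE TABLE #Data (value FLOAT)

-- The tiny additions may not round-trip to exactly the remainders above,
-- but they're large enough to survive as distinct float values:
INSERT INTO #Data (value) VALUES
  (123.456 + 9.9475983006414E-14)
  ,(123.456 + 1.4210854715202E-14)
  ,(123.456)
GO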

Some other observations:

  • The above will probably be familiar to anyone who’s done much text wrangling in SQL. Strings can look identical to the eye, but different to SQL Server’s processing engine; you end up having to examine every character, finding and eliminating extraneous tabs (ASCII code 9), carriage returns (ASCII code 13), line-feeds (ASCII code 10), or even weirder. (There’s a quick sketch of this after the R example below.)
  • If your requirement warrants it, I can thoroughly recommend the GNU Multiple Precision Arithmetic Library, which stores numbers to arbitrary precision. It’s available as libraries for C/C++, and as the R package gmp:

# In R:

> choose(200,50);  # This is 200! / (150! 50!)
[1] 4.538584e+47
> library(gmp);
Attaching package: ‘gmp’
> chooseZ(200,50);
Big Integer ('bigz') :
[1] 453858377923246061067441390280868162761998660528

# Dividing numbers:
> as.bigz(123456789012345678901234567890) / as.bigz(9876543210)
Big Rational ('bigq') :
[1] 61728394506172838938859798528 / 4938271605
# ^^ the result is stored as a rational, in canonical form.
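
Re the first observation above, here’s a minimal T-SQL sketch (with made-up strings) that exposes invisible characters by dumping each position’s ASCII code:

DECLARE @a VARCHAR(20) = 'abc'
  ,@b VARCHAR(20) = 'abc' + CHAR(13)  -- looks identical, but has a trailing carriage return

SELECT v.n
    ,a_char = SUBSTRING(@a, v.n, 1), a_code = ASCII(SUBSTRING(@a, v.n, 1))
    ,b_char = SUBSTRING(@b, v.n, 1), b_code = ASCII(SUBSTRING(@b, v.n, 1))
  FROM (VALUES (1),(2),(3),(4)) AS v(n)
-- At position 4, a_code is NULL but b_code is 13: the strings aren't equal.
GO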
