Archive for category chat

Brute-force packing

Because I’m a huge music fan, I own quite a few CDs; rather more than I’d readily admit to! A few years ago, I started ripping my collection to FLAC, a lossless audio format, and backing everything up to DVD-R. In order to be efficient, I want to get as many albums as I can on each DVD-R disc, minimising the space wasted — this is known as a Packing Problem.

Now, I’d love to be able to present you with a highly-optimised algorithm that I wrote, but that’s not what I did: I brute-forced it. Processor cycles are very cheap, and if it’s going to be orders of magnitude quicker to iterate a few million times than it will be to research a whole new area of maths, then iterating it’ll be. My original code was a VB app (so I could drag folders onto a GUI), but here’s a similar version of the code in R:

set.seed(1234);
containerSize <- 4500; # roughly DVD size in MB
itemSize <- c(1641,1498,1747,751,1090,164,1602,1020,1126,553); # album sizes in MB
cat(sprintf("No. containers needed (no partitioning): %5.2f\n", sum(itemSize) / containerSize));

Z <- 1000; # Number of iterations

# To keep track of the best partition
best.remainder <- 1.0;
best.partition <- NULL;

for(i in 1:Z) {

working <- sample(itemSize); # randomly re-order our list of sizes
partition <- list();
k <- 1;
# Using the order as per 'working', partition the items
# such that the container size isn't exceeded:
while (length(working) > 0) {
  this.partition.indexes <- which( cumsum(working) <= containerSize );
  partition[[k]] <- working[this.partition.indexes];
  working <- working[-(this.partition.indexes)];
  k <- k+1;
}
npm1 <- length(partition) - 1; # Number of partitions minus 1
partition.totals <- unlist(lapply(partition, sum));
remainder <- sum(rep(containerSize, npm1) - partition.totals[1:npm1]) /
    (npm1 * containerSize);

if (remainder < best.remainder) {
  best.remainder <- remainder;
  best.partition <- partition;
  totals.str <- paste("(", paste(partition.totals, collapse=","), ")", sep="");
  partition.str <- paste(unlist(lapply(partition,
      function(x) paste("(",paste(x,collapse=","),")",sep=""))),collapse=",")
  cat(sprintf("i = %3d, rem. = %5.2f%%; totals = %s; partition = %s\n", i,
      remainder * 100.0, totals.str, partition.str));
}

} # end for loop

This code (1000 iterations) runs in the blink of an eye:

i =   1, rem. = 19.00%; totals = (3772,3518,3902);
        partition = (1498,164,1090,1020),(1126,751,1641),(1602,553,1747)
i =   2, rem. = 13.56%; totals = (4439,3341,3412);
        partition = (1602,1090,1747),(553,1498,1126,164),(1641,1020,751)
i =   4, rem. = 13.18%; totals = (3963,3851,3378);
        partition = (1090,1747,1126),(751,1498,1602),(1641,553,164,1020)
i =   6, rem. =  4.78%; totals = (4303,4267,2622);
        partition = (1641,1747,164,751),(553,1126,1498,1090),(1020,1602)
i =  13, rem. =  4.04%; totals = (4301,4335,2556);
        partition = (1020,1126,553,1602),(1747,1498,1090),(751,1641,164)
i =  23, rem. =  0.26%; totals = (4478,4499,2215);
        partition = (1090,1641,1747),(1020,1126,1602,751),(1498,553,164)
i = 524, rem. =  0.02%; totals = (4499,4499,2194);
        partition = (1126,1602,751,1020),(1747,1498,1090,164),(553,1641)

The figure rem. is the percentage of space wasted on the discs that could be full — clearly, not all the discs can be 100% full. So in this case, I knew I was going to be burning three DVD-Rs, but there’s only 1 MB of unused space on each of the first two discs; for the third, I can either find some other files to back up, or keep those two albums to burn later — which is what I usually do, saving even more space by repeatedly putting off burning the least-full disc.


Reports: a waste of time?

At a previous place of work, we had a central reporting server containing hundreds and hundreds of reports, covering every section of the business. Manifestly, reports are vital to running a company, especially a finance company that essentially makes its money by shifting data from one database to another. One day, a colleague decided to spring-clean the whole reporting structure, and what better way to start than pruning the dead wood, getting rid of reports that were no longer used — there have got to be at least a few, right? What he found was pretty astonishing: 95% of reports hadn’t been looked at in the last 3 months. Ninety-five percent! (Maybe some of those were quarterly, six-monthly or yearly reports; everything was archived, not actually deleted, so could’ve been re-instated if necessary. I don’t remember that happening.)

The lessons were clear: managers were asking for reports to be built when they didn’t need them. Either they never needed them, or one-off data pulls would’ve sufficed. Now, reports aren’t free; they cost time and money:

1. Initial planning — any of: a quick chat, an email, a meeting, a proper design spec, a full project plan.

2. Writing the query behind the report, which is never as simple as SELECT * FROM Table WHERE CreatedDate BETWEEN @StartDate AND @EndDate. Often, data from different servers needs to be brought together, which involves asking a DBA to replicate data (something else that has to be managed and tracked). The query will undoubtedly involve some level of aggregation, which may mean creating new tables and scheduled jobs to keep the aggregate data refreshed (another place where milestone tables come in handy).

(Of course, if you’re lucky enough to have a comprehensive, well-maintained data warehouse, then the above might be greatly simplified.)

3. Writing the report itself, which might be a matter of minutes, or could take days if the output has to be, say, formatted for printing, and/or has hierarchical expand/collapse functionality, and/or has many input parameters, especially if they affect the layout as well as the data that gets returned.

4. Testing the report, deploying the report, getting it emailed out on a schedule, etc., etc.

Reports aren’t free, and realising that 95% of them were ultimately unnecessary is as hugely annoying for the report writer as it is for their manager, who could’ve used that resource elsewhere.

So, lesson learnt, and the company went forth with a steely resolve to ensure that reports were only built if the business need for them could be proved beyond doubt, and this came to pass. Dozens and dozens of new reports were built, but each one was vital, and had a clear purpose.

You might be able to guess the next part… The same spring-clean exercise was repeated over a year later: 80% of the reports hadn’t been looked at in the last 3 months.

What’s to be gleaned from this? At the heart of it is still the issue that reports are perceived as ‘cheap’. The only way to prevent this level of waste is … to somehow levy a charge on the original requester? To incur a penalty if reports go unused? To accept the whole situation as inevitable? I don’t know the answer. If you have any practical ideas, please let me know in the comments.


Some footnotes:

1. During the periods in which these reports were going unused, the business was constantly learning and changing. If, prior to the first exercise, I’d been asked to guess how many reports weren’t used, I’d have said 30-50%.

2. I’m not trying to imply that my previous workplace was sloppy or inefficient; it wasn’t. I’ve seen this happen more than once; it’s just that the numbers were so striking that time.

3. The cleverest people I’ve ever worked for didn’t require lengthy reports to help them perform their duties, they could often elicit the truth from a handful of numbers. A skill I will always be envious of!


Using median rather than mean

Analysts in consumer finance are used to dealing with large sets of incoming customer data, and are often asked to provide summary statistics to show whether application quality is changing over time: Are credit scores going down? What’s the average monthly income? What’s the average age of the customers applying this quarter, compared to last?

Early on in the process, the data is likely to be messy, unsanitised, and potentially chock-full of errors. (Once the data has been processed to the point of considering credit-worthiness, then it’s all clean and perfect, yeah..?) Therefore, you have to be careful when you report on this data, as it’s easy to get nonsense figures that will send your marketing and risk people off in the wrong direction.

Two really common errors seen in numerical application data:

1. Putting gross yearly salary into the net monthly income field, and it’s not always easy to spot which one it ought to be: if it says ‘£50000’, then it’s very likely to be an annual figure; but what about ‘£7000’? If monthly, that’s a really good wage; but I can assure you, people earning that much are still applying for credit.

2. Dates of birth: if unknown, it’s common to see ‘1st January 1900’ and similar. So when you convert it to age, the customer is over 100.

Also, if you happen to get credit scores with your data, you have to watch out for magic numbers like 9999 – which to Callcredit means “no [credit] score could be generated”, not that the customer has the credit-worthiness of Bill Gates or George Soros.

Hence, it’s fairly obvious that if you include these figures in mean averages, you’re going to give a misleading impression, and people can infer the wrong thing. For example, say you’ve 99 applications with an average monthly income of £2000, but there’s also an incorrect application with a figure of £50,000. If you report the mean, you’ll get an answer of £2480, instead of the correct £2010 (assuming that £50k salary translates to ~£3k take-home per month). However, if you report the median, you’ll get an answer of £2000, whether the incorrect data is in there or not.
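To make this concrete, here’s the example above in R (a minimal sketch that, for simplicity, treats all 99 genuine incomes as exactly £2000):

incomes <- c(rep(2000, 99), 50000); # 99 genuine figures, plus one annual salary keyed in as monthly
mean(incomes);   # 2480 -- dragged upwards by the single bad value
median(incomes); # 2000 -- unaffected

# With the rogue figure corrected to ~£3k take-home per month:
mean(c(rep(2000, 99), 3000));   # 2010
median(c(rep(2000, 99), 3000)); # 2000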

In statistical parlance, the median is “a robust measure of central tendency”, whereas the mean is not. The median isn’t affected by a few outliers (at either end).


End note: Credit scores can be (roughly) normally distributed; for a normal distribution, the median and mean (and mode) are the same. But data doesn’t have to be normally distributed: e.g. call-waiting times follow the exponential distribution, where the median and mean are not the same (for an exponential distribution, the median is ln 2 ≈ 0.69 times the mean).
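A quick R illustration of that end note (the distribution parameters here are arbitrary, chosen purely for demonstration):

set.seed(1234);
x.norm <- rnorm(1e6, mean=600, sd=50); # roughly normal, e.g. credit-score-like values
x.exp  <- rexp(1e6, rate=1/300);       # exponential, e.g. call-waiting times, mean 300 seconds

c(mean(x.norm), median(x.norm)); # approximately equal
c(mean(x.exp), median(x.exp));   # median is roughly mean * ln(2), i.e. ~208 vs ~300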


Converting JSON to XML : A Gateway to Cygwin

Of the files I have to deal with on a weekly basis, I’d put the breakdown at 50% Excel, 40% CSV, and 10% XML. This is fine: I can reliably transfer data from those files into SQL Server without too many tears. However, today I was presented with JSON-formatted versions of files that I’d normally get as XML; and I haven’t had to deal with JSON since I last wrote PHP/AJAX code, about five years ago.

Now, SQL Server 2016 can natively read/write JSON code (see, for example, this MSDN blog), but I use SQL Server 2014, which knows nothing about JSON.

Of course, I googled for JSON to XML converters. There are many, mostly in the form of libraries for other systems, and even a few online converters that would do the job ‘in-browser’. Unfortunately, the data I needed to convert was credit file data, and that data is sensitive. I can’t just go pasting it into unknown websites without completely understanding what’s going to happen to it – if there’s the slightest chance my data could get uploaded and saved elsewhere, I can’t use that site. I did find an online site that did the conversion purely in JavaScript (no POSTs back to the server), so I copied the code locally, pasted my JSON in… and it crashed the browser (Chrome). Turns out 80 KB of JSON was too much for this JavaScript, and in fact, a couple of the standalone converters I tried also had trouble with this (small) amount of code.

There was even a pure T-SQL converter (written as a function) that I tried, but unfortunately, that didn’t work out either. Which is a shame, as a SQL-based solution appeals to me greatly!

To cut a dull story short, here’s how I did it: Perl. Thanks to the third most popular answer to this stackoverflow question, the solution was to open up a Cygwin window and type:


cat MyFile.json | perl -MJSON -MXML::Simple -e 'print XMLout(decode_json(do{local$/;<>}),RootName=>"json")' > MyFile.xml

(Thank you very much, stackoverflow user azatoth!)

And that did the trick; I had to do some minor tidying up (due to @ symbols, and node names starting with a number), but in the main, it did the job for me, with a minimum of effort.

The point of this post is two-fold:

  1. When this requirement crops up again, I only have to look here to remind myself, and…
  2. To spread the word about cygwin.

Cygwin (www.cygwin.com) is a way to get hold of Unix/Linux-style functionality on Windows. I’ve used it for years now, and it’s an invaluable part of what I do; it’s literally one of the first things I install on any new machine.

If you do any significant amount of text file processing, there are many great command-line tools to be found within the cygwin environment; just a few I use on at least a weekly, if not daily, basis:

  • grep: for searching through text files, using regular expressions
  • sed: for basic text transformations
  • awk: a programming language for text processing
  • perl: a programming language widely used in the *nix world

The beauty of these tools is that they’re so widely used, it’s almost guaranteed that whatever you want to do, someone else has already put the correct syntax online (cf. my JSON to XML problem). Usually, some light googling (often incorporating the term ‘+stackoverflow’) will get you your answer. I wouldn’t claim for a second that I ‘know’ these tools (apart from maybe grep), but being able to work with them is enough.

If you’re a developer or analyst who has to routinely work with data files, I can’t recommend cygwin highly enough.


Getting it all wrong

In consumer lending, there are many constraints that have to be satisfied before any money is paid out to a customer. Off the top of my head, the customer must be:

  • Affordable
  • In employment (not always, but usually)
  • Over 18 (and under some policy-defined upper age limit)
  • Not bankrupt or otherwise insolvent
  • Accepted by the credit risk scorecard
  • Not a PEP (Politically Exposed Person)
  • Not an SDN (Specially Designated National), someone the government doesn’t want us to do business with
  • Not suffering from mental health problems, or otherwise vulnerable
  • Not matched by our own ‘do not lend’ list – i.e. they don’t match (or have links to) previous frauds or undesirable customers
  • Accepted by the fraud risk scorecard
  • Not suffering from an addiction to gambling
  • Not currently in prison
  • Thought to be telling the truth about their financial situation

(The exact list isn’t important here.)

Some of these statuses are easier to determine than others, but it’s impossible to get any of them 100% correct. Credit files are certainly imperfect, and the data they contain is always going to be out-of-date to some extent; PEP/SDN lists are pretty fuzzy; and of course, some people lie about their financial situation in order to get a loan paid out.

For the sake of demonstration, let’s assume we’re 99% accurate for each one: we get it wrong one time in a hundred. Given that error rate, what’s the probability of getting the overall decision wrong — that is, not satisfying every constraint — and paying out to someone we shouldn’t? Simple probability tells us it’s one minus the probability of getting all of them right, so for the 13 constraints above, it’s 1 – (0.99 ^ 13) = 12.2%. In other words, we’ve a one in eight chance of paying out when we shouldn’t; the consequences of which could be loss of money, regulatory or legal issues, damage to reputation, etc.

You could break some of those constraints into smaller parts: e.g. affordability is a calculation that relies on both income and expenditure being calculated correctly. Hence, satisfying those 13 constraints is down to many smaller decisions (a scorecard involves hundreds of calculations). In which case, our situation is even worse: at our 99% error rate, it only takes 69 decisions to make it more likely than not that our overall decision is wrong! (1 – (0.99 ^ 69) = 50.02%.)
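For anyone who wants to check the arithmetic, the figures above take only a few lines of R:

p <- 0.99;                       # accuracy of each individual decision
1 - p^13;                        # 0.122... i.e. a 12.2% chance the overall decision is wrong
1 - p^69;                        # 0.5002... i.e. just over 50%
min(which(1 - p^(1:100) > 0.5)); # 69: the point at which an overall error becomes more likely than not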

Luckily, the real-life situation isn’t this bad:

  • For some decisions we’re far better than 99% accurate, e.g. “aged over 18” is reasonably simple to get right
  • We test our scorecards thoroughly (yes, we do).*
  • We corroborate information by obtaining it from several sources
  • The various decisions aren’t all independent (our calculation assumes they are)

* Note that we’re not concerned here with the scorecard predicting correctly – just with it functioning correctly, in accordance with how it was built.

To be honest, I’ve never estimated what the actual “incorrect lending decision” rate might be (my gut feeling is that it’s satisfactorily low), but I’ve put it on my list of things to consider. And, as ever, I know that I won’t have a hope of getting an accurate answer if the data isn’t readily available and undistorted.


People aren’t people: when matching them can be hard

I’m going to spend a few posts on a subject that is, frankly, the bane of my life: people matching. That is, given two sets of person-related details, do I believe they are the same person? It’s eminently useful for many things including keeping marketing costs down, improving customer service, and very importantly, preventing fraud.

If your dataset(s) contain a unique person key, e.g. Social Security Number in the USA, or National Insurance Number here in the UK, then the task is obviously pretty simple (barring errors in the data). If there’s no unique person key, you’ve got a great deal more work to do. I’d say it follows a 95 / 5 rule: to match the first 95% of your dataset takes 5% of the time, the 5% that’s left takes the remaining 95% of the time. (Hence why it causes me grief: you can end up writing reams of code to match a handful of details, in a never-ending quest for greater accuracy!)

Before I start discussing how I’d do people matching in a “perfect world” scenario, I’m going to list some of the problems I’ve encountered when trying to match data from UK sources.

Names

  • Shortened or alternative forms of the first name: e.g. Bill / William, Peggy / Margaret, Jack / John. And these days, Alfie probably isn’t short for Alfred, just as Harry probably isn’t short for Harold (or even a boy’s name).
  • As per the above, I wouldn’t ever assume a particular first name implies a gender; you’ll be wrong at some point, and an awkward conversation might ensue.
  • Using middle names as first names; famous examples include Hannah Dakota Fanning, William Bradley Pitt, Walter Bruce Willis, James Paul McCartney, Laura Jeanne Reese Witherspoon.
  • Married names, people taking their spouse’s last name, without any restrictions on gender.
  • Double-barrelling last names with spouse or partner.
  • Very common names – names like ‘George Smith’ and ‘Claire Wilson’ mean placing more reliance on other pieces of information when matching.

Titles

  • In my experience, Mr/Ms/Miss/Mrs etc. are rarely correct enough to rely on to indicate gender or married status*, even when the primary source is data the customer has entered themselves. Also, the gender-neutral Mx is becoming increasingly common.
  • Let’s not even get into the realms of Professor, Doctor, Lord/Lady, Reverend and assorted military titles…

* Using gender and married status purely as aids to matching people, nothing else.

Dates of birth

It’s very easy to get the date of birth wrong with mis-typing, or getting the month and day the wrong way round. Also, people (a) don’t like to give their birthdate out, so may give a dummy one (1st Jan 1970 is common), or (b) will lie about their age if they think it improves their chances of obtaining a product or service.
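As an illustration (a rough sketch only, with function names of my own invention), here are a couple of quick R checks along these lines: one flags the common dummy dates, the other a possible day/month transposition between two records:

# Flag common dummy dates of birth:
is.dummy.dob <- function(dob) {
  as.Date(dob) %in% as.Date(c("1900-01-01", "1970-01-01"));
}

# Could these two dates differ only by a day/month swap?
possible.daymonth.swap <- function(dob1, dob2) {
  d1 <- as.POSIXlt(as.Date(dob1));
  d2 <- as.POSIXlt(as.Date(dob2));
  (d1$year == d2$year) & (d1$mday == d2$mon + 1) & (d1$mon + 1 == d2$mday);
}

is.dummy.dob("1970-01-01");                         # TRUE
possible.daymonth.swap("1985-04-07", "1985-07-04"); # TRUE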

People with “non-traditionally British” names

  • People from other countries adopting a Western-style first name alongside their traditional birth-name (e.g. Chinese people).
  • First names / family names may not be in the expected order (again, e.g. Chinese).
  • Names that have more than one translation into English, e.g. Mohammed / Muhammad / Mohamed.
  • Different character sets! Greek, Cyrillic, Arabic, etc.

(“Non-traditionally British” is an ugly turn of phrase, there must be a better way of putting it…)

Family

  • Fathers and sons with exactly the same first, middle and last names. (Far more common than you’d think!)
  • Twins; especially twins with very similar first names (Mia/Mya, Ethan/Evan).
  • You can’t reliably infer relationships using only differences in age; two customers from the same family, 32 years apart, could potentially be siblings, parent/child, or even grandparent/grandchild.

Addresses

  • Living at more than one address; in particular, students living away from home.
  • Moving house, sometimes within the same postcode, or even next door.
  • Postcodes not existing yet on the Postcode Address File, although you may find them on Google Maps(!)
  • Postcodes becoming invalid / retired, e.g. postcodes in the districts BS12, BS17-19.
  • Postcodes becoming valid: the district E20 was previously used only for the fictional TV soap Eastenders, but postcodes in this district have now started to be allocated for real addresses.
  • Roads can be renamed [BBC]
  • Buildings can be split into flats.
  • Different naming conventions; flats in Scotland can be named by floor number / flat number, e.g. 2/1 (2nd floor, 1st flat).

Some address-related problems can be solved by using the Unique Property Reference Number (UPRN) or the Unique Delivery Point Reference Number (UDPRN) to represent the address, but neither of these has widespread adoption yet.

Email addresses

  • Having more than one email address.
  • Labels, e.g. fred.smith+SPAM@mailbox.com and fred.smith+NOTSPAM@mailbox.com. The canonical version of the email address would be fred.smith@mailbox.com, which may be more useful for matching purposes (see the sketch after this list).
  • Temporary inboxes, e.g. Mailinator.
  • Format: Validating the syntax of 99% of email addresses is straightforward; getting the full 100% is almost impossible. See here [wikipedia] for a brief explanation about which characters are allowed in an email address.
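On the ‘labels’ point above, here’s a minimal sketch of how I might reduce an address to a canonical form for matching (the function name, and the rules of lower-casing and stripping everything between ‘+’ and ‘@’, are my own assumptions, not any standard):

# Canonicalise an email address for matching purposes only:
# lower-case it, then strip any "+label" from the local part.
canonical.email <- function(email) {
  email <- tolower(trimws(email));
  sub("\\+[^@]*@", "@", email);
}

canonical.email("Fred.Smith+SPAM@mailbox.com");    # "fred.smith@mailbox.com"
canonical.email("fred.smith+NOTSPAM@mailbox.com"); # "fred.smith@mailbox.com"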

Mobiles

Home phone numbers

  • Having more than one home phone number.
  • Not having a phone number, but entering one belonging to a friend or relative.
  • Not having a phone number, so using the number of a local taxi firm, public house, or fast-food restaurant (again, more common than you might think).

Bank accounts

  • Having more than one bank account
  • People not moving their bank accounts when they move house. (I live 80 miles away from my nominal branch.)
  • Sort codes changing, especially when banks merge or split.
  • Joint bank accounts
  • Business bank accounts

Debit and credit cards

You almost certainly shouldn’t be storing card details…! [www.theukcardsassociation.org.uk]

Incorrect details

  • Accidental mis-typing
  • Deliberate fraud – typically, the name and address might be real, but the mobile and email will be the fraudster’s.
  • System-testing : internal (dev or UAT environment) vs. external (penetration testing), manual/automated, regular (e.g. employees) / irregular (e.g. competitors testing capabilities; hackers!)
  • Details not existing: some people don’t have home telephone numbers (so put their mobile number in that field instead), whereas other people don’t have mobiles (so they put their home number instead).
  • People just messing around, possibly not-very-maliciously.

Other

  • Older people using younger family members’ email addresses and/or mobile numbers.
  • People who work overseas and have a non-UK mobile number and address; they could be a valid customer, as per your policies, but with only non-UK contact details. Do your systems accept a phone number that doesn’t start +44?
  • Driving licence / passport: most existing systems only validate the format of the identifying numbers, which makes them a target for fraudsters. Newer systems can validate images of the documents.
  • Device IDs are great for fraud detection, but can present a problem when matching people; families often share devices, and what about public computers in libraries and internet cafes?
  • Electoral Roll: Being on the full electoral roll at an address is no guarantee that the person is living there, and the converse is also true.

Third-party services exist to validate/verify almost all the information above, singly and in combination. However, none of the services are perfect, so matching person-level data comes down to cost (third-party data and development time), and your tolerance for mistakes – how embarrassing might it be if you get it wrong?

If you have any examples of when matching personal details has proved trickier than you thought it was going to be, please let me know in the comments below!


Introduction

Hello, my name is Pete and I’ve been building and maintaining Microsoft SQL Server databases for nearly 20 years. At no point has my job title ever mentioned SQL or databases; I started off building websites that used SQL Server as a back-end, and have somewhat drifted over the years to become a statistical modeller and analyst, but I still spend a large portion of my day typing SQL commands into Management Studio. I’m no expert in any aspect of SQL Server (and certainly no DBA), but I’ve built up enough experience to feel comfortable blogging about how I use it: what works for me, and what doesn’t.

Hence, I don’t claim that anything I write will be definitive, authoritative, or the ‘best way’ of doing it. I’m always learning, and SQL Server itself is an evolving product.

I think I should also mention that I generally work on databases up to a few hundred gigabytes in size. These days, such sizes are not considered “a lot of data”. I’m well aware that methods that work for smaller databases (e.g. databases measured in gigabytes) will not always perform as well for large databases (multi-terabyte and above). Caveat lector, “let the reader beware”!
