Archive for category chat

Brute-force packing

Because I’m a huge music fan, I own quite a few CDs; rather more than I’d readily admit to! A few years ago, I started ripping my collection to FLAC, a lossless audio format, and backing everything up to DVD-R. In order to be efficient, I want to get as many albums as I can on each DVD-R disc, minimising the space wasted — this is known as a Packing Problem.

Now, I’d love to be able to present you with a highly-optimised algorithm that I wrote, but that’s not what I did: I brute-forced it. Processor cycles are very cheap, and if it’s going to be orders of magnitude quicker to iterate a few million times than it will be to research a whole new area of maths, then iterating it’ll be. My original code was a VB app (so I could drag folders onto a GUI), but here’s a similar version of the code in R:

set.seed(1234);
containerSize <- 4500; # roughly DVD size in MB
itemSize <- c(1641,1498,1747,751,1090,164,1602,1020,1126,553); # album sizes in MB
cat(sprintf("No. containers needed (no partitioning): %5.2f\n", sum(itemSize) / containerSize));

Z <- 1000; # Number of iterations

# To keep track of the best partition
best.remainder <- 1.0;
best.partition <- NULL;

for(i in 1:Z) {

working <- sample(itemSize); # randomly re-order our list of sizes
partition <- list();
k <- 1;
# Using the order as per 'working', partition the items
# such that the container size isn't exceeded:
while (length(working) > 0) {
  this.partition.indexes <- which( cumsum(working) <= containerSize );
  partition[[k]] <- working[this.partition.indexes];
  working <- working[-(this.partition.indexes)];
  k <- k+1;
}
npm1 <- length(partition) - 1; # Number of partitions minus 1
partition.totals <- unlist(lapply(partition, sum));
remainder <- sum(rep(containerSize, npm1) - partition.totals[1:npm1]) /
    (npm1 * containerSize);

if (remainder < best.remainder) {
  best.remainder <- remainder;
  best.partition <- partition;
  totals.str <- paste("(", paste(partition.totals, collapse=","), ")", sep="");
  partition.str <- paste(unlist(lapply(partition,
      function(x) paste("(",paste(x,collapse=","),")",sep=""))),collapse=",")
  cat(sprintf("i = %3d, rem. = %5.2f%%; totals = %s; partition = %s\n", i,
      remainder * 100.0, totals.str, partition.str));
}

} # end for loop

This code (1000 iterations) runs in the blink of an eye:

i =   1, rem. = 19.00%; totals = (3772,3518,3902);
        partition = (1498,164,1090,1020),(1126,751,1641),(1602,553,1747)
i =   2, rem. = 13.56%; totals = (4439,3341,3412);
        partition = (1602,1090,1747),(553,1498,1126,164),(1641,1020,751)
i =   4, rem. = 13.18%; totals = (3963,3851,3378);
        partition = (1090,1747,1126),(751,1498,1602),(1641,553,164,1020)
i =   6, rem. =  4.78%; totals = (4303,4267,2622);
        partition = (1641,1747,164,751),(553,1126,1498,1090),(1020,1602)
i =  13, rem. =  4.04%; totals = (4301,4335,2556);
        partition = (1020,1126,553,1602),(1747,1498,1090),(751,1641,164)
i =  23, rem. =  0.26%; totals = (4478,4499,2215);
        partition = (1090,1641,1747),(1020,1126,1602,751),(1498,553,164)
i = 524, rem. =  0.02%; totals = (4499,4499,2194);
        partition = (1126,1602,751,1020),(1747,1498,1090,164),(553,1641)

The figure rem. is the percentage of space wasted on the discs that could be full — clearly, not all the discs can be 100% full. So in this case, I knew I was going to be burning three DVD-Rs, but there’s only 1 MB of unused space on each of the first two discs; for the third, I can either find some other files to back up, or keep those two albums to burn later — which is what I usually do, saving even more space by repeatedly putting off burning the least-full disc.


Reports: a waste of time?

At a previous place of work, we had a central reporting server containing hundreds and hundreds of reports, covering every section of the business. Manifestly, reports are vital to running a company, especially a finance company that essentially makes its money by shifting data from one database to another. One day, a colleague decided to spring-clean the whole reporting structure, and what better way to start than pruning the dead wood, getting rid of reports that were no longer used — there have got to be at least a few, right? What he found was pretty astonishing: 95% of reports hadn’t been looked at in the last 3 months. Ninety-five percent! (Maybe some of those were quarterly, six-monthly or yearly reports; everything was archived, not actually deleted, so could’ve been re-instated if necessary. I don’t remember that happening.)

The lessons were clear: managers were asking for reports to be built when they didn’t need them. Either they never needed them, or one-off data pulls would’ve sufficed. Now, reports aren’t free; they cost time and money:

1. Initial planning — any of: a quick chat, an email, a meeting, a proper design spec, a full project plan.

2. Writing the query behind the report, which is never as simple as SELECT * FROM Table WHERE CreatedDate BETWEEN @StartDate AND @EndDate. Often, data from different servers needs to be brought together, which involves asking a DBA to replicate data (something else that has to be managed and tracked). The query will undoubtedly involve some level of aggregation, which may mean creating new tables and scheduled jobs to keep the aggregate data refreshed (another place where milestone tables come in handy).

(Of course, if you’re lucky enough to have a comprehensive, well-maintained data warehouse, then the above might be greatly simplified.)

3. Writing the report itself, which might be a matter of minutes, or could take days if the output has to be, say, formatted for printing, and/or has hierarchical expand/collapse functionality, and/or has many input parameters, especially if they affect the layout as well as the data that gets returned.

4. Testing the report, deploying the report, getting it emailed out on a schedule, etc., etc.

Reports aren’t free, and realising that 95% of them were ultimately unnecessary is as hugely annoying for the report writer as it is for their manager, who could’ve used that resource elsewhere.

So, lesson learnt, and the company went forth with a steely resolve to ensure that reports were only built if the business need for them could be proved beyond doubt, and this came to pass. Dozens and dozens of new reports were built, but each one was vital, and had a clear purpose.

You might be able to guess the next part… The same spring-clean exercise was repeated over a year later: 80% of the reports hadn’t been looked at in the last 3 months.

What’s to be gleaned from this? At the heart of it is still the issue that reports are perceived as ‘cheap’. The only way to prevent this level of waste is … to somehow levy a charge on the original requester? To incur a penalty if reports go unused? To accept the whole situation as inevitable? I don’t know the answer. If you have any practical ideas, please let me know in the comments.


Some footnotes:

1. During the periods in which these reports were going unused, the business was constantly learning and changing. If, prior to the first exercise, I’d been asked to guess how many reports weren’t used, I’d have said 30-50%.

2. I’m not trying to imply that my previous workplace was sloppy or inefficient; it wasn’t. I’ve seen this happen more than once; it’s just that the numbers were so striking that time.

3. The cleverest people I’ve ever worked for didn’t require lengthy reports to help them perform their duties, they could often elicit the truth from a handful of numbers. A skill I will always be envious of!


Using median rather than mean

Analysts in consumer finance are used to dealing with large sets of incoming customer data, and are often asked to provide summary statistics to show whether application quality is changing over time: Are credit scores going down? What’s the average monthly income? What’s the average age of the customers applying this quarter, compared to last?

Early on in the process, the data is likely to be messy, unsanitised, and potentially chock-full of errors. (Once the data has been processed to the point of considering credit-worthiness, then it’s all clean and perfect, yeah..?) Therefore, you have to be careful when you report on this data, as it’s easy to get nonsense figures that will send your marketing and risk people off in the wrong direction.

Two really common errors seen in numerical application data:

1. Putting gross yearly salary into the net monthly income field, and it’s not always easy to spot which one it ought to be: if it says ‘£50000’, then it’s very likely to be an annual figure; but what about ‘£7000’? If monthly, that’s a really good wage; but I can assure you, people earning that much are still applying for credit.

2. Dates of birth: if unknown, it’s common to see ‘1st January 1900’ and similar. So when you convert it to age, the customer is over 100.

Also, if you happen to get credit scores with your data, you have to watch out for magic numbers like 9999 – which to Callcredit means “no [credit] score could be generated”, not that the customer has the credit-worthiness of Bill Gates or George Soros.

Hence, it’s fairly obvious that if you include these figures in mean averages, you’re going to give a misleading impression, and people can infer the wrong thing. For example, say you’ve 99 applications with an average monthly income of £2000, but there’s also an incorrect application with a figure of £50,000. If you report the mean, you’ll get an answer of £2480, instead of the correct £2010 (assuming that £50k salary translates to ~£3k take-home per month). However, if you report the median, you’ll get an answer of £2000, whether the incorrect data is in there or not.
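To make this concrete, here’s the example above in R (a minimal sketch that, for simplicity, treats all 99 genuine incomes as exactly £2000):

incomes <- c(rep(2000, 99), 50000); # 99 genuine figures, plus one annual salary keyed in as monthly
mean(incomes);   # 2480 -- dragged upwards by the single bad value
median(incomes); # 2000 -- unaffected

# With the rogue figure corrected to ~£3k take-home per month:
mean(c(rep(2000, 99), 3000));   # 2010
median(c(rep(2000, 99), 3000)); # 2000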

In statistical parlance, the median is “a robust measure of central tendency”, whereas the mean is not. The median isn’t affected by a few outliers (at either end).


End note: Credit scores can be (roughly) normally distributed; for a normal distribution, the median and mean (and mode) are the same. But data doesn’t have to be normally distributed: e.g. call-waiting times follow the exponential distribution, where the median and mean are not the same (for an exponential distribution, the median is ln 2 ≈ 0.69 times the mean).
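A quick R illustration of that end note (the distribution parameters here are arbitrary, chosen purely for demonstration):

set.seed(1234);
x.norm <- rnorm(1e6, mean=600, sd=50); # roughly normal, e.g. credit-score-like values
x.exp  <- rexp(1e6, rate=1/300);       # exponential, e.g. call-waiting times, mean 300 seconds

c(mean(x.norm), median(x.norm)); # approximately equal
c(mean(x.exp), median(x.exp));   # median is roughly mean * ln(2), i.e. ~208 vs ~300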


Converting JSON to XML : A Gateway to Cygwin

Of the files I have to deal with on a weekly basis, I’d put the breakdown at 50% Excel, 40% CSV, and 10% XML. This is fine: I can reliably transfer data from those files into SQL Server without too many tears. However, today I was presented with JSON-formatted versions of files that I’d normally get as XML; and I haven’t had to deal with JSON since I last wrote PHP/AJAX code, about five years ago.

Now, SQL Server 2016 can natively read/write JSON code (see, for example, this MSDN blog), but I use SQL Server 2014, which knows nothing about JSON.

Of course, I googled for JSON to XML converters. There are many, mostly in the form of libraries for other systems, and even a few online converters that would do the job ‘in-browser’. Unfortunately, the data I needed to convert was credit file data, and that data is sensitive. I can’t just go pasting it into unknown websites without completely understanding what’s going to happen to it – if there’s the slightest chance my data could get uploaded and saved elsewhere, I can’t use that site. I did find an online site that did the conversion purely in JavaScript (no POSTs back to the server), so I copied the code locally, pasted my JSON in… and it crashed the browser (Chrome). Turns out 80 KB of JSON was too much for this JavaScript, and in fact, a couple of the standalone converters I tried also had trouble with this (small) amount of code.

There was even a pure T-SQL converter (written as a function) that I tried, but unfortunately, that didn’t work out either. Which is a shame, as a SQL-based solution appeals to me greatly!

To cut a dull story short, here’s how I did it: Perl. Thanks to the third most popular answer to this stackoverflow question, the solution was to open up a Cygwin window and type:


cat MyFile.json | perl -MJSON -MXML::Simple -e 'print XMLout(decode_json(do{local$/;<>}),RootName=>"json")' > MyFile.xml

(Thank you very much, stackoverflow user azatoth!)

And that did the trick; I had to do some minor tidying up (due to @ symbols, and node names starting with a number), but in the main, it did the job for me, with a minimum of effort.

The point of this post is two-fold:

  1. When this requirement crops up again, I only have to look here to remind myself, and…
  2. To spread the word about cygwin.

Cygwin (www.cygwin.com) is a way to get hold of Unix/Linux-style functionality on Windows. I’ve used it for years now, and it’s an invaluable part of what I do; it’s literally one of the first things I install on any new machine.

If you do any significant amount of text file processing, there are many great command-line tools to be found within the cygwin environment; just a few I use on at least a weekly, if not daily, basis:

  • grep: for searching through text files, using regular expressions
  • sed: for basic text transformations
  • awk: a programming language for text processing
  • perl: a programming language widely used in the *nix world

The beauty of these tools is that they’re so widely used, it’s almost guaranteed that whatever you want to do, someone else has already put the correct syntax online (cf. my JSON to XML problem). Usually, some light googling (often incorporating the term ‘+stackoverflow’) will get you your answer. I wouldn’t claim for a second that I ‘know’ these tools (apart from maybe grep), but being able to work with them is enough.

If you’re a developer or analyst who has to routinely work with data files, I can’t recommend cygwin highly enough.


Getting it all wrong

In consumer lending, there are many constraints that have to be satisfied before any money is paid out to a customer. Off the top of my head, the customer must be:

  • Affordable
  • In employment (not always, but usually)
  • Over 18 (and under some policy-defined upper age limit)
  • Not bankrupt or otherwise insolvent
  • Accepted by the credit risk scorecard
  • Not a PEP (Politically Exposed Person)
  • Not an SDN (Specially Designated National), someone the government doesn’t want us to do business with
  • Not suffering from mental health problems, or otherwise vulnerable
  • Not matched by our own ‘do not lend’ list – i.e. they don’t match (or have links to) previous frauds or undesirable customers
  • Accepted by the fraud risk scorecard
  • Not suffering from an addiction to gambling
  • Not currently in prison
  • Thought to be telling the truth about their financial situation

(The exact list isn’t important here.)

Some of these statuses are easier to determine than others, but it’s impossible to get any of them 100% correct. Credit files are certainly imperfect, and the data they contain is always going to be out-of-date to some extent; PEP/SDN lists are pretty fuzzy; and of course, some people lie about their financial situation in order to get a loan paid out.

For the sake of demonstration, let’s assume we’re 99% accurate for each one: we get it wrong one time in a hundred. Given that error rate, what’s the probability of getting the overall decision wrong — that is, not satisfying every constraint — and paying out to someone we shouldn’t? Simple probability tells us it’s one minus the probability of getting all of them right, so for the 13 constraints above, it’s 1 – (0.99 ^ 13) = 12.2%. In other words, we’ve a one in eight chance of paying out when we shouldn’t; the consequences of which could be loss of money, regulatory or legal issues, damage to reputation, etc.

You could break some of those constraints into smaller parts: e.g. affordability is a calculation that relies on both income and expenditure being calculated correctly. Hence, satisfying those 13 constraints is down to many smaller decisions (a scorecard involves hundreds of calculations). In which case, our situation is even worse: at our 99% error rate, it only takes 69 decisions to make it more likely than not that our overall decision is wrong! (1 – (0.99 ^ 69) = 50.02%.)
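For anyone who wants to check the arithmetic, the figures above take only a few lines of R:

p <- 0.99;                       # accuracy of each individual decision
1 - p^13;                        # 0.122... i.e. a 12.2% chance the overall decision is wrong
1 - p^69;                        # 0.5002... i.e. just over 50%
min(which(1 - p^(1:100) > 0.5)); # 69: the point at which an overall error becomes more likely than not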

Luckily, the real-life situation isn’t this bad:

  • For some decisions we’re far better than 99% accurate, e.g. “aged over 18” is reasonably simple to get right
  • We test our scorecards thoroughly (yes, we do).*
  • We corroborate information by obtaining it from several sources
  • The various decisions aren’t all independent (our calculation assumes they are)

* Note that we’re not concerned here with the scorecard predicting correctly – just with it functioning correctly, in accordance with how it was built.

To be honest, I’ve never estimated what the actual “incorrect lending decision” rate might be (my gut feeling is that it’s satisfactorily low), but I’ve put it on my list of things to consider. And, as ever, I know that I won’t have a hope of getting an accurate answer if the data isn’t readily available and undistorted.


People aren’t people: when matching them can be hard

I’m going to spend a few posts on a subject that is, frankly, the bane of my life: people matching. That is, given two sets of person-related details, do I believe they are the same person? It’s eminently useful for many things including keeping marketing costs down, improving customer service, and very importantly, preventing fraud.

If your dataset(s) contain a unique person key, e.g. Social Security Number in the USA, or National Insurance Number here in the UK, then the task is obviously pretty simple (barring errors in the data). If there’s no unique person key, you’ve got a great deal more work to do. I’d say it follows a 95 / 5 rule: to match the first 95% of your dataset takes 5% of the time, the 5% that’s left takes the remaining 95% of the time. (Hence why it causes me grief: you can end up writing reams of code to match a handful of details, in a never-ending quest for greater accuracy!)

Before I start discussing how I’d do people matching in a “perfect world” scenario, I’m going to list some of the problems I’ve encountered when trying to match data from UK sources.

Names

  • Shortened or alternative forms of the first name: e.g. Bill / William, Peggy / Margaret, Jack / John. And these days, Alfie probably isn’t short for Alfred, just as Harry probably isn’t short for Harold (or even a boy’s name).
  • As per the above, I wouldn’t ever assume a particular first name implies a gender; you’ll be wrong at some point, and an awkward conversation might ensue.
  • Using middle names as first names; famous examples include Hannah Dakota Fanning, William Bradley Pitt, Walter Bruce Willis, James Paul McCartney, Laura Jeanne Reese Witherspoon.
  • Married names, people taking their spouse’s last name, without any restrictions on gender.
  • Double-barrelling last names with spouse or partner.
  • Very common names – names like ‘George Smith’ and ‘Claire Wilson’ mean placing more reliance on other pieces of information when matching.

Titles

  • In my experience, Mr/Ms/Miss/Mrs etc. are rarely correct enough to rely on to indicate gender or married status*, even when the primary source is data the customer has entered themselves. Also, the gender-neutral Mx is becoming increasingly common.
  • Let’s not even get into the realms of Professor, Doctor, Lord/Lady, Reverend and assorted military titles…

* Using gender and married status purely as aids to matching people, nothing else.

Dates of birth

It’s very easy to get the date of birth wrong with mis-typing, or getting the month and day the wrong way round. Also, people (a) don’t like to give their birthdate out, so may give a dummy one (1st Jan 1970 is common), or (b) will lie about their age if they think it improves their chances of obtaining a product or service.
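As an illustration (a rough sketch only, with function names of my own invention), here are a couple of quick R checks along these lines: one flags the common dummy dates, the other a possible day/month transposition between two records:

# Flag common dummy dates of birth:
is.dummy.dob <- function(dob) {
  as.Date(dob) %in% as.Date(c("1900-01-01", "1970-01-01"));
}

# Could these two dates differ only by a day/month swap?
possible.daymonth.swap <- function(dob1, dob2) {
  d1 <- as.POSIXlt(as.Date(dob1));
  d2 <- as.POSIXlt(as.Date(dob2));
  (d1$year == d2$year) & (d1$mday == d2$mon + 1) & (d1$mon + 1 == d2$mday);
}

is.dummy.dob("1970-01-01");                         # TRUE
possible.daymonth.swap("1985-04-07", "1985-07-04"); # TRUE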

People with “non-traditionally British” names

  • People from other countries adopting a Western-style first name alongside their traditional birth-name (e.g. Chinese people).
  • First names / family names may not be in the expected order (again, e.g. Chinese).
  • Names that have more than one translation into English, e.g. Mohammed / Muhammad / Mohamed.
  • Different character sets! Greek, Cyrillic, Arabic, etc.

(“Non-traditionally British” is an ugly turn of phrase, there must be a better way of putting it…)

Family

  • Fathers and sons with exactly the same first, middle and last names. (Far more common than you’d think!)
  • Twins; especially twins with very similar first names (Mia/Mya, Ethan/Evan).
  • You can’t reliably infer relationships using only differences in age; two customers from the same family, 32 years apart, could potentially be siblings, parent/child, or even grandparent/grandchild.

Addresses

  • Living at more than one address; in particular, students living away from home.
  • Moving house, sometimes within the same postcode, or even next door.
  • Postcodes not existing yet on the Postcode Address File, although you may find them on Google Maps(!)
  • Postcodes becoming invalid / retired, e.g. postcodes in the districts BS12, BS17-19.
  • Postcodes becoming valid: the district E20 was previously used only for the fictional TV soap Eastenders, but postcodes in this district have now started to be allocated for real addresses.
  • Roads can be renamed [BBC]
  • Buildings can be split into flats.
  • Different naming conventions; flats in Scotland can be named by floor number / flat number, e.g. 2/1 (2nd floor, 1st flat).

Some address-related problems can be solved by using the Unique Property Reference Number (UPRN) or the Unique Delivery Point Reference Number (UDPRN) to represent the address, but neither of these has widespread adoption yet.

Email addresses

  • Having more than one email address.
  • Labels, e.g. fred.smith+SPAM@mailbox.com and fred.smith+NOTSPAM@mailbox.com. The canonical version of the email address would be fred.smith@mailbox.com, which may be more useful for matching purposes (see the sketch after this list).
  • Temporary inboxes, e.g. Mailinator.
  • Format: Validating the syntax of 99% of email addresses is straightforward; getting the full 100% is almost impossible. See here [wikipedia] for a brief explanation about which characters are allowed in an email address.
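On the ‘labels’ point above, here’s a minimal sketch of how I might reduce an address to a canonical form for matching (the function name, and the rules of lower-casing and stripping everything between ‘+’ and ‘@’, are my own assumptions, not any standard):

# Canonicalise an email address for matching purposes only:
# lower-case it, then strip any "+label" from the local part.
canonical.email <- function(email) {
  email <- tolower(trimws(email));
  sub("\\+[^@]*@", "@", email);
}

canonical.email("Fred.Smith+SPAM@mailbox.com");    # "fred.smith@mailbox.com"
canonical.email("fred.smith+NOTSPAM@mailbox.com"); # "fred.smith@mailbox.com"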

Mobiles

Home phone numbers

  • Having more than one home phone number.
  • Not having a phone number, but entering one belonging to a friend or relative.
  • Not having a phone number, so using the number of a local taxi firm, public house, or fast-food restaurant (again, more common than you might think).

Bank accounts

  • Having more than one bank account
  • People not moving their bank accounts when they move house. (I live 80 miles away from my nominal branch.)
  • Sort codes changing, especially when banks merge or split.
  • Joint bank accounts
  • Business bank accounts

Debit and credit cards

You almost certainly shouldn’t be storing card details…! [www.theukcardsassociation.org.uk]

Incorrect details

  • Accidental mis-typing
  • Deliberate fraud – typically, the name and address might be real, but the mobile and email will be the fraudster’s.
  • System-testing : internal (dev or UAT environment) vs. external (penetration testing), manual/automated, regular (e.g. employees) / irregular (e.g. competitors testing capabilities; hackers!)
  • Details not existing: some people don’t have home telephone numbers (so put their mobile number in that field instead), whereas other people don’t have mobiles (so they put their home number instead).
  • People just messing around, possibly not-very-maliciously.

Other

  • Older people using younger family members’ email addresses and/or mobile numbers.
  • People who work overseas and have a non-UK mobile number and address; they could be a valid customer, as per your policies, but with only non-UK contact details. Do your systems accept a phone number that doesn’t start +44?
  • Driving licence / passport: most existing systems only validate the format of the identifying numbers, which makes them a target for fraudsters. Newer systems can validate images of the documents.
  • Device IDs are great for fraud detection, but can present a problem when matching people; families often share devices, and what about public computers in libraries and internet cafes?
  • Electoral Roll: Being on the full electoral roll at an address is no guarantee that the person is living there, and the converse is also true.

Third-party services exist to validate/verify almost all the information above, singly and in combination. However, none of the services are perfect, so matching person-level data comes down to cost (third-party data and development time), and your tolerance for mistakes – how embarrassing might it be if you get it wrong?

If you have any examples of when matching personal details has proved trickier than you thought it was going to be, please let me know in the comments below!


Introduction

Hello, my name is Pete and I’ve been building and maintaining Microsoft SQL Server databases for nearly 20 years. At no point has my job title ever mentioned SQL or databases; I started off building websites that used SQL Server as a back-end, and have somewhat drifted over the years to become a statistical modeller and analyst, but I still spend a large portion of my day typing SQL commands into Management Studio. I’m no expert in any aspect of SQL Server (and certainly no DBA), but I’ve built up enough experience to feel comfortable blogging about how I use it: what works for me, and what doesn’t.

Hence, I don’t claim that anything I write will be definitive, authoritative, or the ‘best way’ of doing it. I’m always learning, and SQL Server itself is an evolving product.

I think I should also mention that I generally work on databases up to a few hundred gigabytes in size. These days, such sizes are not considered “a lot of data”. I’m well aware that methods that work for smaller databases (e.g. databases measured in gigabytes) will not always perform as well for large databases (multi-terabyte and above). Caveat lector, “let the reader beware”!
