Regression: The analyst’s penknife

At work, I build regression models all the time; usually logistic regression, where the target variable is binary (0/1 or no/yes), e.g.:

  • Sales and marketing models (predicting which potential leads will convert)
  • Credit risk accept/reject scorecards (if we lend this person money, will they pay us back?)
  • Collections models (which of our customers is most likely to default in the next month?)
  • Fraud models (is this customer who they say they are?)

Occasionally, though, I build linear regression models too, where the target is a number (e.g. the number of sales). The modelling process is basically the same for linear and logistic regression; it’s just that the model output requires a different interpretation.

These models are all probabilistic – that is, there is a component of randomness in the model, which means there can never be a mathematical formula that produces the true output for any given inputs. But regression can help us do more than build probabilistic models: I want to mention a couple of times where I’ve used linear regression to uncover deterministic equations; that is, an exact and precise mathematical description that links the inputs to the output.

1. Decrypting a secret number (sort of)

A company presented questionnaires to people, where each question had multiple-choice answers. For various reasons, the answers were required to be secret within the database – an average developer was able to see the data, but they weren’t supposed to be able to link answers to a person. (This wasn’t ‘top secret’ data; the secrecy was more for marketing purposes than anything.)

The database tables looked something like this (I’ve only shown the important columns and keys):

CREATE TABLE dbo.Person
(
	PersonID INT NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE dbo.Questionnaire
(
	QuestionnaireID INT NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE dbo.Questionnaire_Completed
(
	QuestionnaireID INT NOT NULL
	,PersonID INT NOT NULL
	,PRIMARY KEY CLUSTERED (QuestionnaireID, PersonID)
)

CREATE TABLE dbo.Question
(
	QuestionID INT NOT NULL PRIMARY KEY CLUSTERED
	,QuestionnaireID INT NOT NULL
)

CREATE TABLE dbo.Answer
(
	AnswerID INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
	,QuestionID INT NOT NULL
	,AnswerKey INT NOT NULL		-- mysterious!
)

For each answer, we knew what question and therefore questionnaire it related to, so we could do:

SELECT
		q.QuestionID
		,a.AnswerID
		,COUNT(1) AS N
	FROM dbo.Answer a
	JOIN dbo.Question q
		ON q.QuestionID = a.QuestionID
	WHERE q.QuestionnaireID = 123
	GROUP BY q.QuestionID, a.AnswerID

to get the results for a particular questionnaire. This whole process was built and administered by a single person. One day, that person was unavailable, and there was an urgent problem relating to a particular questionnaire; for legitimate reasons, a higher-up needed to know the answers a person had given.

Now, if the table Questionnaire_Completed had a field showing when the entry was inserted (a default Create date, see this post, for instance), then matching the data would’ve been trivial: the order in which the questionnaires were done would’ve almost certainly matched between the Questionnaire_Completed and Answer tables. However, we didn’t have that date; and it was probably not recorded for this very reason, to prevent correlating the pieces of information that would instantly reveal who answered what.

What we did have was the mysterious field AnswerKey in the Answer table. It wasn’t immediately obvious what it represented; it was just an integer that appeared random:

1763546779
-1375398177
406260249
-241234845

So what was it? Clearly, it was very likely related to the PersonID (otherwise, why obfuscate it?), but purposely scrambled to make it difficult to join back. What to do? Well, linear regression tells you how numbers relate to other numbers, statistically, so that seemed a good place to start.

I created a rectangular dataset of what we had: QuestionnaireID, QuestionID, AnswerID, and AnswerKey (several thousand rows), and imported the data into R. Using linear regression (the glm function), I pretty quickly worked out that AnswerKey was of the form:

AnswerKey = k + a * QuestionnaireID + b * QuestionID + {SOMETHING ELSE}

How did I know this? Easy: the constant k and the parameters a and b were suspiciously close to being whole numbers, something that’s unlikely to happen by chance.
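To give a flavour of what this looks like in practice, here’s a minimal sketch – in Python rather than R, and with entirely made-up coefficients (k=7, a=31, b=17 are illustrative, not the real values). If AnswerKey really is an exact integer combination of the known columns, an ordinary least-squares fit recovers coefficients that land suspiciously close to whole numbers:

```python
# Illustrative sketch (not the real data): if AnswerKey is an exact
# integer combination of the known columns, a least-squares fit
# recovers suspiciously whole coefficients.
import numpy as np

rng = np.random.default_rng(42)
n = 5000

questionnaire_id = rng.integers(100, 200, size=n)
question_id = rng.integers(1000, 2000, size=n)

# Hypothetical 'true' formula: k=7, a=31, b=17 (made-up values)
answer_key = 7 + 31 * questionnaire_id + 17 * question_id

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones(n), questionnaire_id, question_id])
coef, *_ = np.linalg.lstsq(X, answer_key, rcond=None)

print(np.round(coef, 6))   # each estimate sits on a whole number
```

In R, the equivalent would be a `glm(AnswerKey ~ QuestionnaireID + QuestionID)` call and a glance at the coefficient table.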

But what was the {SOMETHING ELSE}? It didn’t take very long to figure out that it was a simple linear function of PersonID plus a fixed integer, one for each questionnaire. Given that we knew which PersonIDs had completed which questionnaire, we had a reduced set of possibilities – only certain values made sense.

The final formula for AnswerKey was:

k + a * QuestionnaireID + b * QuestionID + c * {Questionnaire Secret Number} + d * PersonID

Hence we could work out the PersonID for every record in the Answer table, because we knew all the other pieces of information.
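Once the parameters are known, the inversion is trivial. Here’s a sketch with hypothetical parameter values (none of these are the real ones – they’re purely for illustration):

```python
# Sketch of the inversion step, using made-up parameter values:
# once k, a, b, c, d and each questionnaire's secret number are
# known, PersonID falls straight out of the AnswerKey.
def person_id_from_answer_key(answer_key, questionnaire_id, question_id,
                              secret, k=7, a=31, b=17, c=101, d=13):
    """All parameter values here are hypothetical, for illustration only."""
    remainder = answer_key - k - a * questionnaire_id - b * question_id - c * secret
    person_id, leftover = divmod(remainder, d)
    assert leftover == 0, "parameters don't fit this record"
    return person_id

# Round-trip check with a made-up record
key = 7 + 31 * 123 + 17 * 4567 + 101 * 999 + 13 * 42
print(person_id_from_answer_key(key, 123, 4567, 999))  # → 42
```

The `leftover == 0` check is a handy sanity test: if the recovered parameters were wrong, the division wouldn’t come out exact.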

Overall, it probably took less than an hour to ‘crack’ the formula for the AnswerKey. As I said at the beginning, this wasn’t information that needed industrial-strength security, so the method used to hide it wasn’t inappropriate. However, this method of obfuscating values isn’t something anyone should consider for genuinely sensitive information, e.g. bank details. For that sort of information, SQL Server provides some far better tools: see here. (Or even better: don’t store the data yourself, get a third-party company to do it!)

2. Payments on loans with existing charges

NB: I’m working on a post and document which explains this piece of work in much greater detail – watch this space!

Let’s say a customer takes out a loan for £500 over 3 months, and the interest rate is 1% per day (this is a very high interest rate, illustrative purposes only!). If the months are January-March in 2019, then there are 31 days in month 1, 28 days in month 2, 31 days in month 3. How much should the customer pay off each month if (a) the loan is to be completely paid off after the 3rd payment, and (b) the payments are equal in amount?

This isn’t a very straightforward calculation, as the lengths of the months are unequal, and the outstanding principal balance (OPB) that accrues interest decreases each month. The answer doesn’t really matter here – but it’s £275.49 for the first 2 payments, and £275.47 as the final payment (2 pence less, due to rounding). In total, they will pay back £826.45, of which £326.45 is interest(!)
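For the curious, here’s a short sketch that reproduces the £275.49 figure. I’m assuming one particular accrual convention – simple interest within each month (OPB × 1% × number of days, settled by each payment) – which matches the figures above, though the production calculation may differ in detail:

```python
# Equal monthly payment clearing a loan, assuming simple interest
# within each month (OPB x daily rate x days, settled at each payment).
# This convention is an assumption; it reproduces the figures in the text.
from math import prod

def equal_payment(principal, daily_rate, month_days):
    factors = [1 + daily_rate * d for d in month_days]  # per-month growth
    total_growth = prod(factors)
    # Each payment j is 'carried forward' by the factors of later months
    annuity = sum(prod(factors[j + 1:]) for j in range(len(factors)))
    return principal * total_growth / annuity

payment = equal_payment(500, 0.01, [31, 28, 31])
print(round(payment, 2))  # → 275.49
```

Stepping the schedule forward with two payments of £275.49 leaves exactly £275.47 owing in the final month, matching the 2-pence rounding difference mentioned above.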

Now, I inherited some code that calculated these payment amounts, but it was very messy, and there was even a separate function for each number of months. Starting from first principles, I worked out a single equation that took the number of months as a parameter. This worked fine for calculating initial payment schedules, i.e. those for new customers.

However, the business also needed an equation for when payments were being ‘rescheduled’, and there were charges present – e.g. our customer had paid off some of the loan, but had incurred a late fee (£10 say), and was now asking to spread the repayments over 4 months instead of 3. This is where it gets trickier: charges don’t (legally, can’t) accrue interest, so our original equation no longer works. And it was far from clear how to adapt what we had to include this separate balance that didn’t accrue interest.

It’s actually possible to numerically calculate the payments for any schedule – all you need is code that steps through the process of accrual and payment. You put your guess in at the top of the code, and see if the balance is zero at the end. If not, adjust the guess slightly and run it through again – keep going until you get a balance of zero. This is standard stuff; see e.g. Newton’s method for an optimised way of getting to the answer faster.
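Here’s a sketch of that guess-and-adjust loop, using bisection (the simplest variant – Newton’s method would converge faster). I’m again assuming, purely for illustration, simple interest at 1% of the outstanding principal per day, settled at each payment:

```python
# Guess-and-adjust: step the schedule forward, check the closing
# balance, and bisect on the payment amount until it reaches zero.
# The simple-interest convention here is an illustrative assumption.
def closing_balance(payment, principal=500.0, daily_rate=0.01,
                    month_days=(31, 28, 31)):
    for days in month_days:
        principal += principal * daily_rate * days  # accrue interest
        principal -= payment                        # make the payment
    return principal

lo, hi = 0.0, 1000.0          # bracket that must contain the answer
for _ in range(60):           # bisection: halve the bracket each pass
    guess = (lo + hi) / 2
    if closing_balance(guess) > 0:
        lo = guess            # paid too little: balance still positive
    else:
        hi = guess            # paid too much (or exactly right)

print(round(guess, 2))  # → 275.49
```

A non-interest-bearing charges balance would slot into the same loop as a second running total that payments reduce but interest never touches – exactly the complication that motivated finding the exact equations.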

So for any input we care to put in, we can actually get the correct answer. Doesn’t this automatically solve our problem? Why do we need the precise equations? Three reasons: (1) an equation will give us exact answers, and we’ll understand the precision of the answers we’re getting; (2) an equation will (well, should) take fewer steps to work through, hence less server processing; (3) the satisfaction of working it out!

At this point, I could generate the output (monthly payment) for any input (loan amount, interest rate, number of periods + their lengths, outstanding charges balance), but I just didn’t know how they were related exactly.

I randomly generated values for tens of thousands of inputs, calculated the output using the iterative procedure described above, and made a rectangular dataset. In R, as before, I used glm to start fitting models that helped me identify the relationship between the input and output values; in fact, I came up with code to auto-generate thousands of model formulas, and let the computer do even more of the work for me. For instance, if we have inputs A, B, C, D and an output E, then we can let the computer come up with new formulas at random:

E ~ A + D + (1/B)
E ~ A + B + (C/D) + ((A-B)/(C+D))
E ~ (C-B+A)/(D+A) + (B*C*D)/(A-3*D)
E ~ (A-3*B-C^2)/(sqrt(D+3)-sqrt(A*B))

etc. (The algorithm was a bit smarter than this; I’m just giving a flavour here.) It took a few evenings of trial and error playing with the code, but eventually this approach worked, and I zeroed in on the exact equations I needed. Unfortunately, it turns out there’s no single equation, but a separate one for each number of periods – however, they’re very ‘constructable’, and the resulting function in code is very straightforward. I didn’t yet understand how to derive these equations from first principles – that was the next step – but I knew what I was aiming for, which made that task much easier.
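A toy version of the idea, translated to Python: score candidate formulas built from derived terms, and keep whichever fits the target best. The real search generated richer formulas at random in R; for a deterministic sketch, this version simply tries every three-term combination from a small pool, against a made-up ‘true’ relationship:

```python
# Toy formula search: fit a linear model over each small combination
# of derived terms, and keep the combination with the lowest residual.
# The 'true' relationship E = A + 2*(C/D) is made up for illustration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = (rng.uniform(1, 10, size=500) for _ in range(4))
E = A + 2 * (C / D)          # hidden 'true' relationship (made up)

# Candidate derived terms (the real search generated these at random)
names = ["A", "B", "C", "D", "A*B", "C/D", "(A-B)/(C+D)", "sqrt(A)"]
terms = [A, B, C, D, A * B, C / D, (A - B) / (C + D), np.sqrt(A)]

best = None
for idx in itertools.combinations(range(len(terms)), 3):
    X = np.column_stack([np.ones(500)] + [terms[i] for i in idx])
    coef, *_ = np.linalg.lstsq(X, E, rcond=None)
    rss = float(np.sum((X @ coef - E) ** 2))
    if best is None or rss < best[0]:
        best = (rss, [names[i] for i in idx], np.round(coef, 4))

print(best)   # the winning combination fits E almost perfectly
```

Any combination containing both `A` and `C/D` fits essentially exactly, and the fitted coefficients (1 and 2) read the hidden formula straight back out – which is precisely the ‘suspiciously clean fit’ signal the real search looked for.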
