Archive for January, 2017
At a previous place of work, we had a central reporting server containing hundreds and hundreds of reports, covering every section of the business. Manifestly, reports are vital to running a company, especially a finance company that essentially makes its money by shifting data from one database to another. One day, a colleague decided to spring-clean the whole reporting structure, and what better way to start than pruning the dead wood: getting rid of reports that were no longer used — there have got to be at least a few, right? What he found was pretty astonishing: 95% of reports hadn’t been looked at in the last 3 months. Ninety-five percent! (Maybe some of those were quarterly, six-monthly or yearly reports; everything was archived, not actually deleted, so could’ve been reinstated if necessary. I don’t remember that happening.)
The lessons were clear: managers were asking for reports to be built when they didn’t need them. Either they never needed them, or one-off data pulls would’ve sufficed. Now, reports aren’t free; they cost time and money:
1. Initial planning — any of: a quick chat, an email, a meeting, a proper design spec, a full project plan.
2. Writing the query behind the report, which is never as simple as SELECT * FROM Table WHERE CreatedDate BETWEEN @StartDate AND @EndDate. Often, data from different servers needs to be brought together, which involves asking a DBA to replicate data (something else that has to be managed and tracked). The query will undoubtedly involve some level of aggregation, which may mean creating new tables and scheduled jobs to keep the aggregate data refreshed (another place where milestone tables come in handy).
(Of course, if you’re lucky enough to have a comprehensive, well-maintained data warehouse, then the above might be greatly simplified.)
3. Writing the report itself, which might be a matter of minutes, or could take days if the output has to be, say, formatted for printing, and/or has hierarchical expand/collapse functionality, and/or has many input parameters, especially if they affect the layout as well as the data that gets returned.
4. Testing the report, deploying the report, getting it emailed out on a schedule, etc., etc.
Reports aren’t free, and realising that 95% of anything was ultimately unnecessary is as hugely annoying for the report writer as it is for their manager, who could’ve used that resource elsewhere.
So, lesson learnt, and the company went forth with a steely resolve to ensure that reports were only built if the business need for them could be proved beyond doubt, and this came to pass. Dozens and dozens of new reports were built, but each one was vital, and had a clear purpose.
You might be able to guess the next part… The same spring-clean exercise was repeated over a year later: 80% of the reports hadn’t been looked at in the last 3 months.
What’s to be gleaned from this? At the heart of it is still the issue that reports are perceived as ‘cheap’. The only way to prevent this level of waste is … to somehow levy a charge on the original requester? Incur a penalty if reports go unused? Accept the whole situation as inevitable? I don’t know the answer. If you have any practical ideas, please let me know in the comments.
1. During the periods in which these reports were going unused, the business was constantly learning and changing. If, prior to the first exercise, I’d been asked to guess how many reports weren’t used, I’d have said 30-50%.
2. I’m not trying to imply that my previous workplace was sloppy or inefficient; it wasn’t. I’ve seen this happen more than once; it’s just that the numbers were so striking that time.
3. The cleverest people I’ve ever worked for didn’t require lengthy reports to help them perform their duties; they could often elicit the truth from a handful of numbers. A skill I will always be envious of!
When building a credit risk scorecard, it’s standard practice to take a continuous variable and discretise (or ‘bin’) it into a small number of bands*. A common approach is to:
- Partition the variable into 10-20 subsets of equal size — This is called ‘fine classing’
- Use bad rates** to combine similar adjacent subsets, to produce a variable with fewer levels, while not overly reducing its significance — This is called ‘coarse classing’
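The ‘fine classing’ step can be sketched in a few lines of R. (This is a minimal illustration on made-up data; the variable name Age and the choice of 10 bins are just for demonstration.)

```r
# Fine classing: split a continuous variable into 10 equal-sized
# subsets, using its deciles as the break points.
set.seed(42)
Age <- runif(1000, min=18, max=65)  # made-up data, purely for illustration
breaks <- quantile(Age, probs=seq(0, 1, by=0.1))
FineClass <- cut(Age, breaks=breaks, include.lowest=TRUE)
table(FineClass)  # each fine class contains 100 cases
```

From here, adjacent classes with similar bad rates would be merged in the coarse classing step.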
There’s a neater, simpler way to work out a good set of bands for our continuous variables, using decision trees.
* A ‘proper’ statistician would never do this, but this is just what we do when we build credit risk scorecards. Please don’t blame me! 🙂
** Or odds ratio, or chi-squared, or whichever statistic makes most sense to you. Personally, I use Fisher’s exact p-value.
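As an aside, the coarse-classing comparison can be sketched with Fisher’s exact test. (The counts below are made up, purely to illustrate the mechanics.)

```r
# Compare the bad rates of two adjacent fine classes; a high p-value
# means the rates aren't significantly different, so the classes are
# candidates for merging. (Counts are made up.)
adjacent <- matrix(c(40, 60,    # class A: 40 bads, 60 goods
                     45, 55),   # class B: 45 bads, 55 goods
                   nrow=2, byrow=TRUE,
                   dimnames=list(c("classA","classB"), c("bad","good")))
fisher.test(adjacent)$p.value  # high p-value => merge the classes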
For the sake of example, let’s say we have a dataset composed of a binary outcome, Bad (e.g. ‘Went s payments down within t months: yes/no’), and a single explanatory variable: Age.
We’ll generate a dummy dataset:
```r
# Let's add an air of sophistication by generating our variable
# from a truncated normal distribution:
library(truncnorm);
Age <- rtruncnorm(1000, a=18, b=65, mean=35, sd=10); # 1000 samples
summary(Age); # Verify that 18 <= Age <= 65; run hist(Age) to check it looks 'normal'

# Dummy up a relationship between Age and the outcome:
z <- (-0.1 * Age) + 1.5;
prob <- 1/(1+exp(-z)); # inverse logit function
Bad <- rbinom(1000, 1, prob);
table(Bad);
# Bad
#   0   1
# 857 143
```
For this dummy dataset, our bad rate is 14.3%.
```r
# Check the Age coefficient is ~ -0.1, and the intercept is ~ 1.5
# (they probably won't be very close)
glm(Bad ~ Age, family=binomial);
# (Intercept)          Age
#     1.10617     -0.08565

# As you can see, not very close.
# (What happens with 10,000 samples, instead of 1,000?)
```
rpart is an R function (and package) for creating decision / classification trees; see, for example, r-bloggers. Let’s try running rpart on our data:
```r
library(rpart);      # Load the already-installed rpart library
library(rpart.plot); # Underpins the fancy-looking decision trees
library(rattle);     # Provides fancyRpartPlot()
rp <- rpart(Bad ~ Age); # Similar formula-based syntax to glm()
fancyRpartPlot(rp);
```
At the bottom of the diagram, you can see 4 leaf nodes, hence we have 4 age bands — which seems ok. However, look at the bottom-right node: it’s only got 7 cases in it! When building scorecards, we don’t want bands with so few cases in them; they won’t be stable over time.
Fortunately, we can specify the minimum size a node can be:
```r
rp <- rpart(Bad ~ Age, control=rpart.control(minbucket=50));
fancyRpartPlot(rp);
```
Only 3 bands this time, but the minimum band has 137 cases — much better. Let’s see what the thresholds are:
```r
rp$splits[,"index"];
#      Age      Age
# 35.29226 25.94350

# Create the banded variable:
AgeBand <- cut(Age, breaks=c(-Inf, 25.94350, 35.29226, Inf), right=FALSE);
```
If we crosstab AgeBand and Bad, we get:
```
              0    1  Total       %
[-Inf,25.9)  94   43    137  31.387
[25.9,35.3) 290   66    356  18.539
[35.3, Inf) 473   34    507   6.706
Total       857  143   1000  14.300
```
(I used table to do the crosstab, then used addmargins and cbind to add the margins and percentages.)
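For completeness, here’s one way the crosstab could have been assembled with table, addmargins and cbind. (This sketch rebuilds the data from the counts in the table above so it runs standalone; in practice you’d tabulate your actual AgeBand and Bad vectors.)

```r
# Rebuild the banded data from the counts in the crosstab:
bands <- c("[-Inf,25.9)", "[25.9,35.3)", "[35.3, Inf)")
AgeBand <- factor(rep(bands, times=c(137, 356, 507)), levels=bands)
Bad <- c(rep(1:0, c(43, 94)), rep(1:0, c(66, 290)), rep(1:0, c(34, 473)))

tab <- addmargins(table(AgeBand, Bad))        # counts, plus 'Sum' row/column
pct <- round(100 * tab[, "1"] / tab[, "Sum"], 3)
cbind(tab, `%` = pct)                         # append the bad-rate percentages
```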
Clearly, as age increases, the bad rate decreases significantly. The IV (Information Value) for AgeBand is 0.4954, so it’s a variable that would be a definite candidate for inclusion in our final scorecard.
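If it’s useful, the IV figure can be reproduced from the crosstab counts using the standard weight-of-evidence calculation (nothing here is specific to this post beyond the counts themselves):

```r
# Information Value from the banded counts above:
good <- c(94, 290, 473)  # Bad = 0 in each band
bad  <- c(43,  66,  34)  # Bad = 1 in each band
goodDist <- good / sum(good)
badDist  <- bad / sum(bad)
WoE <- log(goodDist / badDist)          # weight of evidence, per band
IV  <- sum((goodDist - badDist) * WoE)
round(IV, 4)  # 0.4954
```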
Although this is an easy method of working out the bands, I’d still recommend the traditional method alongside, as (a) you’ve got more control over the combining of ‘fine’ classes, and over the relative percentages of bads in each band, and (b) it’s useful to have more than one discretised version of a variable available to the model building process — especially if your scorecard is based upon a logistic regression, and not a set of ‘weights of evidence’. The regression takes correlation into account, and hence the ‘best’ bands for a particular variable can be different once other, more information-rich variables have been added to the model.
In my previous post, I wrote about GUIDs (aka UUIDs), marvellous datatypes that allow you to uniquely identify a piece of information. Because the probability of randomly generating two the same is so small, there’s a sense that merely possessing one as a key to a piece of data, means the data is somehow secured. This is false, in two senses — the first sense, not so critical; the second, very critical.
Before that, let’s look at how they might be used. I’ll use URLs / webpages as an example, because it’s a common application.
Let’s say I’ve written my own blogging system. The text of each article is stored in a database table, and the primary key of the table uniquely identifies my blog posts. When I list my blog posts so people can click on them, the URL might look like this:

https://myblog.example.com/ShowArticle?id=17
The unique identifier — in this example, the number 17 — forms part of the URL, and someone can look at all my blog posts merely by changing the number after the equals sign. It’s ok, it’s only a blog; the worst they can do is put in a number of a post I haven’t published yet. Depending on how I’ve written my blogging engine, they’ll either see an unfinished post, or (preferably) a message saying there’s currently no live article with that id.
In the bad old days of internet website development, lesser developers used integer IDs to refer to data more sensitive than a blog post:

https://yourbank.example.com/ShowTransactions?accountId=123456
https://yourhospital.example.com/PatientRecord?patientId=987654
(Of course, I’m making these URLs up!) The website might ask you to log into some part of the site via a secure method, but once you were in, you’d be able to view the records of anyone on the system! Clearly a terrible thing, but this type of security hole has always existed; here’s an example from a few years ago, concerning Apple’s password reset process.
But it’s ok, we know about GUIDs now! GUIDs to the rescue! If the URL looks like this:

https://yourhospital.example.com/PatientRecord?patientId=8c1f02a3-9d4e-4b7a-b1c6-5e2f8d730a41

, and the website retrieves each patient’s private information via a GUID, then we’re perfectly safe, no?
No, not safe (sense 1)
In the sense that a GUID is ‘hard for a human to guess’, yes, you’re pretty safe. No-one is going to randomly type in some digits and hit upon a valid GUID. BUT:
1. What if it’s not someone typing? What if it’s a piece of software making thousands of guesses per second? What if it’s a distributed botnet, millions of pieces of software each making thousands of guesses per second? * And your site has millions of users? The probability of making at least one correct guess would not be negligible.
[* Presuming your server can cope with all those requests…]
2. What if your GUID isn’t all that random? There are people out there who, given enough examples, can derive information about your server’s random number generator, and make accurate guesses about historical and future GUIDs. The situation is worse still if you’ve used sequential GUIDs; and potentially catastrophic if the developer who’s written the code hasn’t really understood the point of them (see previous post).
Point 1 can be addressed by putting checks in place for incorrect guesses. Modern server set-ups will allow you to block IP addresses after unusual patterns of requests have been detected.
Point 2 is harder to address: how truly random is the code that generates your GUIDs? In practice, you just don’t know. Random number generation is an entire academic discipline; you could devote the rest of your life to assessing the various generators!
But the above doesn’t matter because…
No, not safe (sense 2)
This is the crux of the matter: obscurity (hiding something) is not security. All it does is make something harder to locate, not impossible. Just because you can’t guess a GUID, it doesn’t mean there aren’t other ways of obtaining them, e.g.:
- Leaking (e.g. via accidental email forwarding)
- Social engineering
- Mistakes (e.g. devs ‘temporarily’ storing them in webpage HTML)
The point is that proper security needs to be applied on top, in all cases. Where sensitive information is concerned, people should be logging in securely, with as robust a system as it’s possible to use. So even if a bad person stumbles across information (e.g. GUIDs) they shouldn’t have, they can’t use it, because they don’t have the requisite access.
To sum up, using a GUID on its own is nowhere near adequate security; there’s much more to be done. A final tip: Given that there’s no such thing as “100% secure”, your goal should be to record all user activity on your site / app, and dive into it regularly to check for breaches or unusual patterns. As a bonus, you get to see how users are really using your software — I promise there’ll be some surprises in there!