Building on my post about 7 sins of security metrics, here are 4 principles I’ve found valuable when building metrics with CISOs and security teams, along with an explanation of why.
This isn’t a comprehensive list, and I’d be keen to hear any thoughts about other areas or differences in opinion on this topic.
- When you start developing security metrics for a problem area, don’t plunge into trying to analyse ‘risk’. Instead, start by measuring the performance and efficacy of your security processes.
Often, to the stakeholders receiving them, the things security teams report as ‘risk metrics’ seem either abstract (e.g. number of technical-sounding-things that happened) or subjective (e.g. probability that bad-thing-X will occur). As one head of analytics at an investment bank put it: “Our CFO hates our risk management meetings because they look at these numbers we give them and have no idea if they mean we’re better or worse than last time … whether everything’s all good, or they should be worried.”
Then let’s add to the mix that security’s stakeholders (typically CxOs, IT teams, business lines, risk/audit) have different opinions about what ‘risk’ actually is – either because of their role, or their incentives. With all this put together, it’s easy to see why we’re in an uphill battle to create shared understanding of a problem across a diverse group of people, all of whom need to pull in the same direction for security to get stuff done.
Making the shift from reporting risk to reporting process goes a long way towards tackling these challenges. You stand a much better chance of creating common ground for a conversation about improving security by delivering insights about how successful a process is right now and what actions can deliver the greatest uplift in effectiveness and/or efficiency.
As one CISO we work with put it: “Leading with process has made it easier to drive acceptance of a metric by our stakeholders, because it gives us a common language that makes more sense than ‘risk’ … at least in the first instance.” This is not only because you can put process metrics in the context of other people’s roles, decision-making power and accountability. It’s also because as a starting point, process pulls on a thread of actions and options that connects all the way from analysts and engineers, up to operational managers, and to CxOs who control the purse strings from one quarter to the next.
P.S. Anecdotally, a few security officers we’ve spoken to recently tell us they are seeing focus at Board level shift away from risk metrics and towards performance metrics, because as one CISO put it, Boards see risk metrics as being “things that create more work than they solve”.
- Before you start creating metrics, invest enough time to understand the reality of the processes you are measuring. Do not hope that metrics will reveal process to you; at best they will show you a shadow of the truth.
If you don’t fully understand a process before you set out to build metrics, what you end up with will inevitably be outcome-focused. Your metrics may answer the question: ‘Does our data reflect my expectations of what our environment should look like?’ However, if the answer is ‘No’, you’ll be unable to advise on next best actions, because you won’t understand what processes, or lack thereof, created the result you’re looking at. (Also, if the result looks good but is achieved by massively inefficient means, you won’t know that you need to improve efficiency.)
In our experience, to avoid this, it’s vital to get into the trenches and understand first-hand how things are really done.
For starters, accepted process may be very different from what’s written down, or quoted to you when you first ask. Even if you ask lots of good questions in discovery meetings, there will be lots of implicit knowledge and assumptions in people’s heads that they won’t think to mention, either because they’re so used to them, or they won’t view them as noteworthy. (Fun exercise: try to accurately describe making a cup of coffee and see if you don’t miss something out!) Also, there will be lots of nuances in how things are done by different IT teams who support different platforms in different geographies. Appreciating all this up front and engaging with the people involved will help massively with usability, and therefore acceptance, of the metrics you eventually build.
The other reason getting close to process is important is that you’re on a quest to develop metrics that are ‘good enough for today’. So rather than over-engineering for an end state on lots of long and tedious conference calls with 20 people that result in zero agreement on what to measure, it’s far more productive to figure out where you are now process-wise and how to get somewhere better in as big a step as is feasible for your organisation – all in a justifiable timescale.
If you don’t do this, you risk ending up trying to deliver tectonic shifts in global processes with an admirable (if perhaps lofty) goal in mind. Meanwhile nothing happens because process / governance change is hard, the metrics don’t get better, people get frustrated and you lose momentum.
- Create one metric per process you’ve identified, then apply it only to the data that process can impact. For a given dimension of a security control (e.g. ‘vulns beyond SLA’ for Patch Management), you’ll be able to address some of the data that pushes a metric out of tolerance immediately, then track improvement. Other data won’t change until you change, or create, an internal process. So separate these out!
Metrics come into their own when they act first as a tool to help people understand what’s going on, then how to solve problems at best cost/least effort, then how to track progress and measure their success.
As metrics are developed and you get new views on what your data is telling you, you’ll likely uncover things that are slipping between the cracks of existing processes. If you have a metric that tracks the current status of a given process, but you require several projects to deal with legacy issues that existed before the process was created/updated, there is no harm in splitting that historical data out, and tracking those remediation projects separately. This will help you avoid the situation where your metric confuses good current performance with past problems that are now under management via a different – and possibly longer term – process.
This will ultimately allow you to present metrics to your risk committee with more confidence in how they can be improved and on what timescales, as they are directly influenced by specific actions.
Measuring everything together (i.e. lumping together data representing a problem that has processes or initiatives in place to manage it, as well as data that relates to an issue that no one is acting on) will just lead to confusion. Worse, it may also obscure good progress made by your team! A metric shouldn’t penalise anyone for areas they’ve not been tasked with working on.
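To make the split concrete, here’s a minimal Python sketch of the ‘vulns beyond SLA’ example. Everything specific in it is an assumption for illustration: the `Finding` fields, the 30-day SLA, the process start date and the three stream names (`current_process`, `legacy_project`, `unowned`). The point is only that each finding is counted against the one process that can actually change it.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Hypothetical values, for illustration only.
PROCESS_START = date(2023, 1, 1)  # when the current patching process took effect
SLA_DAYS = 30                     # assumed remediation SLA


@dataclass
class Finding:
    asset: str
    first_seen: date
    in_remediation_project: bool = False


def stream_for(finding: Finding) -> str:
    """Assign a finding to the process that can actually impact it."""
    if finding.first_seen >= PROCESS_START:
        return "current_process"   # the live process owns this
    if finding.in_remediation_project:
        return "legacy_project"    # tracked separately, on its own timescale
    return "unowned"               # nobody is acting on this yet


def beyond_sla_by_stream(findings, today):
    """Return {stream: (count_beyond_sla, total_count)}."""
    counts = defaultdict(lambda: [0, 0])
    for f in findings:
        stream = stream_for(f)
        counts[stream][1] += 1
        if (today - f.first_seen).days > SLA_DAYS:
            counts[stream][0] += 1
    return {stream: tuple(c) for stream, c in counts.items()}


findings = [
    Finding("web-01", date(2023, 5, 20)),
    Finding("web-02", date(2023, 4, 1)),
    Finding("db-01", date(2022, 3, 1), in_remediation_project=True),
    Finding("old-01", date(2021, 7, 1)),
]
result = beyond_sla_by_stream(findings, today=date(2023, 6, 1))
for stream, (late, total) in sorted(result.items()):
    print(f"{stream}: {late}/{total} beyond SLA")
```

Reported as one number, this toy estate is 3 out of 4 beyond SLA. Split out, the current process is at 1 out of 2, and the rest sits with a remediation project or with no owner at all, which is a very different conversation to have with a risk committee.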
- If a metric highlights instances of process failure, figure out if those instances are outliers, or if there is a systemic problem. Actions to remediate these problems are likely to either be different, or require different timescales to address.
Sure, we all hope that processes will result in achieving an expected level of security/reduction in exposure to impact – and that they’ll be followed to the letter. But, where this isn’t the case, we need to dig into why.
For example, let’s say we’re experiencing a growing number of vulnerabilities across our estate. Is this a systemic problem? Are all machines equally bad? Or do we have specific outliers? Do a small number of devices account for most of the vulnerabilities we are detecting? (AKA, an 80/20 problem.)
If all devices are affected, this suggests the process we have is not delivering the result we want, so we need to review it. On the other hand, if we have an outlier problem, this suggests our process is working for most of the estate … so why is it failing on these machines?
By figuring this out, we can understand whether a very focused effort will solve the problem. Is there a commonality across the machines causing the issue (e.g. software, platform, geography, business service they support)? If so, we can come up with a targeted campaign of action to dramatically improve the overall health of the environment. If the issue is scattered across the estate with no obvious pattern, we need to go and speak with the device owners to work out what’s happening.
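As a rough illustration of the outlier-versus-systemic check, here’s a small Python sketch. It assumes you already have a count of open vulnerabilities per device plus a few attributes for each one. The device names, the attributes, the ‘top 20%’ cut and the 0.6 threshold are all made up for the example, so treat them as starting points to tune rather than recommended values.

```python
import math
from collections import Counter


def top_share(vulns_per_device, top_fraction=0.2):
    """Share of all vulnerabilities held by the worst `top_fraction` of devices."""
    counts = sorted(vulns_per_device.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    k = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:k]) / total


def classify(share, threshold=0.6):
    """Hypothetical rule of thumb: a high concentration suggests outliers."""
    return "outlier" if share >= threshold else "systemic"


def worst_devices(vulns_per_device, top_fraction=0.2):
    """Names of the devices in the worst `top_fraction`, worst first."""
    ranked = sorted(vulns_per_device, key=vulns_per_device.get, reverse=True)
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[:k]


def common_attributes(attributes, devices, keys):
    """For each attribute key, count the values seen across the given devices."""
    return {key: Counter(attributes[d][key] for d in devices) for key in keys}


# Illustrative data only.
vulns = {"d1": 40, "d2": 40, "d3": 3, "d4": 2, "d5": 3,
         "d6": 2, "d7": 3, "d8": 2, "d9": 3, "d10": 2}
attrs = {name: {"platform": "linux", "geography": "emea"} for name in vulns}
attrs["d1"] = {"platform": "windows-2012", "geography": "emea"}
attrs["d2"] = {"platform": "windows-2012", "geography": "apac"}

share = top_share(vulns)
verdict = classify(share)
outliers = worst_devices(vulns)
common = common_attributes(attrs, outliers, keys=("platform", "geography"))
print(f"top 20% of devices hold {share:.0%} of vulns -> {verdict}")
print(common["platform"].most_common(1))
```

In this toy data, two devices hold most of the vulnerabilities and share a platform, which points towards a targeted campaign. If the counts were spread evenly, or the worst devices had nothing in common, that would point back to reviewing the process or talking to the device owners.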
PHEW! That’s probably enough for now 🙂
I’ll post again soon to draw out some of the points above in more detail, with a specific example around vulnerability management metrics. In the meantime, if you have any questions, please let me know below…