05
Feb 21

Observability

TL;DR:
Our industry doesn’t have the right definition of Observability and it’s going to cost enormous amounts of money and cause continued outages until that changes. ((Normally in this blog I don’t talk about things close to work much, because I’m a software engineer at a large company and engineers in that position leave work at work most of the time lest we be perceived as speaking for our company by people who don’t yet understand that companies do not usually let engineers handle public relations for rather good reasons. This post is an exception because it’s a very nonspecific thing that impacts our entire industry, not just one company. It’s my thoughts on this, not my company’s (they don’t pay me enough to work in PR).
For context, I’ve been working in monitoring for a decade or so and relying on it for about three decades. My original background was in both electronics and computers, so it focussed more on control theory and statistics and the device driver level of programming than it did on writing word processor applications and the like. Because of that I came to the monitoring field with a formal background in what monitoring is actually for and that influences my opinions.))

I’ve been getting increasingly frustrated with something for the last year or four. Mostly it gets dismissed because it’s difficult to monetise even though it’s not difficult to actually understand or implement. Engineers are so close to it that they rarely think about it and vendors think it’s not readily profitable which is a situation that lends itself well to being deliberately ignored. Specifically, I’m talking about the large gap between what our industry calls Observability and what Observability actually is.

And that gap is not some academic triviality – it means that there is a massive gaping hole in our tooling which directly results in lost revenue and a lot of grief and unnecessary work for engineers.

Observability tooling is simply not fit for purpose right now, not from any vendor. It’s not enough to have nice UIs and to be able to click from a metric to a log to a trace. Those are “nice to haves”. What we actually need to have is support in our tooling to let engineers readily codify their own mental models of how their service works so everyone else can see what’s going on without needing all the context and experience that the original engineers had to have to build up those mental models.

Detection of the difficult, nuanced nasties in our systems, and correctly answering hard questions like “how badly will this AWS API degredation affect our app’s latency” or “how much is it going to cost to run this service in two years at this growth rate” cause headaches in almost every company at every scale above “we have just one engineer and she knows how everything works”. Answering those questions correctly while engineers’ mental models are still stuck in their heads is often difficult or even impossible, because the answers are formed by combining several people’s knowledge and it won’t all fit in one head. We need to be able to codify models because that lets other people using them with their data to answer these questions, or even automate answering them.

But right now we can’t do this easily, because in the monitoring industry over the last few years, the term “Observability” has been used exclusively as a marketing term; at first by a few vendors, and then more and more, but they have all used it to mean, in essence, “our product”. None have used the actual definition of the term or anything close to it, and so nobody has been working on the missing tools we need to achieve it.

The difference between what the term actually means and what we’ve been using it for is why just pushing to adopt what any particular vendor describes as observability tools isn’t a solution to any monitoring team’s problems. Observability in reality is a very fundamental concept and like most fundamentals, it can be stated in a very straightforward and easy to understand way. In five steps, it goes like this:

1. We have a system. This system can be anything – whether that is your company’s application as a whole, a particular service, a particular host or anything else does not matter.

2. We have a model of this system which has some number of internal states. Whether that’s two states (“Working” and “Broken”) or two million does not matter.

3. We have some outputs from the system. Whether these are in the form of metrics, logs, traces or something else does not matter.

4. If we can tell what internal state the system is in just by watching the outputs, then that system is Observable. If there are states the system can be in, but we cannot tell that we are in them from watching the outputs, then that system is only partially observable.

5. The system’s Observability is a measure of how many states we can tell we are in compared to how many states there are in our model.

Yes, I know that the Wikipedia definition is more involved. And if you’ve studied state space modelling and done a few courses on control theory, the Wikipedia definition is a better one. If you’re also familiar with the history of formal methods in computer science, you’ll know why this doesn’t really matter; and if not, the short version is that we as an industry can’t really create formal models for our systems because we don’t understand them well enough and we don’t teach the math for that to enough engineers, so a simple view of Observability is all we can use right now. But it is sufficient to do real work with.

Mostly the industry has looked at the original definition of Observability – when it’s looked at it at all – from the point of view of outputs, and making outputs more readable and making it easier to create new outputs. That’s where we got the deeply unhelpful “three pillars of observability” meme from. But for us as engineers and for companies in general the far more important part today is the other critical component the definition talks about — it requires that we have a model of the system. Monitoring is necessary but not sufficient to give us Observability.

Right now, we can pick any vendor you choose, build the ultimate data platform, ship all the metrics and logs and traces and any other data we can think of so that we can view it at any resolution, at any frequency of updates, in any combination, using any graphical display and we still won’t know what is going on without that model. We could have built the world’s best monitoring system and we wouldn’t be able to tell what’s going on.

Right now, that model only exists in the heads of us as engineers (especially those of us who pull shifts on-call). We can read some specific metrics or logs and tell you that a particular service is broken and how badly and why and what has to be done to fix it. And if one of us walks in front of a bus tomorrow (or, y’know, catches Covid-19), or gets poached by another company, or retires; well that model is now lost for as long as it takes another engineer to relearn it, if they can. It also means that onboarding of new hires takes longer than it has to, and that our systems are more opaque to the rest of the company.

Don’t believe me? Sit a C-level executive or a new hire down in front of a raw dashboard and ask them what the system is doing. They won’t have the context, so they won’t know. If they don’t give up immediately and think it’s a prank, they’ll probably look at the dashboard and in front of your eyes, try to build a model from scratch using any metadata they can find like the titles of graphs or the units on axes or using past data in the graph to see if the last hour looks different from the last week. You’ve probably seen new hires doing this during their onboarding if you were supervising, training or mentoring them.

This is the largest weakness in the monitoring industry’s market offerings at present. The industry is wholly focussed on extracting outputs in various forms from the system and presenting them to engineers in various ways, or giving them toolkits to do so themselves. Competition has centered on how many data points a vendor can ingest, on their pricing models, on comparative evaluations of their UIs and several other things of little fundamental merit. Very little attention has been given to developing tools that allow us to extract the model from the engineers’ heads and codify it so that people without their institutional knowledge can understand what is going on.

Yes. We’ve all heard of the wonderful sunlit uplands described with marketing terms like MLOps and AIOps, and there’s a wealth of things that could be said about this citing a history going back through Machine Learning, AI, Big Data, Heuristics, AI again, Expert Systems and even further back depending on how long you’ve been in this industry, but suffice it to say that these systems cannot do this job, they explictly have no deeper context and just try to find patterns in data without understanding the data. When the lead of Google’s ML-for-SRE program says so openly in conference talks, maybe the time has arrived for the industry to accept that throwing data at a black box and hoping for magic to happen just isn’t a workable plan here.

Worse yet, the current trend is to talk about how Observability solves the “unknown unknowns” problem, ie. recovery when the system is in an unknown state. This is a category error. Observability is basically the ratio of the number of states you can tell you’re in to the number of states in your model; if you’re in an unknown state then your model was incomplete and how can you now use observability to fix your model? The idea is gibberish.

The reality is that if you are in an unknown state, you don’t know what’s going on, by literal definition. You need to build a new model or extend an old one to cope. This puts you in the category of fault diagnosis, not monitoring. You will be actively opening up the black box of the system and poking around to figure out what happened and that puts you outside the entire mathematical framework of observability. In that scenario, Observability is the end goal, not the means to get there. And you will need different tooling and there’s precious little going on in that space either.

The obvious question at this point is, “So what? Why does this matter? How does it cost us money to not do this?”. Okay, let’s look at the money.

Monitoring SaaS providers have various different billing models but to one extent or another, they all boil down to this: you pay according to the number of data points you produce. Produce more points, pay more money. This is less pronounced for on-prem solutions but their resource requirements mean that the same effect is there, just less directly measured and harder to plan for. This creates a tension between Engineering teams wanting to ship all the metrics they can at a very high frequency; and Finance teams wanting to reduce the final bill that creates. And one of the oldest tropes here is when Finance asks “do you need all these metrics” and Engineering says “I don’t know what I need to know until I need to know it”. This Sir Humphrey-esque response is caused by not having an adequate model of the system. Teams who have such a model, even informally in their heads, know what they need to see as outputs, and how often. It lets them produce data points showing what they need to know and not bother with what they don’t care about. It reduces their costs. The more they codify these models, the lower the final cost.

When something does go wrong, having a model of the system that isn’t in an engineer’s head makes automated fault detection possible, and opens the door to the possibility of automated or assisted remediation (this leans into the next buzzword the monitoring industry is going to co-opt, Controllablity), all of which reduce time to restore service and allows companies to both meet SLAs and to have tighter SLAs and reduces penalty payouts and makes sales easier as your reputation for reliability increases.

Beyond the immediacy of operational requirements, when you have a model of a system, you can make better predictions about what it will do in the long term. This means predicting the costs of meeting teams’ and customers’ needs. It means the ability to invest when it’s needed and not until needed, and it lets you understand the right amount to invest.

In other words, if we as an industry want to improve reliability, reduce opex, optimise capex and know what is actually happening with our systems as it happens, then our true goal is to build an observable system and that requires two things – monitoring and modelling.

We have monitoring today and we will continue to improve it. We do not have modelling, and no vendor seems interested in providing it right now. Instead we find ourselves witnessing a rerun of the Intel-v-AMD frequency wars in a new space as competing vendors claim they can ingest more data points per second than their competitors (and as happened in the frequency wars, are looking to pivot to a new metric like cardinality as they hit fundamental limits), as if that was what was needed. Or we see them make up a definition of what Observability is and try to leverage that, as if that could help.

As an engineer at the coal face, I’d be happier if they would just build the tools we desperately need instead.


27
Jul 11

Home server build, part one – specifications

So between my laptop and herself’s, we have a fair amount of valuable (to us) data – MSc essays and coursework, book manuscripts, half a gigabyte of open source projects, Phd programming work, wedding photos and video, and about 19 gigabytes of other photos and video, eight gigabytes of target shooting documents and images, half a gigabyte of academic papers… well, you get the idea. So when my laptop hard drive started to hiccup and its SMART report started complaining of bad blocks and imminent failure within 24 hours… well, it prompted some concern 🙂 Most of the important data was backed up on my server (which is not just off-site, but out of the country) using rsnapshot, but there’s nothing like an incipient disaster to make you review your disaster recovery protocols 😀

Besides, I had been planning for some time to offload the bulk data storage (video files and so forth) to a NAS, and while buying an off-the-shelf NAS box is certainly an option:

  • building your own can give you more capability for less outlay;
  • building your own allows you more functionality than just NAS storage – in this case, I had a few other tasks in mind for this box;
  • building your own is something every sysadmin should do for CPD if nothing else. 🙂

So, what’s the specification? The list of tasks is fairly straightforward to start with:

  • NAS storage
  • Central print server (and scanner at some point)
  • Backups of both laptops and off-site storage of those backups
  • Central downloading server for bittorrents and so on
  • Media server

None of these need much in the way of CPU oomph, though we would need multiple cores as the storage will be some form of software RAID array (and having multiple cores would give more performance than a faster single-core, at least for a given amount of outlay). Not a huge amount of RAM is needed either. But we do need a number of disk interfaces, gigabyte ethernet (and the central router for the network has since been upgraded from the standard ISP’s Zyxel to a gigabyte ethernet Netgear router), and if this is all on the motherboard, so much the better.

And obviously, outlay’s an issue as well. Buying an off-the-shelf NAS box (a Synology DS411+) and stocking it with disks (4x Hitachi 2Tb Deskstar drives) would cost approximately €960 (priced on scan.co.uk) so the goal was to get in below that threshold.

First off, the CPU. I’m going with the AMD Athlon II X4 645. It’s a quad core processor but quite cheap. It’s a socket AM3 processor, which leads us to the selection of the motherboard, and I’ve chosen a Gigabyte GA-880GA-UD3H, which is the cheapest Gigabyte AM3 board which had six SATA ports and Gigabyte ethernet ports. Add in a fairly cheap 4Gb of RAM (and pause to remember the time back in ’97 just after graduation when we watched with some degree of awe when the TCD sysadmins showed off a whole gigabyte of RAM, which cost about three months disposable income…) and a fairly standard Artic Cooling Freezer 7 for the CPU and that’s the guts of the thing.

Now for storage. The idea here is a degree of future-proofing, and a lot of I-never-want-to-lose-any-data-or-have-much-downtime 😀 So, the OS is going on two 320Gb Seagate Barracuda hard drives in a RAID 1 array. The main data storage will go on four 2TB Hitachi CoolSpin 5K3000 drives, which were chosen after this particularly excellent article from Backblaze about their 135Tb storage pods. And to try to keep stable power lines, an 850W Coolermaster modular PSU. And since this rig will have to last for a while and I hate slicing my hands open on cheap cases, a Lian Li PC-8NB case to mount all of this in, together with an Icy Box IB554SK SATA RAID frame because when the hard drives fail (and all drives will), I don’t want to have to disassemble the entire box to fix the problem (though, yes, if the OS drives I might have to, but you could always add another Icy Box or similar frame – I was watching the outlay myself). And then just to top it off, a DVD RW drive to make installation a bit easier.

[stextbox id=”info”]For those who’ve been asking on Hacker News, yes, the PSU here is over-specified. That’s a deliberate choice, because I’d rather come nowhere near the limits (or even the 50% mark if possible) of the PSU for two reasons – stability of voltage lines and cooling. And yes, the CPU could be an Atom, but I’ve chosen instead to go with something a bit more conventional and old and unfashionable and debugged because this is an infrastructure box and I’m willing to pay a few euros more upfront to avoid spending a week of my time trying to fix it in a years time, or having to buy more new hardware because it turned out we needed it to do something more than it can do right now and an Atom couldn’t cut it. And yes, 320Gb hard drives. Actually, I changed my mind and went with the 500Gb ones after writing this post; they turned out to be 6 pence cheaper than the 320GB model, which I had thought was the cheapest available hard drive not made by McFlakey Inc. They’re not an example of overspecification, they’re an example of consumer electronics pricing…[/stextbox]

So the full list:

Icy Box IB-554SKIcy Box IB-554SK£83.87
Corsair DDR3 XMS Classic4GB (2x2GB) Corsair DDR3 XMS3 Classic, PC3-10666£25.68
Artic Cooling Freezer 7 Pro Arctic Cooling Freezer 7 Pro v2£18.74
850W Coolermaster Silent Pro850W Coolermaster Silent Pro M£109.02
Lian Li PC-8NBLian Li PC-8NB£83.98
Gigabyte GA-880GA-UD3HGigabyte GA-880GA-UD3H£79.98
AMD Athlon II X4 645AMD Athlon II X4 645£77.51
Pioneer DVR-S19LBKPioneer DVR-S19LBK 24x DVD±R£17.92
Seagate Barracuda 320Gb320GB Seagate ST3320413AS Barracuda£57.31
Hitachi Coolspin 5K3000 2Tb2TB Hitachi 0F12117CoolSpin 5K3000£209.95
Total£788.62 (€~€895)

So that’s under the target price, so the order went in on Scan this weekend and is due for delivery in a day or so…

(continued in part two)


20
Jul 10

My interviews at Google

Google LogoSo I’ve now completed the interview process twice with Google (once in 2007 and once in 2010), and while I’m not sure advice from someone not hired after two run-throughs is all that useful, I figured that the more information out there for those undergoing pre-Google-Interview stress, the better, so here’s how it went.

In both cases, I was contacted out of the blue by a Google recruiter. The first time I had been considering looking for a new role and pursued it immediately; the second time I hadn’t been and put off the recruitment process for several months, during which the same recruiter contacted me again twice to follow up. If nothing else, that’s a nice ego boost, but a more cynical mind might be considering the shotgun approach to a narrow recruiting filter and commissions 😀

First, a quick data point, I was applying for an SRE(SA) position on both occasions – Site Reliability Engineer (System Administration), because in most of my roles to date, I’ve been doing both sysadmin and development work and I’ve never seemed to drift towards one pigeonhole or another. SRE(SA) seemed optimal – interesting sysadmin work on large-scale systems and quite a bit of tool-writing to boot. This was decided on between myself and the recruiter, based on the self-assessment form you are given to fill out. I would love to know how they get around illusory superiority and the Dunning-Kruger effect with those forms, especially given the wierd bias they’d have in the dataset from having so many of the best in their fields working there.

Both times, the process proceeded in the same way: Continue reading →