“Observability Engineering” Book Review

A great new book, “Observability Engineering”, came out recently, and I had to jump on reading it. Since it is very closely related to my work, I devoured the pages and read the book (505 pages) in about a day. While doing so, I wrote down some thoughts that I want to share with you today. They may or may not hold true – I am only speaking about the book from my own perspective. Feel free to share your own thoughts!

Overall, the book really resonated with me, and it makes me very happy to see literature being written about this topic. Observability is a relatively novel concept in computing that I think will only become more popular in the future. I’d rate the book 4/5 in general, but it is 5/5 among books on the same topic.

Here are my thoughts.

  • First of all, it is interesting to see tracing used in CI processes to reduce flakiness. But this probably only matters at a huge scale that most companies will never reach – at least I haven’t worked at companies so far where that was the case. It also reminded me of a project that records Kubernetes events as spans. Check it out if you’re interested. I hope to work on distributed tracing projects in the near future; it’s a really exciting topic.
  • Chapters by Slack engineers sometimes felt a bit like an advertisement for Honeycomb. The chapter about telemetry pipelines and their bespoke solutions felt a bit too simplistic because nowadays we have things like Vector, not to mention Filebeat and so on. What’s more, Slack engineers have created their own format for storing spans. A lot of companies seem to suffer from “not invented here” syndrome, and that appears to be the case here. I would be surprised if they don’t migrate to the OpenTelemetry (OTel) data format in the near future.
  • The authors spend a lot of time talking about and praising OTel. Given that traces are just specifically formatted logs, it’s not surprising to see OTel’s popularity. It’s a really exciting project. But we have to keep thinking in terms of events in a system that mutates its state – traces are only one way of expressing those changes in state.
  • The chapters about finding observability allies are enlightening. I had never thought of customer support and other roles as allies that could help instill a culture of observability in a company.
  • The observability maturity model is great and I could foresee it being used extensively.
  • Event-based service level objectives (SLOs) should be preferred to time-based ones because, in distributed systems, partial outages are more common than complete blackouts. With event-based SLOs, you count the good and bad events in a window and divide the number of good events by the total number of events, whereas with time-based SLOs you divide the time during which some threshold was exceeded by the total time in the window. Event-based SLOs also reflect reality better – instead of judging each period of time as either good or bad, they let you tell precisely how much error budget you’ve burned. Somehow, even though I’ve worked with monitoring systems for a long time, these two different points of view had escaped me. I will try to prefer event-based monitoring from now on.
  • At my previous companies, I saw the same bad practices as those outlined in the book. If there are barely any requests in the middle of the night, then one or two failures don’t mean much, and there is no need to alert on those conditions – payment failures in the middle of the night when most of your clients are in one or a few related timezones, for example. What’s more, I have experienced a bunch of symptom-based alerts that don’t scale, such as “RAM/CPU usage is too high”. Just like the authors, I would be in favor of removing them because they are pretty much useless and reminiscent of the old way of using monitoring systems. I guess this ties into the observability maturity model outlined in the book; my anecdotal data says that many companies are still in their infancy in terms of observability.
  • There is a lot of text about arbitrarily wide structured events. In an ideal world, we could deduce the internal state of a service through them, but I believe they are not the be-all and end-all signal – they are just one of many. Unless instrumentation is perfect, events are a lossy compression of your application’s state space. And with too much instrumentation, there is a risk of high storage costs and too much noise. Sometimes this sounds like a solution to a problem that should be solved in other ways – building services with clearer boundaries and less state, or, in other words, reducing sprawling complexity by keeping non-essential complexity to a minimum.
  • I agree with the small section about AIOps (artificial intelligence operations). In general, I feel that it applies to anomaly-based alerting as well. How can computers tell whether some anomaly is bad or not? Instead, we should let computers sift through piles of data and let humans attach meaning to events.
  • I agree with the authors’ arguments about monitoring – it’s a cheap signal that is easy to start with, and in my opinion that’s why so many people rely on it or start with it. The same goes for logs: it is very simple to start emitting them. Distributed tracing takes a lot more effort because you have to think not only about your own state but also about how your service interacts with others. But that’s where the most important observations lie in the cloud-native world.
  • The book is missing a comparison of different types of signals. The authors really drive home the point about arbitrarily wide events, but I feel that isn’t a silver bullet. What about continuous profiling and other emerging signals? Perhaps not surprising given how much the authors talk about this topic on Twitter.
  • The example of how a columnar database works didn’t convince me, and it felt out of place. It probably just needs a better explanation and/or a longer chapter. I would recommend picking up a different book to understand the intricacies of different types of databases.
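To make the difference between the two SLO styles concrete, here is a minimal sketch in Go. This is my own illustration, not from the book – the function names and numbers are made up:

```go
package main

import "fmt"

// eventBasedSLI returns the fraction of good events in a window:
// good events divided by total events.
func eventBasedSLI(good, total int) float64 {
	if total == 0 {
		return 1 // no traffic means nothing was violated
	}
	return float64(good) / float64(total)
}

// timeBasedSLI returns the fraction of time slices in the window
// during which the error threshold was NOT exceeded.
func timeBasedSLI(badSlices, totalSlices int) float64 {
	if totalSlices == 0 {
		return 1
	}
	return float64(totalSlices-badSlices) / float64(totalSlices)
}

func main() {
	// A day with 100,000 requests, 20 of which failed during one quiet
	// 5-minute slice at night.
	// Event-based view: only 20 bad events out of 100,000.
	fmt.Println(eventBasedSLI(100000-20, 100000)) // 0.9998
	// Time-based view with 5-minute slices (288 per day): that single
	// noisy slice marks a whole period as bad, burning more budget.
	fmt.Println(timeBasedSLI(1, 288)) // ~0.9965
}
```

The event-based calculation tells you exactly how much error budget was burned, while the time-based one penalizes the entire slice for a handful of failures during low traffic.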

Of course, my notes here can’t represent all of the book’s content. I’d recommend reading it yourself! It’s really great. Let me know what you think about it in the comments.

Distributed Systems Magic: Groupcache in Thanos

This is a sequel to my previous blog post about trying to migrate to a newer version of the protocol buffers API. In this blog post, I will tell you how we managed to get groupcache into Thanos during the previous LFX mentorship. The team consisted of Akansha Tiwari, Prem Saraswat, and me – it would not have been possible to implement this without the whole team.

To begin, let’s quickly go over what groupcache is and why it solves a few important problems in Thanos.

First of all, it reduces complexity. Typically, key/value stores run as separate processes and require extra setup work. The other two major caching systems supported by Thanos (besides the in-memory cache), Redis and Memcached, both run as separate processes. With groupcache, everything is embedded in the original process itself – a group of processes becomes a distributed cache.

Most importantly, it has a cache-filling mechanism, which means that data is only fetched once. This is probably the key benefit of groupcache, and it matters because of how PromQL queries are typically used: dashboards quite often reuse the same metrics in slightly different expressions. For example, you might have two different queries to show the 50th and 99th percentiles in a panel:

  • histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{cluster=~"$cluster", instance=~"$instance"}[$__rate_interval]))
  • histogram_quantile(0.50, rate(etcd_network_peer_round_trip_time_seconds_bucket{cluster=~"$cluster", instance=~"$instance"}[$__rate_interval]))

Both would hit the same series because they have an identical set of matchers. But because query-frontend uses the full expression as the cache key, the underlying Thanos Query happily goes ahead and executes both queries separately. If the Memcached or Redis cache is empty and both queries are executed in lockstep, the data gets fetched and stored twice:

  • Thanos Store checks whether the needed data is in Memcached/Redis -> no
  • Fetch data from remote object storage
  • Store data in Memcached/Redis

With groupcache, this problem doesn’t occur: every peer in a group knows about every other peer via DNS, and the whole universe of cache keys is consistently divided between those peers. So, if any node in the cluster wants some data, it sends a request to the node responsible for that key. As a result, data is only loaded once and then spread to all peers. This is amazing! 🎉
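The ownership idea can be sketched in a few lines. Real groupcache/galaxycache uses consistent hashing over the peer set so that adding or removing a peer only remaps a fraction of the keys; this toy version (my own, with made-up peer names) uses a plain hash-modulo just to show how every node deterministically agrees on a key’s owner:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickPeer returns the peer that owns a given cache key. Every node
// runs the same computation over the same peer list, so they all agree
// on the owner without any coordination.
func pickPeer(key string, peers []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	// NOTE: plain modulo for illustration only; consistent hashing is
	// what avoids mass remapping when the peer set changes.
	return peers[int(h.Sum32()%uint32(len(peers)))]
}

func main() {
	peers := []string{"store-0", "store-1", "store-2"}
	key := `rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])`

	owner := pickPeer(key, peers)
	fmt.Println(owner)
	// Whichever node needs this key forwards the request to the owner,
	// so the data is loaded from object storage exactly once.
	fmt.Println(pickPeer(key, peers) == owner) // true
}
```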

There is another benefit – mirroring of super-hot items. If one node asks for some key much more often than for other keys, it can keep a copy of that data in its own memory. This avoids key hotspotting!

But enough about the benefits of groupcache – now let’s move to the tricky parts that we’ve encountered during this project.

The original groupcache project is not really maintained anymore, so quite a few forks have sprung up. One of the main drawbacks of the original groupcache that we noticed is forced global state: it maintains a global map of registered groups, and there is no public function for removing a group. If we ever wanted to add dynamic registration of groups, this would be a blocker.

Then, we looked at two other prominent forks – mailgun/groupcache and vimeo/galaxycache. The former is the same as the original except that it adds TTL (time to live) support and item removal. The latter is a completely revamped groupcache that supports dynamic registration of groups. It arguably has a cleaner interface that feels more natural to Go developers, and it even supports loading data via gRPC calls instead of regular HTTP calls. Thus, we went ahead with it.

At the beginning of this project, we reviewed these forks against our requirements. Ideally, there would be no global state. For some reason, we also thought that we wouldn’t need TTL support (expiring keys), but alas, that turned out not to be true. Nowadays, Thanos Store caches not just the contents of the actual objects in remote object storage but also things like lists of items. Since such lists can change dynamically over time, we need some kind of TTL support. We only realized this towards the end of LFX, after we had added end-to-end tests for the groupcache functionality.

So, I took a stab at implementing TTL support in galaxycache: https://github.com/vimeo/galaxycache/pull/25. I implemented it by embedding the TTL information in the internal LRU cache’s keys; the TTL is checked during fetching, and if the key has expired, we remove it from the internal cache. However, the original developers noticed my pull request and suggested following their strategy for implementing TTL without explicitly embedding that data in the LRU keys: the idea is to divide the keyspace into epochs, so that explicit timestamps aren’t needed. This still needs to be done so that we can switch back to vanilla, upstream galaxycache! Help wanted 🙃

Last but not least, Thanos uses regular Go HTTP servers/clients: HTTP/1.1 for unencrypted traffic, and HTTP/2 where applicable – that is, where transport layer security was used and both the server and client supported it. This means that groupcache fetches are quite inefficient without HTTP/2. Here you can find a comparison between them. For that reason, we had to enable HTTP/2 cleartext (H2C for short). I followed this tutorial by mailgun developers to implement it, and I even wrote some benchmarks to test it myself. Local performance is more or less the same, but what’s important is that with HTTP/2 everything happens over a minimal number of TCP connections. Keep in mind that with HTTP/1.1 the number of TCP connections grows quadratically with the number of nodes in a groupcache group, because each node has to communicate with every other node. This results in huge performance gains when separate machines in a groupcache group talk to each other – there is no need to maintain so many different TCP connections.
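A back-of-the-envelope sketch of why the connection count matters in a full mesh – the numbers and the assumption of one connection per concurrent HTTP/1.1 fetch are my own illustration:

```go
package main

import "fmt"

// meshConnections estimates the TCP connections needed by a full mesh
// of n groupcache peers. With HTTP/2, one multiplexed connection per
// peer pair suffices; with HTTP/1.1, each of perPair concurrent
// fetches between a pair can end up holding its own connection.
func meshConnections(n, perPair int) (http2Conns, http11Conns int) {
	pairs := n * (n - 1) / 2 // undirected peer pairs in the mesh
	return pairs, pairs * perPair
}

func main() {
	// 10 store nodes, up to 8 concurrent fetches between each pair.
	h2, h1 := meshConnections(10, 8)
	fmt.Println(h2, h1) // 45 360
}
```

The pair count alone grows quadratically with the number of nodes; HTTP/1.1 then multiplies it further by the level of concurrency, while HTTP/2 multiplexes all of those fetches over the per-pair connections.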

If all of this sounds very nice to you and you would like to try it out, then please install at least version v0.25 of Thanos and follow the documentation here. A huge thanks to everyone involved in this project!!! 🍺 I hope that we will receive some feedback from our users about groupcache!