Distributed Systems Magic: K-way Merge in Thanos

Oops, it has been such a long time since I wrote a post that I even accidentally forgot to renew my site! That’s bad. I definitely owe you a few new posts. Thus, here is another post in the distributed systems magic series.

Recently I undertook the task of improving the proxying logic in Thanos. If you are not familiar with it: when the Query component of Thanos gets a PromQL query from users, it sends out requests via gRPC to leaf nodes to retrieve the needed data. This article will show you how it worked previously, how it works now, and the learnings from doing this.

It assumes some level of familiarity with Go, Thanos, and Prometheus.

How Proxying Logic Works Currently (v0.28.0 and before)

Since the old merging code compares the current element of each leaf node's stream against all of the others to pick the next series to return, the complexity of this is Θ(kn), where k is the number of leaf nodes and n is the total number of series.
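To make that cost concrete, here is a minimal, hypothetical Go sketch of such a merge; the iterator interface is made up for illustration and is not the real Thanos type. Every emitted series requires a linear scan over all k heads, which is where the Θ(kn) comes from.

```go
// seriesIterator is a hypothetical stand-in for a leaf node's stream of
// sorted series (labels flattened into a sortable string for simplicity).
type seriesIterator interface {
	At() string // current series
	Next() bool // advance; false when exhausted
}

// naiveMerge illustrates the Θ(kn) approach: every emitted series costs a
// linear scan over all k iterators to find the smallest head.
func naiveMerge(iters []seriesIterator) []string {
	var out []string
	active := make([]bool, len(iters))
	for i, it := range iters {
		active[i] = it.Next()
	}
	for {
		minIdx := -1
		for i, it := range iters { // scan all k heads on every step
			if active[i] && (minIdx == -1 || it.At() < iters[minIdx].At()) {
				minIdx = i
			}
		}
		if minIdx == -1 {
			return out // every iterator is exhausted
		}
		out = append(out, iters[minIdx].At())
		active[minIdx] = iters[minIdx].Next()
	}
}
```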

We can improve this, first of all, by using a different data structure. It is easy to see that a direct k-way merge is not the most efficient approach: there is no need to compare everything on every step. Another problem is that *mergedSeriesSet internally uses a buffered Go channel, which means that we cannot receive messages from leaf nodes as quickly as possible. Let's fix both.

How Proxying Logic Works Now (v0.29.0 and after)

First of all, let's start using a data structure that allows doing a k-way merge much faster. There are different options here, but since Go already has a pretty good heap implementation in the standard library (container/heap), we've decided to use that. With this, we have moved to logarithmic complexity: O(n log k) instead of Θ(kn).
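As a rough, hedged illustration of the idea (not the actual Thanos code, which merges storepb.SeriesResponse messages and also deduplicates), a k-way merge on top of Go's container/heap looks roughly like this:

```go
package main

import (
	"container/heap"
	"fmt"
)

// stream is a simplified stand-in for a leaf node's sorted response stream.
type stream struct {
	items []string // series, flattened into sortable strings for simplicity
	pos   int
}

func (s *stream) head() string { return s.items[s.pos] }

// streamHeap keeps the stream with the smallest current head at the root.
type streamHeap []*stream

func (h streamHeap) Len() int           { return len(h) }
func (h streamHeap) Less(i, j int) bool { return h[i].head() < h[j].head() }
func (h streamHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *streamHeap) Push(x any)        { *h = append(*h, x.(*stream)) }
func (h *streamHeap) Pop() any {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// kWayMerge emits the global minimum and then restores the heap, so each of
// the n merged elements costs O(log k) instead of O(k).
func kWayMerge(streams []*stream) []string {
	h := streamHeap{}
	for _, s := range streams {
		if len(s.items) > 0 {
			h = append(h, s)
		}
	}
	heap.Init(&h)

	var out []string
	for h.Len() > 0 {
		s := h[0] // stream with the smallest current head
		out = append(out, s.head())
		s.pos++
		if s.pos == len(s.items) {
			heap.Pop(&h) // stream exhausted, drop it from the heap
		} else {
			heap.Fix(&h, 0) // head changed, sift it back into place
		}
	}
	return out
}

func main() {
	fmt.Println(kWayMerge([]*stream{
		{items: []string{"a", "d"}},
		{items: []string{"b", "c", "e"}},
	}))
	// Prints: [a b c d e]
}
```

Only the advancing stream has to be sifted back into place after each step, which is where the log k factor comes from.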

Another major performance bottleneck is the usage of a channel. I vaguely remember a meme on Twitter that went something along these lines: “Go programmers: use channels. Performance matters? Don’t use channels”. In the new version, the proxying logic uses time.Timer to implement cancelation of a Recv() (a function that blocks until a new message is available in a gRPC stream) that takes too long. Please see this awesome post by Povilas to learn more about timeouts in Thanos. This means that channels are no longer needed.
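The exact wiring in Thanos is more involved, but the core trick can be sketched like this; the frame type and the function's shape below are assumptions for illustration, not the real API:

```go
import (
	"context"
	"time"
)

// frame stands in for the gRPC message type (SeriesResponse in Thanos).
type frame struct{}

// recvWithTimeout is a hedged sketch, not the real Thanos code: it arms a
// timer before the blocking Recv() and cancels the stream's context if one
// frame takes longer than frameTimeout, so no buffered channel is needed.
func recvWithTimeout(
	cancel context.CancelFunc, // cancels the whole gRPC stream's context
	recv func() (*frame, error), // the blocking stream.Recv()
	frameTimeout time.Duration,
) (*frame, error) {
	t := time.AfterFunc(frameTimeout, cancel)
	defer t.Stop()
	// If the timer fires, the pending recv() returns with a context error
	// instead of hanging forever.
	return recv()
}
```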

Finally, there has been a long-standing problem with the PromQL engine: it is not lazy. Ideally, all of the retrieved data should be passed directly to the PromQL engine as fast as possible while it iterates through series and steps. However, right now everything is buffered in memory first, and only then is the query executed. Hence, this refactoring should also allow us to easily switch between lazy and eager logic in the proxying layer.

Here is a graphic showing the final result:

respSet is a container of responses that can be retrieved either lazily or eagerly. dedupResponseHeap implements the heap interface on top of these containers. This way, the retrieval strategy is hidden inside the container, and the code stays tidy because the heap on top does not care about those details.
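Roughly, the shape is something like the following; the method signatures are my approximation for illustration, not the exact Thanos definitions:

```go
// respSet hides whether responses were buffered eagerly up front or are
// pulled lazily from the underlying gRPC stream; the heap on top only asks
// "what is your current response?" and "move on to the next one".
// (Approximation, not the exact Thanos interface.)
type respSet interface {
	At() *storepb.SeriesResponse // current response from this leaf node
	Next() bool                  // advance; false once the stream is done
	Close() error
}

// dedupResponseHeap implements heap.Interface over a slice of respSets,
// ordering them by the labels of their current response so that duplicate
// series from different leaf nodes surface next to each other.
// (Len/Less/Swap/Push/Pop omitted here.)
type dedupResponseHeap struct {
	sets []respSet
}
```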

Visually this looks similar to the previous graphic, but we are now actually taking proper advantage of the tree-like structure.

Learnings

I have learned lots of things while implementing this.

First of all, looking at something with fresh eyes and lots of experience maintaining code brings a new perspective. I remember someone saying that if, after a year or two, you can look at your old code and not see many (or any) potential improvements, then you aren't really advancing as a developer. I reckon that is kind of true. The same applies here: after some time and some thinking, it is easy to spot improvements that could be made.

Another thing is that benchmarks are sometimes not valued as much as they should be. Go has very nice tooling for writing benchmarks, and we should write them more often. However, my personal experience shows that companies and teams usually focus on new features, and benchmarks/performance become a low-priority item. My friend Bartek is working on a book, Efficient Go, that has great information and learnings about benchmarking. I recommend reading it once it comes out.
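For reference, a Go benchmark is just a function in a _test.go file, run with `go test -bench=. -benchmem`. Here is a hypothetical one for the merge sketched earlier; buildTestStreams is a made-up helper, not a real function:

```go
import "testing"

// BenchmarkKWayMerge is a hypothetical benchmark for the merge sketched
// earlier; buildTestStreams is a made-up helper that creates k leaf streams
// with the given number of series each.
func BenchmarkKWayMerge(b *testing.B) {
	for i := 0; i < b.N; i++ {
		b.StopTimer()
		streams := buildTestStreams(100, 10_000) // merging consumes the streams, so rebuild them
		b.StartTimer()
		kWayMerge(streams)
	}
}
```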

Finally, I think all of this shows that bottlenecks can sometimes hide in unexpected places. In this case, one of the bottlenecks was buried deep inside a struct's method: a channel was used to implement per-operation timeouts. This is another argument in favor of benchmarking.

Now, even with all of this, the proxying layer is still not fully lazy because there are wrappers around it that buffer responses, but that is for another post. With the improvements outlined here, it should be quite trivial to fully stream responses.

Thanks to Filip Petkovski and Bartek Plotka for reviewing & testing!

“Observability Engineering” Book Review

A great new book, “Observability Engineering”, came out very recently, and I had to jump on reading it. Since it is very closely related to my work, I devoured its 505 pages in about a day. While doing so, I wrote down some thoughts that I want to share with you today. They might or might not be true; I am only speaking about the book from my own perspective. Feel free to share your own thoughts!

Overall, the book really resonated with me, and it makes me very happy to see literature being written about this topic. Observability is a relatively novel concept in computing that I think will only become more popular in the future. I'd rate the book 4/5 in general, but 5/5 among books on the same topic.

Here are my thoughts.

  • First of all, it is interesting to see tracing used in CI processes to reduce flakiness. But this probably only matters at a huge scale that most companies will never reach; at least I haven't worked at companies where that was the case so far. It also reminded me of a project that exposes Kubernetes events as spans. Check it out if you're interested. I hope to work on distributed tracing projects in the near future; it's a really exciting topic.
  • Chapters by Slack engineers sometimes felt a bit like an advertisement for Honeycomb. The chapter about telemetry pipelines and their bespoke solutions felt a bit too simplistic, because nowadays we have things like Vector, not to mention Filebeat and so on. What's more, Slack engineers have created their own format for storing spans. A lot of companies nowadays seem to suffer from the “not invented here” syndrome, and that appears to be the case here as well. I would be surprised if they don't migrate to the OpenTelemetry (OTel) data format in the near future.
  • The authors spend a lot of time talking about and praising OTel. Given that traces are specifically formatted logs, it's not surprising to see OTel's popularity. It's a really exciting project. But we have to keep thinking in terms of events in a system that mutates its state; traces are only one way of expressing those changes in state.
  • The chapters about finding observability allies are enlightening. I have never thought about customer support and other people as allies that could help one instill a culture of observability in a company.
  • The observability maturity model is great and I could foresee it being used extensively.
  • Event-based service level objectives (SLOs) should be preferred to time-based ones because, with distributed systems, partial outages are more common than complete blackouts. With event-based SLOs, you count the good and bad events in a window and divide the number of good events by the total number of events, whereas with time-based SLOs you divide the time during which some threshold was exceeded by the length of the window. Event-based SLOs also reflect reality better: instead of judging each period of time as either good or bad, they let you tell precisely how much error budget you have burned (a small worked example follows after this list). Somehow, even though I've worked with monitoring systems for a long time, these two different points of view escaped me. I will always try to prefer event-based monitoring now.
  • At my previous companies, I saw the same bad practices as those outlined in the book. If there are barely any requests in the middle of the night, then one or two failures don't mean much, and there is no need to alert on those conditions. I am talking about payment failures in the middle of the night when most of your clients are in one or several related timezones, for example. What's more, I have experienced a bunch of symptom-based alerts that don't scale, such as “RAM/CPU usage is too high”. Just like the authors, I would be in favor of removing them because they are pretty much useless and are reminiscent of the old way of using monitoring systems. I guess this ties back to the observability maturity model outlined in the book. My anecdotal data says that many companies are still in their infancy in terms of observability.
  • There is a lot of text about arbitrarily wide structured events. In an ideal world, we could deduce the internal state of a service through them, but I don't believe they are the be-all and end-all signal. They are just one of many. If instrumentation is not perfect, then it is a compression of the state space of your application. And with too much instrumentation, there is a risk of high storage costs and too much noise. Sometimes it sounds like a solution to a problem that should be solved in other ways: making services with clearer boundaries and less state. Or, in other words, reducing sprawling complexity by keeping non-essential complexity to a minimum.
  • I agree with the small section about AIOps (artificial intelligence for IT operations). In general, I feel that it applies to anomaly-based alerting as well. How can computers tell whether some anomaly is bad or not? Instead, we should let computers sift through piles of data and let humans attach meaning to events.
  • I agree with the authors' arguments about monitoring – again, I believe it's a cheap signal that is easy to start with, and in my opinion that's why so many people rely on it or start with it. It is the same with logs: it is very simple to start emitting them. Distributed tracing takes a lot more effort because you have to think not only about your own state but also about how your service interacts with others. But that's where the most important observations lie in the cloud-native world.
  • The book is missing a comparison of different types of signals. The authors really drive home the point about arbitrarily wide events, but I don't feel that they are a silver bullet. What about continuous profiling and other emerging signals? This is probably not surprising given how much the authors talk about this topic on Twitter.
  • The example of how a columnar database works didn’t convince me and it felt out of place. It probably just needs a better explanation and/or a longer chapter. I would probably recommend you pick up a different book to understand the intricacies of different types of databases.
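
To make the event-based vs. time-based distinction from the SLO bullet above concrete, here is a tiny sketch with made-up numbers; the window size and counts are purely illustrative:

```go
package main

import "fmt"

func main() {
	// Made-up numbers for a 30-day window.
	good, total := 998_500.0, 1_000_000.0 // requests in the window
	eventSLI := good / total              // event-based: 0.9985, i.e. 99.85% of requests were good

	// Time-based: whole minutes are judged either good or bad, depending on
	// whether some threshold was exceeded during that minute.
	badMinutes, windowMinutes := 43.0, 30*24*60.0
	timeSLI := 1 - badMinutes/windowMinutes // ~0.9990

	fmt.Printf("event-based SLI: %.4f, time-based SLI: %.4f\n", eventSLI, timeSLI)
}
```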

Of course, my notes here can't represent all of the content of the book. I'd recommend reading it yourself! It's really great. Let me know what you think about it in the comments.