Distributed Systems Magic: Groupcache in Thanos

This is a sequel to my previous blog post about trying to migrate to a newer version of the protocol buffers API. In this blog post, I will tell you how we managed to get groupcache into Thanos during the previous LFX mentorship. The team consisted of Akansha Tiwari, Prem Saraswat, and me. It would not have been possible to implement this without the whole team.

To begin, let's quickly go over what groupcache is and why it solves a few important problems in Thanos.

First of all, it reduces complexity. Typically, key/value stores run as separate processes and require extra work to set up. The other two major caching systems supported by Thanos (besides the in-memory cache), Redis and Memcached, both run as separate processes. With groupcache, everything is embedded in the original process itself. Hence, a group of processes becomes a distributed cache.

Most importantly, it has a cache-filling mechanism, which means that data is only fetched once. This is probably the key benefit of groupcache, and it matters a lot because of how PromQL queries are typically used. Quite often dashboards reuse the same metrics in slightly different expressions. For example, you might have two different queries to show the 50th and 99th percentiles in a panel:

  • histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{cluster=~"$cluster", instance=~"$instance"}[$__rate_interval]))
  • histogram_quantile(0.50, rate(etcd_network_peer_round_trip_time_seconds_bucket{cluster=~"$cluster", instance=~"$instance"}[$__rate_interval]))

Both would hit the same series because they have an identical set of matchers. But because query-frontend uses the full expression as the cache key, the underlying Thanos Query happily goes ahead and executes both queries separately. If the Memcached or Redis cache is empty and both queries are executed in lockstep, the data gets fetched and stored twice:

  • Thanos Store checks whether the needed data is in Memcached/Redis -> no
  • Fetch the data from remote object storage
  • Store the data in Memcached/Redis

With groupcache, this problem doesn't occur: every peer in a group knows about every other peer via DNS, and the whole universe of keys in the cache is consistently divided between those peers. So, if any node in the cluster wants some data, it sends a request to the node responsible for that key. As a result, the data is only loaded once and then spread to all peers. This is amazing! 🎉
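
Internally this relies on a "single flight" mechanism: concurrent requests for the same key are coalesced so the expensive load only runs once. The original groupcache ships its own singleflight package; the standalone golang.org/x/sync/singleflight package implements the same idea, so here is a tiny, hedged sketch of the concept (not Thanos code):

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"

        "golang.org/x/sync/singleflight"
    )

    func main() {
        var g singleflight.Group
        var wg sync.WaitGroup
        var fetches int64

        // Ten goroutines ask for the same key at the same time; the expensive
        // load function runs only once and its result is shared between them.
        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                v, _, _ := g.Do("series:etcd_network_peer_round_trip_time_seconds_bucket", func() (interface{}, error) {
                    atomic.AddInt64(&fetches, 1) // in real life: fetch postings/chunks from object storage
                    return []byte("chunk data"), nil
                })
                _ = v
            }()
        }
        wg.Wait()
        fmt.Println("object storage fetches:", atomic.LoadInt64(&fetches)) // 1 when the calls overlap
    }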

There is another benefit of this: mirroring of super-hot items. If one node asks for some key much more often than for other keys, it can keep a copy of that data in its own memory. This avoids key hotspotting!

But enough about the benefits of groupcache – now let's move on to the tricky parts that we encountered during this project.

The original groupcache project is not really maintained anymore, so quite a few forks have sprung up. One of the main drawbacks of the original groupcache that we noticed is forced global state: it maintains a global map of registered groups and there is no public function for removing a group. If we ever wanted to add dynamic registration of groups in the future, this would be a blocker.

Then, we’ve looked at two other prominent forks – mailgun/groupcache and vimeo/galaxycache. The former is the same as the original one except that it has TTL (time to live) support and item removal support. The latter is a completely revamped groupcache that support the dynamic registration of groups. It arguably has a cleaner, more natural interface to Go developers. It even has support for loading data via gRPC calls instead of regular HTTP calls. Thus, we went ahead with it.

At the beginning of this project, we reviewed these forks against our requirements. Ideally, there would be no global state. For some reason we also thought that we wouldn't need TTL support (expiring keys), but alas, that's not true. Nowadays Thanos Store caches not just the content of the actual objects in remote object storage but also things such as the list of items. Since such things can change dynamically over time, we need some kind of TTL support. We only realized this towards the end of LFX, after we had added end-to-end tests for the groupcache functionality.

So, I took a stab at implementing TTL support in galaxycache: https://github.com/vimeo/galaxycache/pull/25. I implemented it by embedding the TTL information in the internal LRU cache's keys. The TTL is checked during fetching: if a key has expired, we remove it from the internal cache. However, the original developers noticed my pull request and suggested following their strategy of implementing TTL without explicitly embedding that data in the LRU keys. The idea is to divide the keyspace into epochs, so we wouldn't need explicit timestamps at all. This still needs to be done so that we can switch back to vanilla, upstream galaxycache! Help wanted 🙃
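
To illustrate the first approach in isolation, here is a much-simplified sketch of a TTL that is checked lazily at fetch time. It is not the actual galaxycache internals or the code from that pull request; it only shows the behaviour described above:

    package ttlcache

    import "time"

    // entry couples a cached value with its expiration deadline.
    type entry struct {
        value    []byte
        deadline time.Time
    }

    // Cache is a toy TTL cache; galaxycache keeps this information inside its LRU instead.
    type Cache struct {
        items map[string]entry
    }

    func New() *Cache { return &Cache{items: map[string]entry{}} }

    func (c *Cache) Set(key string, value []byte, ttl time.Duration) {
        c.items[key] = entry{value: value, deadline: time.Now().Add(ttl)}
    }

    // Get checks the deadline on every fetch: expired entries are evicted on
    // access, which is the "check during fetching" behaviour described above.
    func (c *Cache) Get(key string) ([]byte, bool) {
        e, ok := c.items[key]
        if !ok {
            return nil, false
        }
        if time.Now().After(e.deadline) {
            delete(c.items, key)
            return nil, false
        }
        return e.value, true
    }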

Last but not least, Thanos uses regular Go HTTP servers and clients: HTTP/1.1 for unencrypted traffic, and HTTP/2 where applicable, i.e. where transport layer security is used and both the server and the client support it. This means that groupcache fetches are quite inefficient without HTTP/2. Here you can find a comparison between them. For that reason, we had to enable HTTP/2 cleartext (H2C for short). I followed this tutorial by the Mailgun developers to implement it. I even wrote some benchmarks to test it myself. Local performance is more or less the same, but what's important is that with HTTP/2 everything happens over a minimal number of TCP connections. Keep in mind that with HTTP/1.1 the number of TCP connections grows quadratically with the number of nodes in a groupcache group, because each node has to communicate with every other node. This results in huge performance gains when separate machines in a groupcache group talk to each other: there is no need to maintain so many different TCP connections.
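
For the curious, here is a minimal sketch of what enabling H2C in Go looks like with the golang.org/x/net/http2 packages. This is the general pattern from that tutorial rather than the exact Thanos code, and the address and handler are made up:

    package main

    import (
        "crypto/tls"
        "log"
        "net"
        "net/http"
        "time"

        "golang.org/x/net/http2"
        "golang.org/x/net/http2/h2c"
    )

    func main() {
        mux := http.NewServeMux()
        // In Thanos, the groupcache peer handler would be mounted here instead.
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("served over " + r.Proto))
        })

        // Server side: wrap the handler so HTTP/2 is accepted over plain TCP (no TLS).
        srv := &http.Server{
            Addr:    "localhost:8080",
            Handler: h2c.NewHandler(mux, &http2.Server{}),
        }
        go func() { log.Println(srv.ListenAndServe()) }()
        time.Sleep(100 * time.Millisecond) // crude wait for the listener in this toy example

        // Client side: force HTTP/2 without TLS by dialing a plain TCP connection
        // where the transport would normally perform a TLS handshake.
        client := &http.Client{
            Transport: &http2.Transport{
                AllowHTTP: true,
                DialTLS: func(network, addr string, _ *tls.Config) (net.Conn, error) {
                    return net.Dial(network, addr)
                },
            },
        }
        resp, err := client.Get("http://localhost:8080/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println("negotiated protocol:", resp.Proto) // HTTP/2.0
    }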

If all of this sounds nice to you and you would like to try it out, please install at least version v0.25 of Thanos and follow the documentation here. A huge thanks to everyone involved in this project!!! 🍺 I hope that we will receive some feedback from our users about groupcache!

Things Learned From Trying to Migrate To Protobuf V2 API from gogoprotobuf (So Far)

During the most recent iteration of the LFX mentorship program, I had the honor of working on trying to migrate the Thanos project from gogoprotobuf to version 2 of the protobuf API, together with my one and only awesome mentee Rahul Sawra and another mentor, Lucas Serven, who is also a co-maintainer of Thanos. I wanted to share my technical learnings from this project.

LFX mentorship program’s logo

First of all, let's quickly look at what protocol buffers are and what the different bits of jargon mean. Protocol buffers are a way of serializing data, originally created at Google. It is a quite popular library that is used practically everywhere in Thanos. Thanos also uses gRPC to talk between different components. gRPC is a remote procedure call framework: with it, your (micro-)services can implement methods that can be called by others.
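
As a tiny illustration of "serializing data" (using one of the well-known wrapper types shipped with the Go protobuf module, not a Thanos message), a typed message gets marshaled into compact bytes and decoded back on the other side:

    package main

    import (
        "fmt"

        "google.golang.org/protobuf/proto"
        "google.golang.org/protobuf/types/known/wrapperspb"
    )

    func main() {
        // Marshal a typed message into the compact binary wire format...
        msg := wrapperspb.String("hello from Thanos")
        b, err := proto.Marshal(msg)
        if err != nil {
            panic(err)
        }

        // ...and decode those bytes back into a message on the other side.
        var out wrapperspb.StringValue
        if err := proto.Unmarshal(b, &out); err != nil {
            panic(err)
        }
        fmt.Println(out.GetValue(), "-", len(b), "bytes on the wire")
    }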

Since both were originally made by Google, it is not surprising that gRPC is most commonly used with protocol buffers, even though there is no hard dependency between them.

gogoprotobuf is a fork of the original protocol buffers compiler for Go that has (had?) some improvements over the old one. However, it does not come without downsides. We have accumulated random hacks over time to make the generated code compile and work. For example, we edit import statements with sed. This looks like an opportunity for improving code generation tools – perhaps more checks are needed? What's more, it turns the whole code generation setup into a house of cards: remove one small part and the whole thing crumbles. But, on the other hand, it is not surprising that an unmaintained tool has a bug here and there.

Thanos started using gogoprotobuf at some point in the past, but after a while the project became unmaintained. Then the fine Vitess folks came up with their own protocol buffers compiler for the V2 API, vtprotobuf, which has some nice optimizations that bring it up to par with the old gogoprotobuf performance. In addition, it supports pooling memory on the unmarshaling side, i.e. the receiver. The sender's side, unfortunately, still cannot use pooling because gRPC's SendMsg() method returns before the message gets on the wire. I feel like this is a serious performance issue and I'm surprised that the gRPC developers still haven't fixed it. This is the first learning that I wanted to share.
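
To show the receiver-side pooling idea in the most generic terms, here is a sketch built on the standard library's sync.Pool with a well-known type as a stand-in; it is not the generated vtprotobuf pooling API:

    package main

    import (
        "sync"

        "google.golang.org/protobuf/proto"
        "google.golang.org/protobuf/types/known/timestamppb"
    )

    // pool hands out reusable messages so that repeated unmarshaling on the
    // receiving side does not allocate a fresh struct every time.
    var pool = sync.Pool{
        New: func() interface{} { return new(timestamppb.Timestamp) },
    }

    func decode(b []byte) (*timestamppb.Timestamp, error) {
        m := pool.Get().(*timestamppb.Timestamp)
        proto.Reset(m) // clear any state left over from the previous user
        if err := proto.Unmarshal(b, m); err != nil {
            pool.Put(m)
            return nil, err
        }
        return m, nil
    }

    // release must be called once the caller is done with the message.
    func release(m *timestamppb.Timestamp) { pool.Put(m) }

    func main() {
        b, _ := proto.Marshal(timestamppb.Now())
        if m, err := decode(b); err == nil {
            release(m)
        }
    }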

Another thing is about copying generated code. Sometimes the generated code is not perfect. The easiest and most straightforward way to fix that is to copy the generated code, change the parts that you don't like, and commit the result to Git. However, that is certainly far from perfect. We made this mistake in the Thanos project: we copied a generated struct and its methods to another file and added our own optimization. We call it ZLabelSet; here is its code. As you can see, it is an optimization to avoid an allocation. However, this way the struct members of the generated code became deeply coupled with the rest of this custom code. It is now much more painful to change the types of those members, which have effectively become an interface, because the v2 API does not support non-nullable struct members.
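
To give a flavour of the kind of allocation-avoiding trick involved (a hypothetical, simplified sketch, not the actual ZLabelSet code), such optimizations typically reinterpret one type as a structurally identical one instead of copying, which is exactly what couples the custom code so tightly to the generated struct layout:

    package main

    import (
        "fmt"
        "unsafe"
    )

    // GeneratedLabel stands in for a label struct produced by the protobuf compiler.
    type GeneratedLabel struct {
        Name, Value string
    }

    // PromLabel stands in for the equivalent type from another package.
    type PromLabel struct {
        Name, Value string
    }

    // toGenerated reinterprets the slice in place, so no new slice or strings are
    // allocated. It is only safe while both structs have exactly the same memory
    // layout, which is why any change to the generated struct breaks such code.
    func toGenerated(ls []PromLabel) []GeneratedLabel {
        return *(*[]GeneratedLabel)(unsafe.Pointer(&ls))
    }

    func main() {
        promLabels := []PromLabel{{Name: "cluster", Value: "eu-1"}}
        fmt.Println(toGenerated(promLabels)[0].Value) // eu-1, without copying
    }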

On the other hand, using interfaces in Go incurs extra performance costs, so don't try to optimize too heavily. Profile, and always pick your battles.

This is the second lesson: please try not to copy generated code; instead, write your own protocol buffers compiler plugin or something similar. It is actually quite easy to do.
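
For example, a skeleton plugin built on the v2 API's protogen package really is only a couple of dozen lines. The protoc-gen-custom name and the emitted helpers below are made up, but the overall structure is the standard one:

    // A minimal skeleton for a custom protoc plugin (hypothetically named
    // protoc-gen-custom), built on the v2 API's protogen helper package.
    package main

    import "google.golang.org/protobuf/compiler/protogen"

    func main() {
        protogen.Options{}.Run(func(gen *protogen.Plugin) error {
            for _, f := range gen.Files {
                if !f.Generate {
                    continue // only touch files protoc asked us to generate
                }
                // Emit one extra file next to the regular *.pb.go output.
                g := gen.NewGeneratedFile(f.GeneratedFilenamePrefix+"_custom.pb.go", f.GoImportPath)
                g.P("// Code generated by protoc-gen-custom. DO NOT EDIT.")
                g.P("package ", f.GoPackageName)
                g.P()
                for _, m := range f.Messages {
                    // Instead of copying generated code, emit the customizations here.
                    g.P("// TODO: custom helpers for ", m.GoIdent.GoName)
                }
            }
            return nil
        })
    }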

Last but not least, I also wanted to talk about goals and focus. Since we divided the whole project into as many small parts as possible, the main focus was on getting the existing test suite to pass. However, that is not always the best idea. We ran into a problem: gogoprotobuf has an extension that lets structs use a type more natural for Go programmers, time.Time, but the same extension doesn't exist in vanilla protocol buffers for Go, which has its own separate type, timestamppb.Timestamp. Because the use of timestamps is littered all over the Thanos codebase, we accidentally defined a bunch of conversion functions between those two types, and they weren't identical. So, we had to take a step back and look at the invariants. To be more exact, time.Time represents an instant together with a location (and possibly a monotonic clock reading), whereas the protobuf Timestamp stores the seconds and nanoseconds elapsed since the Unix epoch, in UTC. Only after unifying the conversion functions did everything work correctly. Keep in mind that those "small" parts of this project are thousands of lines added or removed, so it's really easy to get lost. For example, this is one pull request that got merged:

Screenshot of https://github.com/rahulii/thanos/pull/1 showing the diffstat
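
For reference, this is what the conversion looks like with the v2 API's well-known timestamppb package (a sketch, not the actual Thanos helper functions):

    package main

    import (
        "fmt"
        "time"

        "google.golang.org/protobuf/types/known/timestamppb"
    )

    func main() {
        now := time.Now()

        // time.Time -> protobuf Timestamp: seconds and nanoseconds since the Unix epoch.
        ts := timestamppb.New(now)

        // protobuf Timestamp -> time.Time: always returned in UTC.
        back := ts.AsTime()

        fmt.Println(now.UTC().Equal(back)) // true: the round trip preserves the instant
    }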

In conclusion, the third, more general learning is that sometimes it is better to take a step back and look at how everything should work together instead of fixating on one small part.

Perhaps in the future code generation will be partly replaced by generics, introduced in Go 1.18 and refined in later Go versions. That should make life easier. I also hope that we will pick up this work again soon and that I will be able to announce that we have finally switched to the upstream version of protocol buffers for Go. It seems like there is an appetite for that in our community, so the future looks bright. We've already removed the gogoproto extensions from our .proto files and we are in the middle of removing the gogoproto compiler – https://github.com/rahulii/thanos/pull/2. Someone just needs to finish all of this up, and to start using the pooling functionality in Thanos Query. Will it be you who helps us finish this work? 😍🍻