Starting up GitHub sponsors and some recent postings work

Hello everyone! I am happy to announce that I’ve set up GitHub sponsors on my profile. If you want to support my blog or my work on Thanos/Prometheus, and you have some free money then now you have a way to throw some money at these projects. Let’s see if I will even get one sponsor. I was thinking that maybe I should work on some custom features that could be behind a paywall. Let’s see when I will have some time to work on them.

I haven’t written anything on my blog for quite some time. I think it’s high time I’ve revived it. Writer’s paralysis probably happened to me, so I haven’t posted anything. Somehow I kept thinking about many topics but was afraid of writing about them and clicking “Publish”. But now it’s time to not be afraid and do that 🙂

Probably the most exciting stuff that I have worked on (and still do) recently is postings encoding improvements in Prometheus & Thanos. It’s now possible to specify a custom postings encoder in the Prometheus compactor: https://github.com/prometheus/prometheus/pull/13242. After https://github.com/prometheus/prometheus/pull/13567 it will even be possible to use a custom postings decoder. The postings data structure sits at the core of the Prometheus TSDB – it is used for storing sets of sorted integers. Whenever someone specifies some label matcher in a query e.g. {foo="bar"} then Prometheus goes through the set of series (postings) which have foo="bar" in their labels. So, it is paramount to make this data structure as efficient as possible.

Currently, each integer is simply stored using 4 bytes. It’s possible to be much better than that. For example, if you have a set of integers 1, 2, 3, 4..., 10 then it’s enough to only say that there’s a run of 10 integers starting from 1. Over time, many more techniques for compression were invented.

I have researched what is available and found out that the most popular paper (probably) is this one https://arxiv.org/abs/1401.6399 by Daniel Lemire & others. I love his work in particular because he always puts up the source code for his paper. It’s a huge help! I wish more people had done that.

We have a few constraints in the Thanos/Prometheus world:

  • We should read posting lists only in one direction i.e. we shouldn’t need to read them twice. Some encoding formats force the reader to read twice like the patch frame-of-reference variants. This constraint is needed because we would like to avoid allocating memory for the whole postings list if possible to save a lot of memory. In the Thanos world, the list could be easily hundreds of millions in size.
  • The intersection must be very fast. Prometheus/Thanos will do intersections many more times than encode/decode data. It’s not uncommon to have 3+ label matchers in a single query.

From all of the things I’ve looked at, S4-BP128-D4 and roaring bitmaps look the most promising. The latter is used by a lot of similar projects already like M3DB. The former might be not so popular but it is specifically designed for SIMD which gives us very fast encoding/decoding.

I even started writing a Go version of S4-BP128-D4 but I haven’t finished it, yet: https://github.com/GiedriusS/go-bp. So, I am opting to try roaring bitmaps first. Even then it would be a huge improvement because bitmaps allow VERY fast intersection through the bitwise AND operation. The current intersection algorithm needs to step through each element in given postings.

I recently wrote a small program to compare postings compression on Prometheus index files: https://github.com/thanos-io/postings-analyzer. You can see that it is possible to save around ~70% in postings size using S4-BP128-D4 and ~47% using roaring bitmaps. These numbers were consistent in my tests using index files from production. In my case, this would lead to shaving about 30% of the whole index file. Of course, most notably my index files didn’t have any runs of numbers so run-length encoding wasn’t used in roaring bitmaps, and so one could argue that I don’t have a diverse data set in these tests. Perhaps there is some weird setup out there where RLE would be useful? I tried to gather sample index files on CNCF Slack to no avail – no one stepped up to upload them for me.

Either way, all of this work is very promising and I hope to have a feature flag in Thanos soon which would allow using roaring bitmaps!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.