Starting up GitHub sponsors and some recent postings work

Hello everyone! I am happy to announce that I’ve set up GitHub Sponsors on my profile. If you want to support my blog or my work on Thanos/Prometheus and have some money to spare, you now have a way to throw it at these projects. Let’s see if I will get even one sponsor. I have been thinking that maybe I should work on some custom features that could sit behind a paywall – let’s see when I will have some time to work on them.

I haven’t written anything on my blog for quite some time, and I think it’s high time I revived it. I probably had a case of writer’s paralysis – I kept thinking about many topics but was afraid of writing them up and clicking “Publish”. Now it’s time to stop being afraid and do just that 🙂

Probably the most exciting thing I have been working on recently (and still am) is postings encoding improvements in Prometheus & Thanos. It is now possible to specify a custom postings encoder in the Prometheus compactor: https://github.com/prometheus/prometheus/pull/13242. After https://github.com/prometheus/prometheus/pull/13567 it will even be possible to use a custom postings decoder. The postings data structure sits at the core of the Prometheus TSDB – it is used for storing sets of sorted integers (series IDs). Whenever someone specifies a label matcher in a query, e.g. {foo="bar"}, Prometheus walks the set of series (postings) that have foo="bar" in their labels. So, it is paramount to make this data structure as efficient as possible.
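To illustrate what postings are, here is a deliberately simplified, hypothetical Go sketch – the real TSDB index lives on disk and is much more involved, but the idea is an inverted index from a label pair to the sorted IDs of the series carrying that pair:

package main

import "fmt"

// postings maps a label name/value pair to the sorted IDs of the series
// that contain that pair (a toy, in-memory stand-in for the TSDB index).
var postings = map[string][]uint64{
	`foo="bar"`:  {1, 3, 7, 9},
	`job="node"`: {3, 4, 9},
}

func main() {
	// A matcher such as {foo="bar"} is answered by looking up the
	// postings list for that exact label pair.
	fmt.Println(postings[`foo="bar"`]) // [1 3 7 9]
}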

Currently, each integer is simply stored using 4 bytes. It’s possible to do much better than that. For example, if you have the integers 1, 2, 3, ..., 10 then it’s enough to say that there is a run of 10 consecutive integers starting at 1. Over time, many more compression techniques have been invented.
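To make that concrete, here is a hedged Go sketch of both ideas: the fixed 4-bytes-per-ID layout (roughly what the index does today, big-endian uint32s) and a toy run-length representation that is purely illustrative, not an actual Prometheus format:

package main

import (
	"encoding/binary"
	"fmt"
)

// encodeRaw stores every series ID in a fixed 4 bytes (big-endian uint32),
// roughly what the current postings format does.
func encodeRaw(ids []uint32) []byte {
	out := make([]byte, 4*len(ids))
	for i, id := range ids {
		binary.BigEndian.PutUint32(out[i*4:], id)
	}
	return out
}

// encodeRun is a toy run-length representation: a single (start, length) pair.
func encodeRun(start, length uint32) []byte {
	out := make([]byte, 8)
	binary.BigEndian.PutUint32(out, start)
	binary.BigEndian.PutUint32(out[4:], length)
	return out
}

func main() {
	ids := []uint32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	fmt.Println(len(encodeRaw(ids)))   // 40 bytes
	fmt.Println(len(encodeRun(1, 10))) // 8 bytes for the same run
}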

I have researched what is available and found that the most popular paper is (probably) this one by Daniel Lemire and others: https://arxiv.org/abs/1401.6399. I love his work in particular because he always publishes the source code alongside his papers. It’s a huge help! I wish more people did that.

We have a few constraints in the Thanos/Prometheus world:

  • We should read posting lists in one direction only, i.e. we shouldn’t need to make two passes over them. Some encoding formats, such as the patched frame-of-reference variants, force the reader to go through the data twice. This constraint exists because, to save a lot of memory, we would like to avoid allocating memory for the whole postings list if possible – in the Thanos world, a list can easily contain hundreds of millions of entries. An iterator-style interface roughly like the one sketched after this list is what the rest of the code consumes.
  • Intersection must be very fast. Prometheus/Thanos will intersect postings far more often than it encodes/decodes them, and it’s not uncommon to have 3+ label matchers in a single query.
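For context, here is roughly the shape of the postings iterator that Prometheus uses (simplified; the real index.Postings interface works on series references rather than plain uint64s). A decoder only has to produce IDs forward and on demand, so a single pass is enough and the whole list never has to be materialized in memory:

// Simplified sketch of Prometheus' postings iterator.
type Postings interface {
	// Next advances to the next series ID and reports whether one exists.
	Next() bool
	// Seek advances to the first ID >= v; intersection relies on this heavily.
	Seek(v uint64) bool
	// At returns the series ID the iterator currently points at.
	At() uint64
	// Err returns any error encountered while decoding.
	Err() error
}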

From all of the things I’ve looked at, S4-BP128-D4 and roaring bitmaps look the most promising. The latter is already used by a lot of similar projects, such as M3DB. The former might not be as popular, but it is specifically designed for SIMD, which gives us very fast encoding/decoding.

I even started writing a Go version of S4-BP128-D4, but I haven’t finished it yet: https://github.com/GiedriusS/go-bp. So, I am opting to try roaring bitmaps first. Even that would be a huge improvement because bitmaps allow very fast intersection through the bitwise AND operation, whereas the current intersection algorithm has to step through every element of the given postings lists.
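To show why the bitwise approach is attractive, here is a small, self-contained example using the github.com/RoaringBitmap/roaring Go library (just the idea, not the Thanos integration):

package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// Postings for two hypothetical matchers, e.g. {foo="bar"} and {job="node"}.
	a := roaring.BitmapOf(1, 3, 7, 9, 100000)
	b := roaring.BitmapOf(3, 4, 9, 100000)

	// The intersection is computed container by container with bitwise ANDs
	// instead of stepping through every element of both lists.
	both := roaring.And(a, b)

	fmt.Println(both.ToArray()) // [3 9 100000]
}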

I recently wrote a small program to compare postings compression on Prometheus index files: https://github.com/thanos-io/postings-analyzer. You can see that it is possible to save around 70% in postings size using S4-BP128-D4 and around 47% using roaring bitmaps. These numbers were consistent across my tests with index files from production. In my case, this would shave about 30% off the whole index file. Most notably, my index files didn’t contain any runs of consecutive numbers, so run-length encoding wasn’t used in the roaring bitmaps – one could argue that I don’t have a diverse enough data set for these tests. Perhaps there is some unusual setup out there where RLE would be useful? I tried to gather sample index files on CNCF Slack to no avail – no one stepped up to upload them for me.

Either way, all of this work is very promising and I hope to have a feature flag in Thanos soon which would allow using roaring bitmaps!

Reproducing flaky Go tests using Linux cgroups and systemd

Sometimes the -race option is not enough to trigger/debug races in Go tests. You might have a time.Sleep() in a test, assuming that some event will surely happen within that time. Then you run the test on a shared GitHub Actions runner and, alas, the event never happens. What could have caused this? Most of the time it is the small amount of CPU time allocated to your tests: the runners are shared between many projects, so CPU time can end up split very thinly between many processes.
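A contrived example of the pattern I mean (hypothetical, not taken from any real code base) – it passes comfortably on an idle laptop, but on a starved runner the goroutine might simply not get scheduled within the sleep:

package example

import (
	"sync/atomic"
	"testing"
	"time"
)

func TestEventuallyDone(t *testing.T) {
	var done atomic.Bool
	go func() {
		// Pretend this is the background work the test waits for.
		done.Store(true)
	}()

	time.Sleep(100 * time.Millisecond) // "surely enough" time for the goroutine to run

	if !done.Load() {
		t.Fatal("event did not happen in time")
	}
}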

Reproducing a limited-CPU environment locally might not be the most straightforward task. Go tests might spawn child processes or even Docker containers in the case of end-to-end tests, and the facilities for limiting available processor time differ between operating systems. In this post, I will show you how to do this easily using systemd-run and https://github.com/efficientgo/e2e. The advice regarding end-to-end tests applies to any e2e testing framework that uses Docker underneath. I have tried using cpulimit(1) (https://manpages.ubuntu.com/manpages/xenial/man1/cpulimit.1.html) before, however it works by continuously sending SIGSTOP/SIGCONT signals to the processes, which is very annoying. Also, in my experience it hogs the CPU for some reason, so it’s a non-starter. Let’s use cgroups instead – they were made exactly for this: controlling the resources available to processes on Linux systems! This is also the exact same mechanism used by Kubernetes.

systemd-run is a nice wrapper for running processes in their own cgroups. It is available in Linux distributions that use the systemd system and service manager; see its manual page, systemd-run(1), for all of the options. I came up with this concoction that runs a Go test in a CPU-limited transient systemd unit:

systemd-run -E GO111MODULE=on -E GOPATH="${GOPATH}" --working-directory="$(pwd)" -p CPUQuota=10% -P -G --user /bin/sh -c '/bin/go test -count=1 -v -timeout 10m -run ^TestStoreGatewayBytesLimit$ github.com/thanos-io/thanos/test/e2e'

This command runs go test in a temporary systemd unit with its standard streams connected to your shell, so it behaves interactively. The -G flag makes systemd clean up that temporary unit once the test exits, which means you can simply run this command over and over in a shell of your choice. CPUQuota=10% is the share of CPU time you want to allocate to this test.

If you are using some kind of e2e testing framework for Go that uses Docker, then you can achieve the same for the spawned containers through the --cpus parameter of docker run. efficientgo/e2e makes this easy for you by providing an environment variable whose value is passed as the --cpus parameter. Simply set E2E_DOCKER_CPUS in the Go test through t.Setenv("E2E_DOCKER_CPUS", ...) or do export E2E_DOCKER_CPUS=0.1 beforehand.
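In a test it could look something like this (a sketch – only the E2E_DOCKER_CPUS variable and t.Setenv are the point here, the test name is made up):

package e2etest

import "testing"

func TestSomethingEndToEnd(t *testing.T) {
	// Tell efficientgo/e2e to start every Docker container with --cpus=0.1,
	// i.e. a tenth of a CPU, for the duration of this test.
	t.Setenv("E2E_DOCKER_CPUS", "0.1")

	// ... the rest of the end-to-end test sets up its environment as usual ...
}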

All in all, Linux provides a great way of limiting the CPU time available to processes through cgroup v2. It is just a matter of putting the processes that run our tests into CPU-time-limited control groups, and this post shows two ways of doing that which I found useful. So far I have fixed one flaky test this way, but there are many more of them. I hope this post will be useful for you too!