Perils of /metrics data assertions

In the Thanos project, the e2e tests are written in part using assertions on /metrics data. I have encountered a few challenges with those assertions that I want to share with you.

Always check for the maximum possible value

Writing tests on /metrics data means that your testing harness continuously scrapes the /metrics page and checks whether some metric equals some value. If the metric takes some time to reach its maximum value, you might erroneously write an equality check against a smaller, intermediate value. Later on, this will lead to a flaky test.

To fix this, I’d suggest running the flaky test locally and writing the assertion against a bigger value, then running the same test again to see what happens. Also, look at the logic inside your code: perhaps it would be possible to calculate the maximum value without using a hard-coded constant?
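To make this concrete, here is a minimal sketch of what such a harness-side check boils down to. It is not the Thanos e2e framework’s actual implementation; the URL, metric name, and helper names are made up for illustration, and it uses the Prometheus text-format parser to sum up the series of one metric:

package e2etest

import (
	"fmt"
	"net/http"
	"testing"
	"time"

	"github.com/prometheus/common/expfmt"
)

// waitForMetricSum polls a /metrics endpoint until the sum of all series of the
// given metric reaches want, or the timeout expires. This is a simplified
// stand-in for what an e2e framework's metric assertion does under the hood.
func waitForMetricSum(t *testing.T, url, metric string, want float64, timeout time.Duration) {
	t.Helper()

	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if sum, err := scrapeSum(url, metric); err == nil && sum == want {
			return
		}
		time.Sleep(200 * time.Millisecond)
	}
	t.Fatalf("metric %q never reached %v within %v", metric, want, timeout)
}

// scrapeSum fetches the /metrics page once and sums every series of metric.
func scrapeSum(url, metric string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return 0, err
	}
	mf, ok := families[metric]
	if !ok {
		return 0, fmt.Errorf("metric %q not found", metric)
	}

	var sum float64
	for _, m := range mf.GetMetric() {
		// For brevity, only gauges and counters are handled here.
		sum += m.GetGauge().GetValue() + m.GetCounter().GetValue()
	}
	return sum, nil
}

The important part is how want is computed in the test: derive it from what the test itself created, for example want := float64(numBlocks * seriesPerBlock), instead of hard-coding a number that the metric merely happened to show at one point in time.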

Check for specific metrics

By default, the end-to-end testing framework only accepts a metric name and sums up all matching series. I would encourage you to be maximally specific, because metrics are usually not transactional. In other words, the reader (in this case, the e2e testing framework) might see results that are in the middle of being updated. For instance, if the state is constantly fluctuating, the sum of all matching series will keep changing and might never equal the desired value. Imagine that we have some code like this (pseudo-code):

foometric.set(123)
// <--- E2E testing framework request for /metrics page comes in here.
barmetric.set(5000)

If the metrics weren’t set previously, then depending on when the scrape comes in, the sum can be either 123 or 5123. Hence, the best practice is usually to be as specific as possible with the label matchers; the framework provides an option for passing them to the assertion, and you can use it to do exactly that.
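For example, with the efficientgo/e2e monitoring helpers that the Thanos e2e tests build on, an assertion scoped to a single, fully-labelled series could look roughly like the sketch below. Treat it as an assumption to verify against your framework version: the package paths, the WaitSumMetricsWithOptions/WithLabelMatchers names, and the metric and label values are illustrative and may differ in your setup.

package e2etest

import (
	"testing"

	"github.com/efficientgo/e2e/matchers"
	e2emon "github.com/efficientgo/e2e/monitoring"
)

// assertFooMetric waits for one specific series of "foometric" instead of the
// sum of every series sharing that name. Metric and label names are made up.
func assertFooMetric(t *testing.T, r *e2emon.InstrumentedRunnable) {
	t.Helper()

	// Exact signatures may differ between efficientgo/e2e versions.
	err := r.WaitSumMetricsWithOptions(
		e2emon.Equals(123),
		[]string{"foometric"},
		e2emon.WithLabelMatchers(matchers.MustNewMatcher(matchers.MatchEqual, "component", "foo")),
	)
	if err != nil {
		t.Fatal(err)
	}
}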

Reproducing flaky Go tests using Linux cgroups and systemd

Sometimes, the -race option might not be enough to trigger or debug races in Go tests. You might have a time.Sleep() in a test, thinking that some event will surely happen within that time. You then run it on GitHub Actions on a shared runner and, alas, you don’t see that event happen. What could have caused this? Most of the time it is the limited CPU time allocated to your tests: the runners are shared between many projects, so the CPU can end up split very thinly between many processes.
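As a contrived illustration (not taken from the Thanos code base), a test like the following passes comfortably on an idle laptop but can fail once the test process is starved of CPU, because the worker goroutine no longer finishes within the hard-coded sleep:

package example

import (
	"sync/atomic"
	"testing"
	"time"
)

// TestEventuallyDone assumes 100ms is always plenty of time for the worker.
// Under a tight CPU quota the CPU-bound work takes much longer in wall-clock
// time, and the assertion fails even though the code under test is correct.
func TestEventuallyDone(t *testing.T) {
	var done atomic.Bool

	go func() {
		// Stand-in for CPU-bound work that takes a few milliseconds of CPU
		// time on an idle machine.
		sum := 0
		for i := 0; i < 50_000_000; i++ {
			sum += i
		}
		_ = sum
		done.Store(true)
	}()

	time.Sleep(100 * time.Millisecond) // "surely enough time"... until it isn't.

	if !done.Load() {
		t.Fatal("worker did not finish in time")
	}
}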

Reproducing a limited-CPU environment locally might not be the most straightforward task. Go tests might spawn child processes or, in the case of end-to-end tests, even Docker containers, and the facilities for limiting available processor time differ between operating systems. In this post, I will show you how to do this easily using systemd-run and https://github.com/efficientgo/e2e; the advice regarding end-to-end tests is applicable to any e2e testing framework that uses Docker underneath.

I have tried cpulimit(1) (https://manpages.ubuntu.com/manpages/xenial/man1/cpulimit.1.html) before, however it works by continuously sending SIGSTOP/SIGCONT signals to the processes, which is very annoying. Also, from my experience, it hogs the CPU for some reason, so it’s a non-starter. Let’s use cgroups instead, which were made for exactly this: controlling the resources available to processes on Linux systems! This is also the exact same mechanism used by Kubernetes.

systemd-run is a nice wrapper for running processes in individual cgroups. It is available in Linux distributions that use systemd as their service manager; see its manual page, systemd-run(1). I came up with this concoction that runs a Go test in a CPU-limited transient systemd unit:

systemd-run -E GO111MODULE=on -E GOPATH="${GOPATH}" --working-directory="$(pwd)" -p CPUQuota=10% -P -G --user /bin/sh -c '/bin/go test -count=1 -v -timeout 10m -run ^TestStoreGatewayBytesLimit$ github.com/thanos-io/thanos/test/e2e'

This command runs go test in a temporary (transient) systemd unit: -P (--pipe) connects the unit’s standard input and output to your shell, so you can interact with it, and -G (--collect) makes systemd clean up the temporary unit after the test, meaning that you can simply run this command over and over in a shell of your choice. CPUQuota=10% is the amount of CPU time you want to allocate to this test, relative to a single CPU.

If you are using some kind of e2e testing framework for Go that uses Docker, then you can achieve this for the spawned containers through the --cpus parameter of docker run. efficientgo/e2e makes this easy for you by providing the ability to set an environment variable that is used as the value for the --cpus parameter: simply set E2E_DOCKER_CPUS in the Go test through t.Setenv("E2E_DOCKER_CPUS", ...) or do export E2E_DOCKER_CPUS=0.1 beforehand.
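Here is a minimal sketch of the in-test variant; the test name and body are hypothetical, and only the E2E_DOCKER_CPUS handling comes from the framework as described above:

package e2etest

import "testing"

func TestStoreGatewayUnderTightCPU(t *testing.T) {
	// Limit every container that the e2e framework starts during this test to
	// a tenth of a CPU. t.Setenv restores the previous value when the test
	// ends and is incompatible with t.Parallel().
	t.Setenv("E2E_DOCKER_CPUS", "0.1")

	// ... set up the e2e environment and run the actual assertions here ...
}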

All in all, the Linux operating system provides a great way of controlling the CPU time available to processes through cgroup v2. It is just a matter of putting the processes that run our tests into CPU-time-limited control groups, and this post provides two ways of doing that which I found useful. So far I have fixed one flaky test this way, but there are many more of them. I hope this post will be useful for you too!