Dynamic hashrings with WAL in Thanos are bad

Hashrings are everywhere in distributed systems. Combined with a write-ahead log (WAL) whose replay uses an unbounded amount of RAM, they are a terrible idea. If a constant stream of metrics is coming into Thanos, you want to push back on the producers of those metrics when one or a few nodes are down. A dynamic hashring, i.e. one that updates immediately as soon as it notices that a node is down, prevents that: the same stream now goes to fewer nodes, so each remaining node accumulates a bigger WAL and needs even more RAM to replay it. And so on until, most likely, your whole stack collapses. In practice, this means that you should not remove a node from the hashring just because it is unhealthy or unready.

In short: the problem is unbounded memory usage. It can be fixed either by limiting memory usage or by not having a dynamic hashring.
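The cascade can be sketched with a tiny simulation. This is hypothetical code, not Thanos: the node count, per-node capacity, and stream rate are made-up numbers, and each overloaded node is assumed to OOM and be dropped from the hashring immediately.

```python
def per_node_load(stream_rate, healthy_nodes):
    """Samples/sec each remaining node must absorb."""
    return stream_rate / healthy_nodes

def surviving_nodes(stream_rate, nodes, capacity):
    """Dynamic hashring: an overloaded node crashes and is removed right away,
    redistributing the same stream over fewer and fewer nodes."""
    while nodes > 0 and per_node_load(stream_rate, nodes) > capacity:
        nodes -= 1  # node OOMs during WAL replay; the hashring updates instantly
    return nodes

# 3 nodes, each able to ingest 100 samples/s, with 240 samples/s incoming:
print(surviving_nodes(240, 3, 100))  # 3 -- 80 samples/s per node, all healthy
# One node goes down for maintenance, leaving 2:
print(surviving_nodes(240, 2, 100))  # 0 -- 120 > 100, the whole ring collapses
```

With a static hashring, the stream destined for the downed node would be buffered or rejected, pushing back on producers instead of overloading the survivors.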

A picture is worth a thousand words so let’s show what it looks like:

Diagram showing the cascading failure with a dynamic hashring

The same thing could happen with Loki. However, Loki lets you specify the maximum amount of memory it may use during WAL replay. If you are running into issues, consider lowering this setting to match your requirements:

wal:
  [replay_memory_ceiling: <int> | default = 4GB]
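Loki's actual replay code is more involved; the following is only a sketch of the idea behind a replay memory ceiling (the `replay_wal` helper and the entry sizes are invented for illustration). Instead of buffering the whole WAL in RAM, track an estimate of memory in use and flush to storage whenever it crosses the ceiling.

```python
def replay_wal(entries, ceiling_bytes):
    """Replay WAL entries, flushing buffered data whenever the estimated
    memory footprint exceeds the ceiling. Returns the number of flushes."""
    buffered, flushes, used = [], 0, 0
    for entry, size in entries:
        buffered.append(entry)
        used += size
        if used > ceiling_bytes:
            buffered.clear()  # flush buffered series to long-term storage
            flushes += 1
            used = 0
    return flushes

# 10 entries of 1 KiB each with a 4 KiB ceiling: the replay flushes twice
# instead of holding all 10 KiB resident at once.
print(replay_wal([(i, 1024) for i in range(10)], 4 * 1024))  # 2
```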

Perils of /metrics data assertions

In the Thanos project, the e2e tests are written in part as assertions on /metrics data. I have encountered a few challenges with those assertions that I wanted to share with you.

Always check for the maximum possible value

Writing tests on /metrics data means your testing harness continuously polls the /metrics page to see whether some metric equals some value. If the metric takes some time to reach its maximum value, you might erroneously write an equals check against a smaller, intermediate value. Later on, this leads to a flaky test.

To fix this, I’d suggest running the flaky test locally and writing an assertion for the final, bigger value. Then run the same test again and see what happens. Also, look at the logic inside of your code – perhaps the maximum value could be calculated instead of hard-coding a constant?
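One robust pattern is to poll until the metric reaches its expected final value, rather than doing a one-shot equality check against a possibly intermediate one. The `wait_for_metric` helper below is hypothetical, not the Thanos e2e framework’s API:

```python
import time

def wait_for_metric(read_metric, expected, timeout=5.0, interval=0.05):
    """Poll read_metric() until it equals the expected final value,
    or give up after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_metric() == expected:
            return True
        time.sleep(interval)
    return False

# Simulate a counter that only reaches its maximum after a few scrapes:
values = iter([1, 2, 3])
assert wait_for_metric(lambda: next(values), 3)  # passes once the value settles
```

An equals check taken at the first scrape would have compared against 1 and flaked; polling tolerates the intermediate values.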

Check for specific metrics

By default, the end-to-end testing framework only accepts a metric name and sums up all matching series. I would encourage you to be as specific as possible, because metric updates are usually not transactional. In other words, the user (in this case, the e2e testing framework) might see results that are in the middle of being updated. For instance, if the state is constantly fluctuating, the sum will change, and the sum of all matching series might never equal the desired value. Imagine that we have some code like this (pseudo-code):

foometric.set(123)
// <--- E2E testing framework request for /metrics page comes in here.
barmetric.set(5000)

If the metrics weren’t set previously, the sum can be either 123 or 5123, depending on when the scrape lands. Hence, the best practice is usually to be as specific as possible with the label matchers. You can use this function to do that.
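To illustrate the idea, here is a sketch (not the real framework’s API; the metric name and labels are invented) of summing only the series that match specific label matchers instead of everything sharing a name:

```python
# Each series is (metric name, labels, sample value), as if scraped from /metrics.
series = [
    ("http_requests_total", {"handler": "query"}, 123),
    ("http_requests_total", {"handler": "receive"}, 5000),
]

def sum_matching(series, name, labels=None):
    """Sum only the series whose name and labels all match."""
    labels = labels or {}
    return sum(
        value
        for n, series_labels, value in series
        if n == name
        and all(series_labels.get(k) == v for k, v in labels.items())
    )

print(sum_matching(series, "http_requests_total"))                        # 5123
print(sum_matching(series, "http_requests_total", {"handler": "query"}))  # 123
```

The unlabelled sum depends on which series have been updated at scrape time; the label-matched assertion checks exactly one series and stays stable.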