What Is New In Thanos 0.5.0

Thanos logo

Thanos 0.5.0-rc.0 has recently been released and the final 0.5.0 version is just around the corner. It is a good opportunity to look back at what has changed since the last version. I hope I can offer a good perspective, since I have recently been appointed as a maintainer of the project. Thanks to everyone who contributed pull requests and bug reports 💖 This could not have been done without you.

As always, some things might still change between the RC and the final release, so keep an eye on the official changelog.

Removal of Gossip

This is a huge release in terms of gossip: it is gone. Before this, Thanos nodes communicated with each other over a gossip protocol to determine where queries should go. This has been replaced by file-based and DNS service discovery (SD), akin to what Prometheus offers.

Thus the complexity of deployments has been greatly reduced and the code base has become much clearer. Some flaky tests were also removed in the process: Circle CI's servers would sometimes lag a bit and certain deadlines were exceeded. 🎉

To find out more about file and DNS SD, please refer to this documentation.

Prometheus / TSDB dependencies update

One of the ways Thanos uses Prometheus code is via the libraries that the Prometheus project produces. They need to be updated periodically since new versions of Prometheus come out rapidly.

In this 0.5.0 release, Prometheus compatibility was bumped to 2.9.2 (2.10.0 is already out, however we only test with 2.9.0 and older versions 😮). The newer library versions bring a bunch of performance improvements, along with some minor fixes for file descriptor leaks in the error paths of some functions.

The thing that I love the most about the dependency updates is the new, updated web UI. Before, it was really hard to use since clicking anywhere would make your browser start loading a lot of data. Now it is smooth as butter.

Updated minio-go, new S3 options

The minio-go library that is used for communicating with S3 remote object storage has been updated to a new version. It fixes some errors with regard to retrying: when certain HTTP status codes were returned, minio-go thought they were not retryable even though they are. This should fix some problems users were seeing where their Thanos Compactor suddenly restarted.

Also, the ability to modify the timeout for waiting for response headers was added. If you still hit those problems even with all of the fixes, most likely what you need is to increase this timeout: sometimes the network just lags a little and you need a larger value.

Moved to Go 1.12.5

The Go version that was used to build 0.4.0 came with new memory allocator behaviour. Roughly speaking, it started using madvise differently, which led to reported memory usage being a bit higher because freed memory was not released back to the operating system as quickly. In Go 1.12.5 this has been improved a lot and the behaviour is mostly back to the same characteristics as before.

You can find more information here.

Swift: cross-domain authentication added

Swift is OpenStack's object storage technology, and some time ago a new API was rolled out that is not backward-compatible. With it, userDomainID, userDomainName, projectDomainID, and projectDomainName were added. The outdated terms tenantID and tenantName are deprecated and have been replaced by projectID and projectName. To find out more, please check out the OpenStack documentation.

Critical index cache fixes

One critical bug in the index cache has been fixed. It is essentially a classic time-of-check to time-of-use race: we were doing a Get operation (lock -> get -> unlock) and then a separate Set operation (lock -> set -> unlock), so another goroutine could slip in between the two. It is really hard to spot, though.
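To make the race concrete, here is a minimal sketch in Go. It is not the actual Thanos index cache code – the cache type and its size accounting are made up – but it shows the pattern: checking in one critical section and updating in another lets two goroutines interleave, while doing both under a single lock does not.

```go
// Package indexcache: a sketch of a time-of-check to time-of-use race.
package indexcache

import "sync"

type cache struct {
	mtx   sync.Mutex
	items map[string][]byte
	size  int
}

// racyAdd checks and updates in two separate critical sections. Between the
// Get-like check and the Set-like update, another goroutine may insert the
// same key, so `size` can end up double-counted.
func (c *cache) racyAdd(key string, val []byte) {
	c.mtx.Lock()
	_, ok := c.items[key] // "Get": lock -> get -> unlock
	c.mtx.Unlock()

	if !ok {
		c.mtx.Lock()
		c.items[key] = val // "Set": lock -> set -> unlock
		c.size += len(val)
		c.mtx.Unlock()
	}
}

// safeAdd holds the lock across the whole check-and-set, so the size
// accounting stays consistent no matter how goroutines interleave.
func (c *cache) safeAdd(key string, val []byte) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, ok := c.items[key]; !ok {
		c.items[key] = val
		c.size += len(val)
	}
}
```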

Also, tests were added which cover the case of updating an existing item in the cache. In sum, I hope this means there will finally be no more bugs in that area of Thanos.

Sidecar is no longer blocking for custom Prometheus builds

Sidecar recently got a new feature where /api/v1/flags is checked to see if certain flags are configured as they are supposed to be. However, before that, the Prometheus version was checked. Unfortunately, that does not always work since users can build their own Prometheus binaries with custom version strings.

This use case has been accounted for in this version. Now we just query that endpoint no matter what, without checking the version first. If a 404 is returned, we simply skip this step and log a message about it.
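Here is a rough sketch of the idea in Go. It is hypothetical code rather than the actual sidecar implementation, but it shows the shape of it: hit /api/v1/flags directly and treat a 404 as a signal to skip the validation.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// checkPrometheusFlags queries /api/v1/flags and skips validation when the
// endpoint does not exist (custom or older Prometheus builds).
func checkPrometheusFlags(promURL string) error {
	resp, err := http.Get(promURL + "/api/v1/flags")
	if err != nil {
		return fmt.Errorf("requesting flags: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotFound {
		log.Println("Prometheus does not expose /api/v1/flags, skipping flag validation")
		return nil
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading flags response: %w", err)
	}

	// The real code would parse this JSON and validate the specific flags
	// Thanos cares about; here we only show the skip-on-404 behaviour.
	log.Printf("got flags payload of %d bytes", len(body))
	return nil
}

func main() {
	if err := checkPrometheusFlags("http://localhost:9090"); err != nil {
		log.Fatal(err)
	}
}
```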

Thanos Store now properly selects resolutions of blocks

There was an issue where, with downsampled data and the --query.auto-downsampling parameter turned on, data was sometimes not properly returned. Essentially, one function did not account for possible gaps in downsampled data.

Property-based tests for these cases were added using gopter.
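For illustration, such a property-based test with gopter looks roughly like this. The selectResolution function below is a made-up stand-in for the kind of logic under test (picking the coarsest resolution that does not exceed the requested maximum), not the actual Thanos code:

```go
package resolution

import (
	"testing"

	"github.com/leanovate/gopter"
	"github.com/leanovate/gopter/gen"
	"github.com/leanovate/gopter/prop"
)

// selectResolution picks the coarsest available resolution that does not
// exceed maxRes, falling back to raw data (resolution 0).
func selectResolution(available []int64, maxRes int64) int64 {
	best := int64(0)
	for _, res := range available {
		if res <= maxRes && res > best {
			best = res
		}
	}
	return best
}

func TestSelectResolutionNeverExceedsRequested(t *testing.T) {
	properties := gopter.NewProperties(nil)

	properties.Property("selected resolution never exceeds the requested maximum", prop.ForAll(
		func(maxRes int64) bool {
			// 0 (raw), 5m and 1h in milliseconds.
			available := []int64{0, 5 * 60 * 1000, 60 * 60 * 1000}
			return selectResolution(available, maxRes) <= maxRes
		},
		gen.Int64Range(0, 2*60*60*1000),
	))

	properties.TestingRun(t)
}
```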

Thanos Query handles duplicated stores

A particular edge case was fixed where nodes could potentially change their external labels in place and the UI did not account for that. Now we check for duplicate external labels before checking whether a node which implements the Store API already exists.
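A simplified sketch of what deduplication by external labels means in practice (the store type and addresses here are made up, not the actual Thanos storeset code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type store struct {
	addr           string
	externalLabels map[string]string
}

// labelsKey builds a deterministic key from a label set so that two stores
// advertising the same external labels map to the same key.
func labelsKey(lset map[string]string) string {
	parts := make([]string, 0, len(lset))
	for k, v := range lset {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// dedupStores keeps only the first store seen for every distinct label set.
func dedupStores(stores []store) []store {
	seen := map[string]struct{}{}
	out := make([]store, 0, len(stores))
	for _, s := range stores {
		key := labelsKey(s.externalLabels)
		if _, ok := seen[key]; ok {
			continue // duplicate external labels: skip this store
		}
		seen[key] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	stores := []store{
		{addr: "10.0.0.1:10901", externalLabels: map[string]string{"cluster": "eu1", "replica": "a"}},
		{addr: "10.0.0.2:10901", externalLabels: map[string]string{"replica": "a", "cluster": "eu1"}}, // duplicate
	}
	fmt.Println(dedupStores(stores))
}
```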

Minor (?) RAM usage improvements

Uses of json.Decoder were converted to json.Unmarshal. One user reported huge improvements, however from my small ad-hoc test it seems there is not much of a difference (maybe just a little bit). In general, this is a good change because the former is better suited for JSON streams, whereas Thanos uses no such thing – we download JSON files in full and then unmarshal them. Find more information here.
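For reference, here is a small, self-contained illustration of the two approaches. The meta struct and payload are made up; the point is only that json.NewDecoder is meant for streams while json.Unmarshal works directly on a byte slice you already hold in memory:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type meta struct {
	Version int    `json:"version"`
	ULID    string `json:"ulid"`
}

func main() {
	payload := []byte(`{"version": 1, "ulid": "01D8AXCDM0ZF7BNSJ2185QWDY5"}`)

	// json.Decoder shines when the input is a stream you do not want to
	// buffer in full (e.g. reading straight from an http.Response body).
	var viaDecoder meta
	if err := json.NewDecoder(bytes.NewReader(payload)).Decode(&viaDecoder); err != nil {
		panic(err)
	}

	// When the whole file is already in memory, json.Unmarshal does the same
	// job without the extra decoder machinery.
	var viaUnmarshal meta
	if err := json.Unmarshal(payload, &viaUnmarshal); err != nil {
		panic(err)
	}

	fmt.Println(viaDecoder, viaUnmarshal)
}
```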

Push Vs. Pull In Monitoring Systems

Intro

There are two different schools of thought about how metrics are ingested into a monitoring system: either the metrics get pushed (usually via UDP) into the system, or they get pulled (usually via HTTP). To keep the article terse, I will have these transports in mind whenever I say “push-based” or “pull-based” systems, even if I do not spell it out every time.

The push method is used in systems such as Graphite whereas the pull method is used by monitoring systems like Prometheus.

Which one is better? Just like with everything in life, there is no clear-cut answer and both sides have very strong arguments in their favor. I will try to walk through them.

This will mostly be a rehash of the arguments that I presented at the Kaunas University of Technology when I introduced the Prometheus monitoring system.

Arguments

In Favor Of Pull: Easier To Control The Authenticity and Amount of Data

When pulling the data we can be sure of its authenticity, since it is the monitoring server itself that initiates the connection. I think this makes the data path much clearer: most users nowadays sit behind routers sharing a public IP address, and it is easy to be mistaken about where the data actually came from.

Let me try to clarify this point. With TCP pull-based systems, the metrics need to be directly accessible, i.e. the port on which the metrics data is served is always listening, whereas in a push-based system short-lived connections are used which appear and disappear very swiftly.

Prometheus scrape target list

Also, it is easier to plan the capacity of pull-based systems since the exact targets from which metric data will be gathered are known in advance. On the other hand, with push-based systems, any kind of system can push to the metric-gathering server. This could be fixed by using a whitelist of servers from which to accept data, but most push-based systems do not support that. Plus, we are considering the characteristics of the two models here, not their implementations.

In Favor Of Push: Easier To Implement Replication To Different Ingestion Points

Since it is all initiated by the client itself, it becomes easier to replicate the same traffic to different servers: you just need to transmit it to more than one target IP address.
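For example, a client pushing in Graphite's plaintext protocol could replicate a sample to several Carbon endpoints like this (the addresses and metric name are made up):

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	targets := []string{"carbon-a.example.com:2003", "carbon-b.example.com:2003"}

	// Graphite's plaintext protocol: "<metric path> <value> <timestamp>\n".
	line := fmt.Sprintf("webapp.requests.count 42 %d\n", time.Now().Unix())

	// Send the exact same sample to every configured receiver.
	for _, addr := range targets {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err != nil {
			log.Printf("could not push to %s: %v", addr, err)
			continue
		}
		if _, err := conn.Write([]byte(line)); err != nil {
			log.Printf("write to %s failed: %v", addr, err)
		}
		conn.Close()
	}
}
```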

One of the most popular push-based monitoring systems, Graphite, has this on its website:

“Easy” transmission is one of the most prominent Graphite features

One of its components, Carbon, has things such as a replication factor, a relay method, and so on, which make it easy to start doing exactly that. It is much easier than standing up another instance of, for example, Prometheus.

Also, consider the fact that all of the receivers will get the exact same data. If you were to spin up two different instances of Prometheus (which uses the HTTP pull method), they would most likely not have the exact same data.

Firstly, the timestamps will be different. In the case of Graphite, the timestamp must be encoded inside the data (it is optional in Prometheus). What is more, the values of the time series will most likely differ as well, since the scrapes will rarely happen at the same instant due to the jitter added at the start of scraping.

In Favor Of Pull: Easier to Encrypt The Traffic

It is very easy to put a TLS-terminating reverse proxy in front of an ordinary HTTP server which serves metrics. We could even use something like Let's Encrypt to automatically get a certificate if it is a public-facing system, or a certificate from a private CA that everyone on your intranet trusts. Software like Caddy makes it as easy as it could get.

Yes, it is also possible to use client-side TLS but it is error-prone and adds a lot of clutter to the code base. What would you rather have:

  • A simple HTTP server in your code or…
  • A client that supports client-side TLS to connect to another server to send your metrics there?

Most people would opt for the first option. The reasons why doing this encryption in the client software is a bad idea are the same as why client-side TLS is bad in general. For example, you could look at this article on the reasons why, or at this answer by Polynomial on Stack Overflow:

The primary reason is that 95% of internet users have no idea what a client-side certificate is, let alone how to use one. Some users can barely manage to use usernames and passwords, and most still don’t bother with two-factor authentication. It’s also a hassle to install a client certificate on separate devices (desktop, laptop, tablet, smartphone, etc.) for authentication to a single service.

I would argue that more or less the same story applies to us programmers: we would also love to move that encryption complexity out of our client code and into a separate server. This is only feasible with the pull-based model.
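To show how little the "simple HTTP server" option asks of your code, here is a minimal Go sketch using the Prometheus client library. It serves metrics over plain HTTP on localhost and assumes a TLS-terminating reverse proxy (Caddy, for instance) sits in front of it; the metric name and addresses are made up:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "webapp_requests_total",
		Help: "Total number of handled requests.",
	})
	prometheus.MustRegister(requests)

	// Metrics endpoint that the monitoring server pulls from.
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requests.Inc()
		w.Write([]byte("hello"))
	})

	// Plain HTTP on localhost; the TLS-terminating proxy listens on :443 and
	// forwards traffic to this port.
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```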

In Favor Of Push: Easy To Model Short-Lived Batch Jobs

In the push method, the client itself pushes the metrics to the server. In the pull method, on the other hand, the server periodically probes the clients and gathers their metrics. In Prometheus, the time between probes is called the scrape interval. This has a (painful) consequence: if the client does not survive for longer than that interval, its metrics are lost. This picture shows how the loop works:

Prometheus scrape loop and alerts explained

In the push method, we do not have this problem since we can send the metrics whenever a batch job finishes. Of course, Prometheus tries to solve this too: there is what is called the Pushgateway.

Essentially, it is a receiver of metrics that periodically gets scraped by Prometheus – a kind of "Graphite inside Prometheus". The graphite-exporter works the same way.
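As an illustration, a batch job using the official Go client library could push its completion time to a Pushgateway roughly like this (the Pushgateway URL and job name are made up):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// A gauge recording when the batch job last finished successfully.
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "batch_job_last_success_timestamp_seconds",
		Help: "Unix timestamp of the last successful batch job run.",
	})
	completionTime.SetToCurrentTime()

	// Push once at the end of the job; the Pushgateway holds the value until
	// Prometheus scrapes it on its own schedule.
	if err := push.New("http://pushgateway.example.com:9091", "nightly_backup").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```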

However, these have their own problems. For example, metrics might disappear if the Pushgateway goes down, or metric values might get lost if clients update them faster than Prometheus can scrape them.

The push method – and Graphite, by extension – does not suffer from this problem.

In Favor Of Pull: Easier To Retrieve Data On Demand (And Debug)

Having a pull method on top of TCP (HTTP) means that it is very easy to retrieve data on demand and debug problems, especially if the metrics data is human-readable and easy to understand, like the format used by Prometheus.

This gives you the opportunity to easily distinguish between errors on the client side and the server side. With the push method, our hands would be somewhat tied behind our backs, because if we were not receiving any metrics it could mean one of two things:

  • there is something wrong with the network
  • there is something wrong with the client

With the pull (TCP/HTTP) method, we could easily distinguish between these two by simply pointing our web browser at the IP address and port where the metrics data is exposed.

If we got a TCP connection reset, it would mean that the network is OK but something is wrong with the client. If we got no response whatsoever, it would mean that something is wrong with the network. Of course, this depends on the client machine sending back a TCP RST when a port is closed, but that is how the majority of machines act.
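A small Go sketch of that debugging step (the target address is made up): dial the metrics port and look at which way it fails.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "10.0.0.5:9100", 3*time.Second)

	var nerr net.Error
	switch {
	case err == nil:
		conn.Close()
		fmt.Println("port is open: client and network look fine")
	case errors.Is(err, syscall.ECONNREFUSED):
		// The host answered with a TCP RST: the network works, the client does not.
		fmt.Println("connection refused: network OK, something is wrong with the client")
	case errors.As(err, &nerr) && nerr.Timeout():
		// No answer at all within the deadline: likely a network problem.
		fmt.Println("timed out: something is wrong with the network")
	default:
		fmt.Printf("other error: %v\n", err)
	}
}
```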

In Favor Of Push: Might Potentially Be More Performant

Push methods typically use UDP, whereas pull methods are based on TCP (HTTP). This means that we could potentially push metrics more cheaply than we pull them, because there is far less overhead in managing UDP traffic: for example, there is no need to check whether the message you sent to your peer was actually received, and in the correct order.
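A tiny Go example of such a push (the metric name and address are made up) – note that there is no handshake and no acknowledgement at all:

```go
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// "Dialing" UDP only records the destination; no handshake happens.
	conn, err := net.Dial("udp", "statsd.example.com:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// StatsD-style counter increment. The datagram may silently be lost and
	// nobody will retransmit it, which is exactly where the low overhead
	// comes from.
	if _, err := fmt.Fprint(conn, "webapp.requests:1|c"); err != nil {
		log.Fatal(err)
	}
}
```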

However, with TCP support baked into many commodity network cards, and operating systems that take advantage of that hardware acceleration everywhere, the overhead is probably not as big as it would have been back in the 90s, for example.

Conclusion

Both of these models have their pros and cons. However, it seems that the pull-based model has won, since it offers a little more reliability (especially when talking about very large-scale deployments) and needs slightly fewer workarounds to satisfy all of the possible metrics-gathering use cases.

It is probably not without reason that a system such as Prometheus, a descendant of the Borgmon monitoring system, became so popular. And, as we know, Borgmon was used at Google to monitor the work-scheduling system called Borg, which later inspired the system that we all know and love – Kubernetes.