Custom Metric Retention Periods on Thanos

Setting custom metric retention periods on Thanos is one of the longest-standing feature requests that we have had: https://github.com/thanos-io/thanos/issues/903. It seems like there is still no built-in solution in sight, but it is actually already possible to have custom retention periods for different metrics. The idea is quite simple, though it can be hard to implement if you do not have convenient deployment tooling in place. You can achieve custom retention periods for different metrics in the following way:

  • Designate retention as a special (external) label that controls how long metrics should be kept, i.e. ensure that no scraped metrics use this label name themselves
  • Send metrics that need longer retention over remote write to Thanos Receive instances that have the retention external label set to the desired retention period
  • Set up multiple Thanos Compactor instances with different retention periods, each picking up only the blocks with its respective external label
  • Add retention as another deduplication (replica) label on Thanos Query

In the end, all of your blocks should have some retention value as an external label, and you should have a separate Thanos Compactor for each value of that retention label.

Note that this whole setup assumes that you will not want to change the default retention for a large share of your metrics. In my experience that is true in most cases. It is only anecdotal data, but most of the time you will want around 30 to 60 days of retention by default, with some people wanting about a year's worth if they are doing some kind of analytics on that data, e.g. trying to predict the number of requests. If you do want to change the retention of a large share of metrics, then this simple setup will not work and you will need to scale the receiving side, i.e. the Receivers. But that is out of the scope of this article.

Also, ideally you would avoid remote writing anything at all and let the Sidecar do its work, with multiple Prometheus+Sidecar pairs, each having its own retention label. However, that is not so easy for most people who do not have advanced configuration management set up on their systems.
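For illustration, such a Sidecar-based setup could use one Prometheus configuration per retention stream; the retention values below follow the rest of this article, everything else about the files is hypothetical:

# prometheus-default.yml - the default-retention Prometheus
global:
  external_labels:
    retention: 1mo

# prometheus-longterm.yml - the long-retention Prometheus
global:
  external_labels:
    retention: 12mo

Each Prometheus then runs with its own Thanos Sidecar uploading blocks to object storage, and no remote write is needed.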

The rest of the article focuses on a hacky way to achieve multiple retention periods for different metrics with the constraint that only one Prometheus node is in the picture.

Here is how this setup looks as a diagram:

[Diagram: Prometheus remote writing to per-retention Thanos Receive instances, each retention stream compacted by its own Thanos Compactor and deduplicated on Thanos Query]

Let’s walk through the most important parts:

  • External labels and metric_relabel_configs configuration on Prometheus. First, we need to set the retention label to a value such as 1mo, which will indicate the default retention for metrics. There may be some extra external labels; that does not matter in our case. Specify the default retention with:
global:
  external_labels:
    retention: 1mo
  • Set up Thanos Receive with “tenants” such as 12mo:
--receive.tenant-label-name="retention" --receive.default-tenant-id="12mo" --label=...

Add your extra external labels such as fqdn to identify this Thanos Receive node.
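Put together, the Receive invocation could look roughly like the following sketch; the addresses, paths, and the fqdn value are placeholders:

thanos receive \
  --remote-write.address=0.0.0.0:19291 \
  --tsdb.path=/var/thanos/receive \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --receive.tenant-label-name="retention" \
  --receive.default-tenant-id="12mo" \
  --label=fqdn="receive-12mo.example.com"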

  • Set up remote writing to Thanos Receive in the Prometheus configuration. For example:
remote_write:
    - url: http://localhost:19291/api/v1/receive
  • Edit your Thanos Query to include retention as the deduplication label:
query ... --query.replica-label=retention
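If you already deduplicate on a replica label (replica here is just an example name), the flag can simply be repeated:

query ... --query.replica-label=replica --query.replica-label=retention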
  • Set up a separate Thanos Compactor for each different retention, each with its own relabel config. Here is an example for 12mo:
    - source_labels:
      - retention
      regex: "12mo"
      action: keep

And then you need to have the respective retention configuration on that Thanos Compactor:

--retention.resolution-1h=365d --retention.resolution-raw=365d --retention.resolution-5m=365d --selector.relabel-config=...

This assumes that there are 365 days in a year.

Repeat this configuration for each different retention external label that you might have.
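For example, the Compactor responsible for the default 1mo stream could keep only its own blocks; here I am treating 1mo as 31 days, which is just my assumption:

    - source_labels:
      - retention
      regex: "1mo"
      action: keep

--retention.resolution-raw=31d --retention.resolution-5m=31d --retention.resolution-1h=31d --selector.relabel-config=...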

At this point, all of the metrics with the extra retention are duplicated: once locally on Prometheus and once in Thanos Receive via remote write. Consider following the last point in this post to avoid that.

  • (Optional) Enable metric_relabel_configs on your scrape target(s) to avoid ingesting metrics with certain label names/values. Alternatively, you can use write_relabel_configs so that only metrics matching certain patterns are sent to the remote write storage. For example, to only send metrics with the label tenant="Team Very Important" to the external storage with 12mo retention, add the following configuration (a sketch of the metric_relabel_configs variant follows after the snippet):
remote_write:
    - url: http://localhost:19291/api/v1/receive
      write_relabel_configs:
      - source_labels: [tenant]
        regex: Team Very Important
        action: keep
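For the metric_relabel_configs variant, a minimal sketch could look like the following; the job name, target, and dropped metric pattern are made up for illustration:

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_scrape_collector_.*'
        action: drop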

You could also work around this problem by having separate scrapers and some external system that feeds targets into your Prometheus instances according to the desired retention, using file_sd_configs or some other mechanism, as mentioned at the beginning of the article.
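A rough sketch of that file-based approach, with made-up paths, could be:

scrape_configs:
  - job_name: retention-12mo
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/retention-12mo/*.json
        refresh_interval: 5m

An external system would then write the target files into the per-retention directories.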

As a last alternative, consider using the Prometheus Agent mode to keep minimal storage on disk and send everything over remote write to the Thanos Receivers.
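On newer Prometheus versions, agent mode is enabled with a feature flag; a minimal sketch with a placeholder config path:

prometheus \
  --enable-feature=agent \
  --config.file=/etc/prometheus/prometheus.yml

The configuration then only needs scrape_configs and remote_write, since agent mode does not serve queries from local storage.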

I hope this helps. Let me know if you have any comments or suggestions!

What Is The Optimal Split Interval For query-frontend?

I ran into an interesting problem lately with the query-frontend component of Thanos. Here is how --query-range.split-interval is described in the --help output:

[Screenshot: part of the query-frontend --help output describing --query-range.split-interval]

Computer Science is full of trade-offs and this is one of them. Currently, the Prometheus engine executes each query in its own, single goroutine, as I have described here some time ago. Splitting lets you work around that limitation because each PromQL query only looks for samples in its given range (unless you use the negative offset feature flag). So, with a smaller split interval you can leverage more CPU cores, but at the cost of more round-trips to retrieve the same data. With a bigger split interval, fewer round-trips are needed to execute those queries, but fewer CPU cores are used.

Thus, it might make sense to reduce this interval to a very small value, especially if you are running in a data-center environment where network latency is tiny and there is no cost associated with retrieving data from remote object storage. Unfortunately, it is not so easy, because there are some small problems with that :/

  • Thanos at the moment doesn't support single-flight for index cache lookups. So, with a very small split interval, you might overwhelm your Memcached with a ton of GET commands because each split request will try to look up the same keys. This is especially apparent with compacted blocks, where time series typically span more than one day in a block;
  • Another issue is that it might lead to retrieving more data than is actually needed. A series in Thanos (and Prometheus) spans from some minimum timestamp to some maximum timestamp. During compaction, multiple series with identical labels are joined together in a new, compacted block. But if such a series has gaps, or we are interested only in part of its data (which is the case most of the time, in my humble opinion), then performing those time range checks is only possible after decoding the data. Thus, with a small split interval we might be forced to retrieve a bigger part of a chunk than is really needed. Again, this is more visible with compacted blocks because their series usually span bigger time ranges.

So, given all of this, what is the optimal split interval? It should be around our typical block size, but not too small, so as to avoid the issues mentioned above. The possible block sizes are defined here in Thanos. Typically, if the Thanos Compactor component is working, you will get blocks spanning 2 days once the raw blocks have been in remote object storage for around 2 days. But what about the even bigger blocks that are 14 days in length? I think that at that point downsampling starts playing a part and we should not really be concerned with the aforementioned problems anymore.

In conclusion, I think most of the time you will want the split interval to be either 24h or 48h: the former if your queries are mostly computation-heavy, so that you can use more CPU cores; the latter if your queries are retrieval-heavy, so that you can avoid over-fetching data and issuing more retrieve/store operations than needed.
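For reference, setting the split interval explicitly on query-frontend could look like this sketch; the downstream URL is a placeholder:

thanos query-frontend \
  --query-range.split-interval=24h \
  --query-frontend.downstream-url=http://localhost:10904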

Please let me know if you have spotted any problems or if you have any suggestions! I am always keen to hear from everyone!