What Is The Optimal Split Interval For query-frontend?

I ran into an interesting problem lately with the query-frontend component of Thanos. Here is how --query-range.split-interval is described in the --help output:

[Figure: part of the query-frontend --help output showing the purpose of the split interval]

Computer Science is full of trade-offs and this is one of them. Currently, the Prometheus engine executes each query in a separate goroutine, as I have described here some time ago. Splitting lets you work around this single-goroutine-per-query limitation because each PromQL query only looks for samples in its given range (unless you use the negative offset feature flag). So, with a smaller split interval, a range request is divided into more sub-queries that can run in parallel, letting you leverage more CPU cores, but at the cost of more round-trips for retrieving the same data. With a larger split interval, fewer round-trips are needed to execute those queries, but fewer CPU cores are used.
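
To make the trade-off more concrete, here is a minimal Go sketch of the splitting idea. This is an illustration only, not code from Thanos: the timeRange type and the splitRange helper are made up for this example, and the real query-frontend handles more details (such as aligning sub-queries) than shown here.

package main

import (
	"fmt"
	"time"
)

// timeRange is a simplified [start, end) query range.
// It is an illustrative type, not one from the Thanos codebase.
type timeRange struct {
	start, end time.Time
}

// splitRange cuts one query range into sub-ranges of at most
// `interval` length, mirroring conceptually what
// --query-range.split-interval does: each sub-range can then be
// executed (and cached) independently downstream.
func splitRange(r timeRange, interval time.Duration) []timeRange {
	var out []timeRange
	for cur := r.start; cur.Before(r.end); cur = cur.Add(interval) {
		end := cur.Add(interval)
		if end.After(r.end) {
			end = r.end
		}
		out = append(out, timeRange{start: cur, end: end})
	}
	return out
}

func main() {
	start := time.Date(2021, 1, 1, 0, 0, 0, 0, time.UTC)
	// A 24h split turns this 3-day query into 3 sub-queries.
	query := timeRange{start: start, end: start.Add(72 * time.Hour)}
	for _, sub := range splitRange(query, 24*time.Hour) {
		fmt.Println(sub.start, "->", sub.end)
	}
}

Running this with a 48h interval instead would print only two sub-ranges, i.e. fewer round-trips but also fewer units of work to parallelize.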

Thus, it might make sense to reduce this interval to a very small value, especially if you are running in a data-center environment where network latency is very low and there is no cost associated with retrieving data from remote object storage. Unfortunately, it is not that easy because there are some problems with that :/

  • Thanos at the moment doesn’t support single-flight for index cache lookups (see the sketch after this list). So, with a very small split interval, you might kill your Memcached with a ton of GET commands because each sub-request will try to look up the same keys. This is especially apparent with compacted blocks, where a time series typically spans more than one day in a block;
  • Another thing is that it might lead to retrieving more data than is actually needed. A series in Thanos (and Prometheus) spans from some minimum timestamp to some maximum timestamp. During compaction, multiple series with identical labels are joined together in a new, compacted block. But if a series has gaps, or if we are interested in only a part of its data (which is the case most of the time, in my humble opinion), then such time range checks are only possible after decoding the data. Thus, with a small split interval we might be forced to retrieve a bigger part of a chunk than is really needed. Again, this is more visible with compacted blocks because their series usually span bigger time ranges.
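
To illustrate the single-flight point from the first bullet, here is a minimal Go sketch using the golang.org/x/sync/singleflight package. The cache key and the fetchFromMemcached helper are hypothetical, invented for this illustration; the point is only that concurrent lookups for the same key collapse into a single GET:

package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

// fetchFromMemcached stands in for a real index cache lookup over the
// network. It is a hypothetical helper, made up for this illustration.
func fetchFromMemcached(key string) string {
	time.Sleep(50 * time.Millisecond) // simulate a round-trip to Memcached
	fmt.Println("actual GET sent for", key)
	return "postings-for-" + key
}

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup

	// Ten concurrent sub-queries (e.g. produced by a tiny split
	// interval) all need the same index cache key. singleflight
	// collapses them: one caller does the work, the rest share it.
	const key = "index-cache:postings:up"
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _, shared := g.Do(key, func() (interface{}, error) {
				return fetchFromMemcached(key), nil
			})
			fmt.Printf("got %v (shared=%v)\n", v, shared)
		}()
	}
	wg.Wait()
}

With single-flight, the program above sends one GET no matter how many concurrent sub-queries ask for the same key; without it, all ten lookups would hit Memcached.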

So, given all of this, what’s the optimal split interval? It should be around our typical block size, but not too small, so as to avoid the mentioned issues. The possible block sizes are defined here in Thanos. Typically, if the Thanos Compactor component is working, you will get 2-day blocks after the data has been in remote object storage for around 2 days. But what about the even bigger blocks, which are 14 days in length? I think that at that point downsampling starts playing a part and we should not really be concerned with the aforementioned problems anymore.

In conclusion, I think most of the time you will want the split interval to be either 24h or 48h – the former if your queries are mostly computation-heavy, so that you can use more CPU cores; the latter if your queries are mostly retrieval-heavy, so that you can avoid over-fetching data and sending more retrieve/store operations than needed.
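
For example, assuming your query-frontend sits in front of a Thanos Querier listening on localhost:10902 (an address made up for this example), a 24h split interval could be configured like this:

thanos query-frontend \
  --query-frontend.downstream-url=http://localhost:10902 \
  --query-range.split-interval=24h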

Please let me know if you have spotted any problems or if you have any suggestions! I am always keen to hear from everyone!

How to Visualize Prometheus TSDB With Thanos

Have you ever tried to ls the Prometheus TSDB data directory just to understand what time ranges of data you have? Here is an example listing:

drwxrwxr-x  3 gstatkevicius gstatkevicius  4096 Jun 26  2020 01EBRP7DA9N67RTT9AFE0KMXFJ
drwxrwxr-x  3 gstatkevicius gstatkevicius  4096 Jun 26  2020 01EBRP7DCM68X5SPQGK3T8NNER
drwxrwxr-x  2 gstatkevicius gstatkevicius  4096 Jun 26  2020 chunks_head
-rw-r--r--  1 gstatkevicius gstatkevicius     0 Apr 17  2020 lock
-rw-r--r--  1 gstatkevicius gstatkevicius 20001 Jan  2 14:01 queries.active
drwxr-xr-x  3 gstatkevicius gstatkevicius  4096 Jan  2 14:01 wal

We could assume that each block is 2 hours long. However, blocks can be compacted, i.e. “joined” into bigger ones, if that is enabled. Then the blocks would not line up sequentially. Plus, this output of ls doesn’t even show the creation time of the different blocks. We could theoretically get those timestamps with stat(2), statx(2), and some shell magic, but that’s cumbersome. Overall, this doesn’t seem too helpful and we can do better.
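
For completeness, each block directory does contain a meta.json file with the block’s minTime and maxTime (Unix timestamps in milliseconds) and its compaction level, so a small script can produce a crude overview. Here is a minimal Go sketch of that, with the data directory path hard-coded for the example:

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// blockMeta holds the fields of a block's meta.json that we care
// about. MinTime/MaxTime are Unix timestamps in milliseconds.
type blockMeta struct {
	ULID       string `json:"ulid"`
	MinTime    int64  `json:"minTime"`
	MaxTime    int64  `json:"maxTime"`
	Compaction struct {
		Level int `json:"level"`
	} `json:"compaction"`
}

func main() {
	dataDir := "/prometheus/data" // assumed TSDB path for this example
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(dataDir, e.Name(), "meta.json"))
		if err != nil {
			continue // not a block directory (wal, chunks_head, lock, ...)
		}
		var m blockMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			continue
		}
		fmt.Printf("%s level=%d %s -> %s\n", m.ULID, m.Compaction.Level,
			time.UnixMilli(m.MinTime).UTC().Format(time.RFC3339),
			time.UnixMilli(m.MaxTime).UTC().Format(time.RFC3339))
	}
}

Still, this is manual work and plain-text output; a proper visualization is nicer.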

Enter Thanos. I will not go over what it is in detail in this article to keep it terse; suffice it to say that it extends Prometheus’ capabilities and lets you store TSDB blocks in remote object storage such as Amazon’s S3. Thanos has a component called the bucket block viewer. Even though it ordinarily works only with data in remote object storage, we can make it work with local data as well by using the FILESYSTEM storage type.

To use Thanos to visualize our local TSDB, we have to pass an object storage configuration file. If your Prometheus stores data in, let’s say, /prometheus/data then the --objstore.config-file needs to point to a file that has the following content:

type: FILESYSTEM
config:
  directory: "/prometheus/data"

To start up the web interface for the storage bucket, execute the following command:

thanos tools bucket web --objstore.config-file=my-config-file.yaml 

The web UI becomes available on http://localhost:10902. Here is what it looks like on a very simple TSDB (where all of the blocks are contiguous and of the same compaction level):

[Figure: Thanos Bucket Viewer showing local Prometheus TSDB blocks]

Feel free to use Thanos as a way to explore your Prometheus data! The support for viewing vanilla Prometheus blocks was merged on January 10th, 2021, so only versions released after this date have this functionality. Please mind this date and, if this does not work for you, update your Thanos.
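
If you are not sure which build you are running, the binary can print its version:

thanos --version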