What Is The Optimal Split Interval For query-frontend?

I recently ran into an interesting problem with the query-frontend component of Thanos. Here is how --query-range.split-interval is described in the --help output:

Part of the query-frontend --help output showing the purpose of the split interval

Computer Science is full of trade-offs, and this is one of them. Currently, the Prometheus engine executes each query in a single, separate goroutine, as I have described here some time ago, so one query by itself can only use one CPU core. Splitting lets you work around that limitation, because each resulting PromQL query only looks for samples in its own sub-range (unless you use the negative offset feature flag). So, with a lower split interval you can leverage more CPU power, but at the cost of more round-trips to retrieve the same data. With a higher split interval, fewer round-trips are needed to execute the query, but fewer CPU cores are used.
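To make the trade-off a bit more concrete, here is a rough sketch of what splitting does. This is a simplified illustration, not the actual query-frontend code: a long range query gets cut into sub-queries of at most the split interval, and each sub-query can then be executed in parallel downstream.

package main

import (
	"fmt"
	"time"
)

// splitRange is a simplified sketch (not the actual query-frontend code) of
// how a long range query is cut into sub-ranges of at most `interval` length.
func splitRange(start, end time.Time, interval time.Duration) [][2]time.Time {
	var parts [][2]time.Time
	for s := start; s.Before(end); s = s.Add(interval) {
		e := s.Add(interval)
		if e.After(end) {
			e = end
		}
		parts = append(parts, [2]time.Time{s, e})
	}
	return parts
}

func main() {
	end := time.Now()
	start := end.Add(-7 * 24 * time.Hour)

	// A 7-day query with a 24h split interval becomes 7 sub-queries that can
	// run in parallel; with a 48h interval it becomes 4, using fewer CPU
	// cores but needing fewer round-trips.
	fmt.Println(len(splitRange(start, end, 24*time.Hour))) // 7
	fmt.Println(len(splitRange(start, end, 48*time.Hour))) // 4
}

The real query-frontend logic is more involved than this, but the core trade-off is the same: more, smaller sub-queries versus fewer, bigger ones.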

Thus, it might make sense to reduce this interval to a very small value, especially if you are running in a data-center environment where network latency is very low and there is no cost associated with retrieving data from remote object storage. Unfortunately, it is not that easy, because there are a few problems with that :/

  • Thanos at the moment doesn’t support single-flight for index cache lookups (see the sketch after this list). So, with a very small split interval, you might kill your Memcached with a ton of GET commands, because each request will try to look up the same keys. This is especially apparent with compacted blocks, where time-series typically span more than one day in a block;
  • Another thing is that a small split interval can lead to retrieving more data than is actually needed. A series in Thanos (and Prometheus) spans from some minimum timestamp to some maximum timestamp, and during compaction multiple series with identical labels are joined together in a new, compacted block. But if the series has gaps, or we are only interested in part of its time range (which is the case most of the time, in my humble opinion), then such time-range checks can only be performed after decoding the data. Thus, with a small split interval we might be forced to retrieve a bigger part of the chunk than is really needed. Again, this is more visible with compacted blocks because their series usually span bigger time ranges.
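To illustrate the single-flight idea from the first point: concurrent lookups for the same key should share one in-flight request instead of each issuing its own GET. Here is a minimal sketch (not Thanos code) using the golang.org/x/sync/singleflight package; fetchFromMemcached is a hypothetical stand-in for a real cache client.

package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

// fetchFromMemcached is a hypothetical stand-in for a real Memcached GET.
func fetchFromMemcached(key string) ([]byte, error) {
	fmt.Println("actual GET for", key)
	return []byte("postings for " + key), nil
}

func main() {
	var group singleflight.Group
	var wg sync.WaitGroup

	// Concurrent lookups of the same index cache key share a single
	// in-flight call instead of each hammering Memcached separately.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _, _ := group.Do("postings:block-123", func() (interface{}, error) {
				return fetchFromMemcached("postings:block-123")
			})
			_ = v.([]byte)
		}()
	}
	wg.Wait()
}

Without something like this, a tiny split interval multiplies the number of requests that all race to fetch the very same cache keys.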

So, given all of this, what is the optimal split interval? It should be around our typical block size, but not so small that we run into the issues mentioned above. The possible block sizes are defined here in Thanos. Typically, if the Thanos Compactor component is working, you will get 2-day blocks once the data has been in remote object storage for around 2 days. But what about the even bigger blocks, which are 14 days in length? I think that at that point downsampling starts playing a part and we should not really be concerned with the aforementioned problems anymore.

In conclusion, I think most of the time you will want the split interval to be either 24h or 48h: the former if your queries are mostly computation-heavy, so that you can use more CPU cores; the latter if they are mostly retrieval-heavy, so that you can avoid over-fetching data and issuing more retrieve/store operations than needed.

Please let me know if you have spotted any problems or if you have any suggestions! I am always keen to hear from everyone!

Surprising Behavior With inotify And ioutil

How using ioutil.WriteFile() with inotify in tests might make them flaky

Recently I was working on fixing a flaky test of the “reloader” component in the Thanos project. It was quite a long-standing issue: it took almost a whole year to fix, which is not surprising, as it is quite tricky. But before zooming into the details, let’s talk about what this test does and what other systems come into play.

Simply put, Thanos Sidecar works as a sidecar component for Prometheus: it not only proxies requests to Prometheus and captures the blocks it produces and uploads them to remote object storage, but it can also automatically reload your Prometheus instance when certain configuration files change. For that, it uses the inotify mechanism of the Linux kernel. You can read more about inotify itself here. Long story short, it lets you watch files and get notified when something changes, e.g. when new data gets written to them.
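To give a feel for how this looks from Go, here is a minimal, self-contained sketch (not the reloader’s actual code) using the github.com/fsnotify/fsnotify library, which is backed by inotify on Linux; the watched path is just an example.

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// fsnotify is backed by inotify on Linux.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch a directory; events are delivered for the files inside it.
	if err := watcher.Add("/etc/prometheus"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			// A real reloader would debounce these events and then
			// trigger a reload of the watched Prometheus instance.
			log.Printf("%q: %s", event.Name, event.Op)
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}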

The test in question exercises that reloader component. It checks whether the reloader successfully sends the “reload” HTTP requests in response to certain simulated events and whether it properly retries failed requests. Before the fix, it emulated changed files with ioutil.WriteFile() calls. However, it sometimes happened during the tests that the number of HTTP calls received did not match the number expected. I then looked at the events that the watcher had received via inotify and, surprisingly enough, sometimes writes were missing or duplicated. Here is how it looked during two different runs:

"/tmp/TestReloader_DirectoriesApply249640168/001/rule2.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule2.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule-dir/rule4.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule-dir/rule4.yaml": WRITE

"/tmp/TestReloader_DirectoriesApply364923838/001/rule2.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule2.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule-dir/rule4.yaml": WRITE

You can see that in the first run there were two WRITE events for rule-dir/rule4.yaml, while in the second there was only one. Apparently, inotify is permitted to coalesce two or more identical events into a single one if the older event has not yet been read:

If successive output inotify events produced on the inotify file descriptor are identical (same wd, mask, cookie, and name), then they are coalesced into a single event if the older event has not yet been read (but see BUGS). This reduces the amount of kernel memory required for the event queue, but also means that an application can’t use inotify to reliably count file events.

https://man7.org/linux/man-pages/man7/inotify.7.html

Then I started looking into the code of the ioutil.WriteFile() function, because that is what we had been using to perform the writes, and inside of it I found this:

f, err := OpenFile(name, O_WRONLY|O_CREATE|O_TRUNC, perm)
if err != nil {
	return err
}
_, err = f.Write(data)

if err1 := f.Close(); err1 != nil && err == nil {
	err = err1
}

return err

And this is where the surprising behavior comes from: opening an existing file with O_TRUNC also counts as a modification:

IN_MODIFY (+)
          File was modified (e.g., write(2), truncate(2)).

Now this explains everything: because of the O_TRUNC open followed by the write, ioutil.WriteFile() can generate either one or two inotify events for the watcher, depending on how quickly the watcher reads them (if the first IN_MODIFY has not been read yet, the second identical one is coalesced into it). It is easy to avoid this issue: one simple way is to write the data into a temporary file created with ioutil.TempDir() and ioutil.TempFile(), and then move it into place with os.Rename(), so the update arrives as a single event (the rename) instead of a truncate followed by a separate write.
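Here is a minimal sketch of that fix, under the assumption that we just want the watcher to see a single event per update; atomicWrite is a hypothetical helper, not the actual Thanos test code.

package main

import (
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
)

// atomicWrite writes data to a temporary file in the same directory and then
// renames it over the destination, so the watched path gets one event for the
// whole update instead of a truncate followed by a separate write.
func atomicWrite(path string, data []byte, perm os.FileMode) error {
	// The temporary file must live on the same filesystem as the target,
	// otherwise os.Rename cannot replace the file atomically.
	tmp, err := ioutil.TempFile(filepath.Dir(path), ".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // Best-effort cleanup; harmless after a successful rename.

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	if err := atomicWrite("/tmp/rule1.yaml", []byte("groups: []\n"), 0644); err != nil {
		log.Fatal(err)
	}
}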