Surprising Behavior With inotify And ioutil

How using ioutil.WriteFile() with inotify in tests might make them flaky

Recently I worked on fixing a flaky test of the “reloader” component in the Thanos project. It was a long-standing one: it took almost a whole year to fix. That is not surprising, as the cause is quite tricky. But before zooming into the details, let’s talk about what this test does and what other systems come into play.

Simply put, Thanos Sidecar runs as a sidecar component for Prometheus. It proxies requests to Prometheus, captures the blocks that Prometheus produces and uploads them to remote object storage, and it can also automatically reload your Prometheus instance when its configuration files change. For that, it uses the inotify mechanism of the Linux kernel. You can read more about inotify itself here. Long story short, it lets you watch files and get notifications when something changes, e.g. when new data gets written to them.

The test in question exercises that reloader component. It checks whether the reloader sends those “reload” HTTP requests in response to certain simulated events and whether it properly retries failed requests. Before the fix, the test emulated changed files with ioutil.WriteFile() calls. However, the number of HTTP calls received sometimes did not match the expected number. So I looked at the events that the watcher had received via inotify and, surprisingly enough, sometimes a write was missing or duplicated. Here is what the events looked like during two different runs:

"/tmp/TestReloader_DirectoriesApply249640168/001/rule2.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule2.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule-dir/rule4.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply249640168/001/rule-dir/rule4.yaml": WRITE

"/tmp/TestReloader_DirectoriesApply364923838/001/rule2.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule2.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule1.yaml": WRITE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule3.yaml": CREATE
"/tmp/TestReloader_DirectoriesApply364923838/001/rule-dir/rule4.yaml": WRITE

You can see that rule4.yaml got two WRITE events in the first run but only one in the second. Apparently, inotify is permitted to coalesce two or more write events into one if they happen consecutively “very fast”:

If successive output inotify events produced on the inotify file descriptor are identical (same wd, mask, cookie, and name), then they are coalesced into a single event if the older event has not yet been read (but see BUGS). This reduces the amount of kernel memory required for the event queue, but also means that an application can’t use inotify to reliably count file events.

https://man7.org/linux/man-pages/man7/inotify.7.html
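
To make the quoted rule concrete, here is a small Go model of the coalescing behavior. It is only a sketch of the rule as the man page states it, not kernel code: the event type and the coalesce function are hypothetical names, and the model assumes that the reader has not consumed anything from the queue yet.

```go
package main

import "fmt"

// event is a hypothetical, simplified stand-in for struct inotify_event.
type event struct {
	wd     int
	mask   uint32
	cookie uint32
	name   string
}

// coalesce models the rule from inotify(7): a new event is dropped when it is
// identical to the newest event still waiting in the queue. It assumes that
// nothing has been read from the queue while the events were produced.
func coalesce(produced []event) []event {
	var queue []event
	for _, ev := range produced {
		if n := len(queue); n > 0 && queue[n-1] == ev {
			continue // same wd, mask, cookie and name: merged into the older event
		}
		queue = append(queue, ev)
	}
	return queue
}

func main() {
	const inModify = 0x2 // value of IN_MODIFY on Linux
	writes := []event{
		{wd: 1, mask: inModify, name: "rule4.yaml"},
		{wd: 1, mask: inModify, name: "rule4.yaml"},
	}
	// Two identical, unread events end up as a single queued event.
	fmt.Println(len(coalesce(writes)))
}
```

If the watcher manages to read the first event before the second one is produced, nothing is merged and it sees both, which is why a test cannot rely on the exact number of events.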

Then I started looking into the code of ioutil.WriteFile(), because that is what we had been using to do the writes. Inside of it, I found this:

f, err := OpenFile(name, O_WRONLY|O_CREATE|O_TRUNC, perm)
if err != nil {
	return err
}
_, err = f.Write(data)

if err1 := f.Close(); err1 != nil && err == nil {
	err = err1
}

return err

And this is where the surprising behavior comes from – opening a file with O_TRUNC also counts as a write:

IN_MODIFY (+)
          File was modified (e.g., write(2), truncate(2)).

Now this explains everything: because of O_TRUNC and the write that follows, a single ioutil.WriteFile() call can generate either one or two inotify events, depending on how quickly the watcher reads them. It is easy to avoid this issue. One simple way is to create a temporary file with ioutil.TempDir() and ioutil.TempFile(), and then move it into place with os.Rename().

Running systemd Units Only If Certain CPU Features Are Available

systemd's logo (C) https://brand.systemd.io/

Do you have an application that should only be executed when the user’s CPU has a certain feature? Or maybe your application performs horribly unless certain instructions are available? As you might already know, the Linux kernel exposes this information via /proc/cpuinfo:

processor       : 15
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 96
model name      : AMD Ryzen 7 PRO 4750U with Radeon Graphics
...
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
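
For comparison, this is roughly how an application could check the flags line above by itself. It is a sketch that assumes the x86 layout of /proc/cpuinfo shown here (other architectures name the line differently), and hasCPUFlag is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasCPUFlag reports whether the first "flags" line of cpuinfo-formatted text
// lists the given flag as a whole word.
func hasCPUFlag(cpuinfo, flag string) bool {
	for _, line := range strings.Split(cpuinfo, "\n") {
		key, value, ok := strings.Cut(line, ":")
		if !ok || strings.TrimSpace(key) != "flags" {
			continue
		}
		for _, f := range strings.Fields(value) {
			if f == flag {
				return true
			}
		}
		// Only the first processor entry is inspected.
		return false
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	fmt.Println("sse4_2:", hasCPUFlag(string(data), "sse4_2"))
}
```

Doing this in every application is exactly the kind of boilerplate that a declarative condition in the unit file can replace.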

Where does this data come from? One source of information is the CPUID instruction. On the x86 architecture, systemd v248 and later can decide whether to start a unit based on the features exposed by the CPUID instruction. Enter ConditionCPUFeature. On CPUs that do not have the CPUID instruction, e.g. ARM CPUs, it is assumed that the CPU implements no features, and thus any CPUFeature condition will fail.

The feature strings that systemd understands can be found here. So, for example, to ensure that your service only runs when SSE 4.2 is available, add this to your unit file:

[Unit]
ConditionCPUFeature=sse4_2

To negate this condition, put an exclamation mark in front of the feature name. So, the same example but inverted would look like this:

[Unit]
ConditionCPUFeature=!sse4_2

This ensures that the unit only runs if SSE 4.2 is not available.

Finally, it is worth mentioning that this assumes a homogeneous system, i.e. one where all of the available CPU cores implement the same CPU features. In fact, the Intel CPU manuals and other manuals typically assume that this is the case because, as far as I can tell, it is true most of the time.

This could be improved in the future by checking the CPUID flags of all available CPUs. On Linux this could be implemented by reading /dev/cpu/*/cpuid, or by explicitly scheduling a process on each CPU and reading the CPU flags there. Or maybe systemd could even provide a feature where a unit is scheduled only on CPUs that provide one or more of the requested features. If you are interested in knowing more, this article here provides good information.

I liked working on this feature in my free time because I wanted to learn more about how these things work. I knew a bit about CPUID before, but I had never delved deep into it. Implementing this taught me a lot about how things like identifying a CPU’s features work.